Michael, I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend, unless your queries are sufficiently long that their language can be identified.
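Roughly what I have in mind, as a minimal sketch rather than a finished implementation. It assumes per-language fields named TITLE_ENG and TITLE_GER as in your message, plus a catch-all TITLE_OTHER field without a stemmer; the function name, the language codes and the TITLE_OTHER name are all made up for illustration:

```python
# Hypothetical mapping from a language code (known from a pulldown, the
# browser, or user settings) to the suffix of the per-language field.
FIELD_SUFFIX = {"en": "ENG", "de": "GER"}

# Assumed catch-all field with no stemmer, used when the language is
# unknown or has no dedicated field.
FALLBACK_SUFFIX = "OTHER"


def build_title_query(term, lang=None):
    """Return a query string that targets exactly one TITLE_* field."""
    suffix = FIELD_SUFFIX.get(lang, FALLBACK_SUFFIX)
    # Escape backslashes and double quotes, then wrap the term in quotes
    # so it stays inside this one field clause.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return 'TITLE_{}:"{}"'.format(suffix, escaped)


if __name__ == "__main__":
    print(build_title_query("die", "de"))
    print(build_title_query("die", "en"))
    print(build_title_query("wunderbar"))
```

The point is that the language is decided once, outside the query, so each search hits a single analyzed field instead of an OR across every TITLE_* field.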

Here is a handy tool for playing with language identification:
http://www.sematext.com/demo/lid/
You'll see how hard it is to guess the language of very short texts. :)

You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word "die" (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a "very language specific word", say "wunderbar", in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Michael Lackhoff <mich...@lackhoff.de>
> To: solr-user@lucene.apache.org
> Sent: Thursday, July 2, 2009 2:58:41 PM
> Subject: Preparing the ground for a real multilang index
>
> As pointed out in the recent thread about stemmers and other language
> specifics I should handle them all in their own right. But how?
>
> The first problem is how to know the language. Sometimes I have a
> language identifier within the record, sometimes I have more than one,
> sometimes I have none. How should I handle the non-obvious cases?
>
> Given I somehow know record1 is English and record2 is German. Then I
> need all my (relevant) fields for every language, e.g. I will have
> TITLE_ENG and TITLE_GER and both will have their respective stemmer. But
> what with exotic languages? Use a catch all "language" without a stemmer?
>
> Now a user searches for TITLE:term and I don't know beforehand the
> language of "term". Do I have to expand the query to something like
> "TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is there
> some sort of copyfield for analyzed fields? Then I could just copy all
> the TITLE_* fields to TITLE and don't bother with the language of the query.
>
> Are there any solutions that prevent an index with thousands of fields
> and dozens of ORed query terms?
>
> I know I will have to implement some better multilanguage support but
> would also like to keep it as simple as possible.
>
> -Michael