As pointed out in the recent thread about stemmers and other language specifics, I should handle each language in its own right. But how?
The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases?

Suppose I somehow know that record1 is English and record2 is German. Then I need all my (relevant) fields once per language, e.g. I will have TITLE_ENG and TITLE_GER, each with its respective stemmer. But what about exotic languages? Should I use a catch-all "language" without a stemmer?

Now a user searches for TITLE:term and I don't know the language of "term" beforehand. Do I have to expand the query to something like "TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ...", or is there some sort of copyField for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query.

Are there any solutions that avoid an index with thousands of fields and queries with dozens of ORed terms? I know I will have to implement some better multilanguage support, but I would also like to keep it as simple as possible.

-Michael
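
P.S. To make the query expansion above concrete, here is a minimal sketch in Python of what I have in mind. The function name expand_query and the OTHER suffix for a catch-all field without a stemmer are made up for illustration. The code only builds the query string; it does not talk to any search server and does no escaping of the term.

```python
def expand_query(field, term, languages):
    """Build an OR query over the per-language variants of one field."""
    # One clause per language-specific field, e.g. TITLE_ENG:term
    clauses = [f"{field}_{lang}:{term}" for lang in languages]
    return " OR ".join(clauses)


# Assumed language suffixes; "OTHER" is a hypothetical catch-all
# field for languages that have no stemmer of their own.
LANGUAGES = ["ENG", "GER", "OTHER"]

if __name__ == "__main__":
    print(expand_query("TITLE", "term", LANGUAGES))
```

With only a handful of languages this stays manageable, but every additional language adds one clause per searched field, which is exactly the growth I would like to avoid.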