Preparing the ground for a real multilang index

Michael Lackhoff Thu, 02 Jul 2009 11:59:38 -0700

As pointed out in the recent thread about stemmers and other language
specifics, I should handle them all in their own right. But how?


The first problem is how to know the language. Sometimes I have a
language identifier within the record, sometimes I have more than one,
sometimes I have none. How should I handle the non-obvious cases?

Suppose I somehow know that record1 is English and record2 is German. Then I
need all my (relevant) fields for every language, e.g. I will have
TITLE_ENG and TITLE_GER, each with its respective stemmer. But
what about exotic languages? Use a catch-all "language" without a stemmer?
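
To make the per-language field idea concrete, here is a minimal sketch of
how a record's title could be routed to its field(s) at index time. It is
only an illustration under assumptions: the function name, the language
codes and the catch-all field TITLE_OTHER are hypothetical, and writing a
record with several language identifiers into every matching field is just
one possible policy.

```python
# Hypothetical mapping from language code to field suffix. Only the ENG and
# GER suffixes come from the message; the codes and "OTHER" are assumptions.
STEMMED_SUFFIXES = {"eng": "ENG", "ger": "GER"}
CATCH_ALL_SUFFIX = "OTHER"  # assumed unstemmed catch-all field


def title_fields(title, languages):
    """Return a dict of field name -> value for one record's title.

    `languages` is the (possibly empty) list of language identifiers
    found in the record.
    """
    # Languages without a dedicated stemmed field fall back to the catch-all.
    suffixes = {
        STEMMED_SUFFIXES.get(lang.lower(), CATCH_ALL_SUFFIX)
        for lang in languages
    }
    # A record with no language identifier at all also goes to the catch-all.
    if not suffixes:
        suffixes = {CATCH_ALL_SUFFIX}
    return {f"TITLE_{suffix}": title for suffix in sorted(suffixes)}
```

With this policy a German record ends up in TITLE_GER only, a record
tagged with both languages ends up in TITLE_ENG and TITLE_GER, and a
record with an unknown or missing language ends up in TITLE_OTHER.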

Now a user searches for TITLE:term and I don't know beforehand the
language of "term". Do I have to expand the query to something like
"TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is there
some sort of copyfield for analyzed fields? Then I could just copy all
the TITLE_* fields to TITLE and not bother with the language of the query.
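
The query expansion I have in mind would look roughly like the sketch
below. The function name and the list of suffixes are hypothetical, and
the term is inserted as-is, so a real version would still have to escape
query-syntax characters.

```python
def expand_query(base_field, term, suffixes):
    """Expand FIELD:term into an OR over the per-language variants of FIELD.

    `suffixes` lists the language suffixes that exist in the index
    (assumed to be known to the client, e.g. ["ENG", "GER"]).
    """
    # One clause per language-specific field, e.g. TITLE_ENG:term.
    clauses = [f"{base_field}_{suffix}:{term}" for suffix in suffixes]
    return " OR ".join(clauses)
```

Calling expand_query("TITLE", "term", ["ENG", "GER", "XY"]) gives
"TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term", so the number of
clauses grows with every language added to the index.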

Are there any solutions that prevent an index with thousands of fields
and dozens of ORed query terms?

I know I will have to implement some better multilanguage support but
would also like to keep it as simple as possible.

-Michael
