Thank you for the replies, guys!

A field-per-language approach to multilingual content is the last thing I
would try, since my actual task is to implement search functionality that
offers roughly the same capabilities for every known world language.
The closest references are the popular web search engines: they seem to
serve worldwide users in all their different languages, and even handle
cross-language queries.
With over 200 known languages, a field-per-language approach would be a
sure waste of storage due to the sheer number of duplicated fields.
I really would like to keep a single field for cross-language searchable
text content, without splitting it into per-language fields or
per-language cores.

So my current choice is to stay with just the ICUTokenizer and
ICUFoldingFilter as they are, without any language-specific
stemmers/lemmatizers at all for now.
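
For reference, here is roughly what that chain looks like at the Lucene
level (a minimal sketch using Lucene's CustomAnalyzer with the standard
"icu"/"icuFolding" factory names; it assumes the lucene-analyzers-common
and lucene-analyzers-icu modules are on the classpath):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    public class BaseChain {
        // Script-neutral base chain: ICU tokenization plus ICU folding,
        // no language-specific stemming or stop words yet.
        public static Analyzer build() throws IOException {
            return CustomAnalyzer.builder()
                .withTokenizer("icu")          // ICUTokenizerFactory
                .addTokenFilter("icuFolding")  // ICUFoldingFilterFactory
                .build();
        }
    }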

I will probably put stop-word filters and stemmers for the most popular
languages into that same single searchable text field, to give it a try
and see whether it works correctly as a stack.
Does stacking language-specific filters in one field work correctly?
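
To make that concrete, here is a hedged sketch of such a stack (the
English/German filters and the test string are just example picks of mine;
each filter only sees the tokens the previous one emitted, which is
exactly the interference I am worried about):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StackedChain {
        public static void main(String[] args) throws IOException {
            // Several language-specific stages stacked on one field:
            // each filter sees only what the previous filter emitted.
            Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("icu")
                .addTokenFilter("icuFolding")
                .addTokenFilter("stop")                // default English stop set
                .addTokenFilter("englishMinimalStem")  // English stemming stage
                .addTokenFilter("germanLightStem")     // German stage runs on English output
                .build();

            // Print what the stack actually emits for a mixed-language string.
            try (TokenStream ts = analyzer.tokenStream("text", "Die Häuser and the houses")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term);
                }
                ts.end();
            }
        }
    }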

Further development will most likely involve advanced custom analyzers
like the "SimplePolyGlotStemmingTokenFilter", which utilizes the
ICU-generated ScriptAttribute.
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java
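
As I understand the idea (this is a hypothetical sketch with my own class
name, not the linked implementation): a TokenFilter placed after
ICUTokenizer can read each token's ScriptAttribute and route the token to
a script-appropriate stemmer, for example:

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishMinimalStemmer;
    import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
    import org.apache.lucene.analysis.ru.RussianLightStemmer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    import com.ibm.icu.lang.UScript;

    // Hypothetical sketch: route each token to a stemmer chosen by the
    // script that ICUTokenizer detected for it.
    public final class ScriptRoutingStemFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
        private final EnglishMinimalStemmer latinStemmer = new EnglishMinimalStemmer();
        private final RussianLightStemmer cyrillicStemmer = new RussianLightStemmer();

        public ScriptRoutingStemFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // ScriptAttribute is only populated when ICUTokenizer is upstream.
            switch (scriptAtt.getCode()) {
                case UScript.LATIN:
                    termAtt.setLength(latinStemmer.stem(termAtt.buffer(), termAtt.length()));
                    break;
                case UScript.CYRILLIC:
                    termAtt.setLength(cyrillicStemmer.stem(termAtt.buffer(), termAtt.length()));
                    break;
                default:
                    // Tokens in other scripts pass through unchanged.
            }
            return true;
        }
    }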

So I would like to know more about those "academic papers on this issue of
how best to deal with mixed language/mixed script queries and documents".
Tom, could you please share them?
