Thank you for the replies, guys! A field-per-language approach to multilingual content is the last thing I would try, since my actual task is to implement search functionality that offers roughly the same capabilities for every known world language. The closest references are the popular web search engines: they seem to serve worldwide users in all their different languages, and even handle cross-language queries. With over 200 languages to cover, a field-per-language approach would waste storage on a high number of duplicated fields, so I would really like to keep a single field for cross-language searchable text content, without splitting it into per-language fields or per-language cores.
So my current choice is to stay with just the ICUTokenizer and ICUFoldingFilter as they are, without any language-specific stemmers/lemmatizers yet. I will probably then put stop-word filters and stemmers for the most popular languages into that same single searchable text field, to give it a try and see whether they work correctly in a stack (see the first sketch at the end of this message). Does stacking language-specific filters in one field work correctly?

Further development will most likely involve advanced custom analyzers like the "SimplePolyGlotStemmingTokenFilter", which utilizes the ScriptAttribute generated by the ICUTokenizer (see the second sketch below):
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java

Also, I would like to know more about those "academic papers on this issue of how best to deal with mixed language/mixed script queries and documents". Tom, could you please share them?
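To make the stacking question concrete, here is roughly the chain I have in mind, written as a plain Lucene analyzer (a minimal sketch assuming Lucene 5.x-era APIs; the class name, the inlined stop words and the choice of Snowball stemmers are mine, purely for illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.util.CharArraySet;

public final class StackedMultilingualAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Script-aware segmentation for all languages in one pass.
    Tokenizer source = new ICUTokenizer();
    // Case folding, accent removal, width normalization, etc.
    TokenStream stream = new ICUFoldingFilter(source);
    // Stop words of several popular languages merged into one set
    // (shortened here; real lists would be loaded from resource files).
    CharArraySet stopWords = StopFilter.makeStopSet("the", "and", "le", "la", "der", "und");
    stream = new StopFilter(stream, stopWords);
    // The stacked part: each stemmer is applied to EVERY token,
    // including tokens of other languages and already-stemmed tokens.
    stream = new SnowballFilter(stream, "English");
    stream = new SnowballFilter(stream, "French");
    return new TokenStreamComponents(source, stream);
  }
}

My concern is visible right in the chain: the French stemmer re-stems whatever the English stemmer already produced, and a merged stop list may silently drop tokens that are meaningful in another language.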
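And this is the direction I mean by utilizing the ScriptAttribute: a token filter that picks a stemmer per script, in the spirit of SimplePolyGlotStemmingTokenFilter (a hypothetical sketch, not a copy of that class; the name ScriptDispatchStemFilter and the script-to-stemmer mapping are mine, and since script alone cannot tell English from French, which are both Latin-script, this is only a coarse heuristic, not language identification):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.EnglishStemmer;
import org.tartarus.snowball.ext.RussianStemmer;

import com.ibm.icu.lang.UScript;

public final class ScriptDispatchStemFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  // Filled in per token by an upstream ICUTokenizer.
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  // Assumed mapping, for illustration only: Latin -> English, Cyrillic -> Russian.
  private final SnowballProgram latinStemmer = new EnglishStemmer();
  private final SnowballProgram cyrillicStemmer = new RussianStemmer();

  public ScriptDispatchStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    final SnowballProgram stemmer;
    switch (scriptAtt.getCode()) {
      case UScript.LATIN:    stemmer = latinStemmer;    break;
      case UScript.CYRILLIC: stemmer = cyrillicStemmer; break;
      default:               return true; // unknown script: pass the token through untouched
    }
    stemmer.setCurrent(termAtt.buffer(), termAtt.length());
    if (stemmer.stem()) {
      termAtt.copyBuffer(stemmer.getCurrentBuffer(), 0, stemmer.getCurrentBufferLength());
    }
    return true;
  }
}

This would replace the stacked SnowballFilters from the first sketch, so each token gets stemmed at most once, by the stemmer its script points to.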