Hi Ilia,

I don't know if it will be helpful, but below I've listed some academic papers on how best to deal with mixed-language/mixed-script queries and documents. They probably take a more complex approach than you will want to use, but perhaps they will help you think through the various ways of approaching the problem.
We haven't tackled this problem yet. We have over 200 languages. Currently we use the ICUTokenizer and ICUFoldingFilter but don't do any stemming, both because of a concern with overstemming (we already have very high recall, so we don't want to hurt precision by stemming) and because of the difficulty of correctly identifying the language of short queries.

If you have scripts where there is only one language per script, however, you might be able to do much more. I'm not sure I'm remembering correctly, but I believe some of the stemmers, such as the Greek stemmer, will pass through any strings that don't contain characters in the Greek script. So it might be possible to at least do stemming on some of your languages/scripts.

I'll be very interested to learn what approach you end up using.

Tom

------
Some papers:

Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and weighting of multilingual and mixed documents. In *Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA, 161-170. DOI=10.1145/2072221.2072240 http://doi.acm.org/10.1145/2072221.2072240

That paper and some others are here: http://www.husseinsspace.com/research/students/mohammedmustafaali.html

There is also some code from this article:

Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso. 2014. Query expansion for mixed-script information retrieval. In *Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval* (SIGIR '14). ACM, New York, NY, USA, 677-686. DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622
Code: http://users.dsic.upv.es/~pgupta/mixed-script-ir.html

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search

On Fri, Sep 5, 2014 at 10:06 AM, Ilia Sretenskii <sreten...@multivi.ru> wrote:
> Hello.
> We have documents with multilingual words which consist of parts in
> different languages, and search queries of the same complexity. It is a
> worldwide online application, so users generate content in all the
> possible world languages.
>
> For example:
> 言語-aware
> Løgismose-alike
> ຄໍາຮ້ອງສະຫມັກ-dependent
>
> So I guess our schema requires a single field with universal analyzers.
>
> Luckily, there exist the ICUTokenizer and ICUFoldingFilter for that.
>
> But then it requires stemming and lemmatization.
>
> How to implement a schema with universal stemming/lemmatization which
> would probably utilize the ICU-generated token script attribute?
>
> http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
>
> By the way, I have already examined the Basistech schema of their
> commercial plugins, and it defines the tokenizer/filter language per
> field type, which is not a universal solution for such complex
> multilingual texts.
>
> Please advise how to address this task.
>
> Sincerely, Ilia Sretenskii.
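P.S. In case it's useful as a starting point, a field type along the lines of what we use (ICU tokenization plus folding, no stemming) can be sketched in schema.xml roughly like this. This is only a sketch, not our exact production schema: the field type name is made up, and the ICU factories assume the analyzers-icu jars from contrib/analysis-extras are on your classpath.

```xml
<!-- Script-agnostic analysis: Unicode-aware tokenization + case/diacritic
     folding, deliberately no stemming to avoid overstemming across languages.
     Requires the ICU analysis jars (contrib/analysis-extras). -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizes per-script using Unicode segmentation rules -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Applies NFKC normalization, case folding, and accent removal -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

The same chain is used at index and query time, so mixed-script tokens like your "言語-aware" example are split and folded consistently on both sides.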