Hi Ilia,

I see that Trey answered your question about how you might stack language-specific filters in one field. If I remember correctly, his approach assumes you have already identified the language of the query. Identifying the language of a query is not the same as detecting its script, and it is much harder.
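As a purely illustrative aside (the guess_script helper and the sample strings below are my own sketch, not anything we run in production): in Python you can read the script of a query more or less directly off its characters, but nothing in the characters tells you the language of a Latin-script query.

    # Rough sketch: detecting script is easy, identifying language is not.
    # A real system would use ICU's script property (e.g. the ScriptAttribute
    # that the ICUTokenizer exposes); the first word of a Unicode character
    # name is used here only as a crude stand-in for the script.
    import unicodedata
    from collections import Counter

    def guess_script(text):
        scripts = Counter()
        for ch in text:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                if name:
                    scripts[name.split()[0]] += 1  # "LATIN", "CYRILLIC", "ARABIC", ...
        return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"

    print(guess_script("Gift"))     # LATIN
    print(guess_script("подарок"))  # CYRILLIC
    print(guess_script("هدية"))     # ARABIC

    # Script detection cannot tell you that "Gift" is an English query in one
    # context and a German query ("poison") in another -- that is the language
    # identification problem, and on short strings it is much harder.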
Trying to do language-specific processing for many languages, especially a large number such as the 200 you mention or the 400 in HathiTrust, is a very difficult problem. Detecting the language (rather than the script) of short queries is an open problem in the research literature. As others have suggested, you might want to start with something less ambitious that meets most of your business needs. You also might want to consider whether the errors a stemmer will make on some queries are worth the increase in recall you will get on others. Concern about returning results that confuse users is one of the main reasons we haven't seriously pursued stemming in HathiTrust full-text search.

Regarding the papers listed in my previous e-mail: you can get the first paper at the link I gave, and the second paper (although on re-reading it, I don't think it will be very useful) is available if you go to the link for the code and follow the link on that page for the paper.

I suspect you might want to think about the differences between scripts and languages. Most of the Solr/Lucene stemmers either assume you are only giving them the language they are designed for, or work on the basis of script. This works well when there is only one language per script, but it breaks down when many languages share the same script, as the many languages written in the Latin script do.

(The Solr-user spam filter and my e-mail client sometimes mangle URLs, so all of the URLs below are also collected, with context, in this gist: https://gist.github.com/anonymous/2e1233d80f37354001a3)

That PolyGlotStemming filter uses the ICUTokenizer's script identification, but there are at least 12 different languages that use the Arabic script (http://www.omniglot.com/writing/arabic.htm) and over 100 that use the Latin script. Please see the list of languages and scripts at http://aspell.net/man-html/Supported.html#Supported or http://www.omniglot.com/writing/langalph.htm#latin

As a simple example, German and English both use the Latin-1 character set. Using an English stemmer for German, or a German stemmer for English, is unlikely to work very well. If you try to use stop words for multiple languages you will run into cases where a stop word in one language is a content word in another. For example, if you use German stop words such as "die", you will eliminate the English content word "die".

Identifying the language of short texts such as queries is a hard problem. About half the papers looking at query language identification cheat and use signals such as the language of the pages a user has clicked on. If all you have is the text of the query, language identification is very difficult, and I suspect that mixed-script queries are even harder (see http://www.transacl.org/wp-content/uploads/2014/02/38.pdf).

See the papers by Marco Lui and Tim Baldwin on Marco's web page: http://ww2.cs.mu.oz.au/~mlui/ In "Language Identification: the Long and the Short of the Matter" (http://www.aclweb.org/anthology/N10-1027) they explain why short-text language identification is a hard problem. Other papers on Marco's page describe the design and implementation of langid.py, a state-of-the-art language identification program.

I've tried several language guessers designed for short texts, and at least on queries from our query logs, the results were unusable.
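For concreteness, trying one of these off-the-shelf guessers on short, query-like strings looks roughly like the sketch below. This assumes langid.py is installed; the sample queries and the restricted language set are just examples of mine, while classify() and set_languages() are the calls documented in the langid.py README.

    # Sketch of running langid.py on short, query-like strings.
    import langid

    queries = ["die hard", "gift shop", "la dolce vita"]

    for q in queries:
        lang, score = langid.classify(q)  # (ISO 639-1 code, model score)
        print(q, "->", lang, score)

    # Constraining the candidate languages to the ones you actually expect
    # can help somewhat on short strings, but as noted above it was not
    # enough to make the results usable on our query logs.
    langid.set_languages(["en", "de", "fr", "es", "it"])
    for q in queries:
        print(q, "->", langid.classify(q))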
Both langid.py with the defaults (noEnglish.langid.gz, pipe-delimited) and ldig with the most recent latin.model (NonEnglish.ldig.gz, tab-delimited) did not work well for our queries. However, both of them have parameters that can be tweaked, and both have facilities for training if you have labeled data.

ldig is specifically designed to run on short text such as queries or tweets. It can be configured to output the scores for every language instead of only the highest score (the default). Also, we didn't try to limit the list of languages it looks for, and that might give better results. https://github.com/shuyo/ldig

langdetect looks like it's by the same programmer and is in Java, but I haven't tried it: https://code.google.com/p/language-detection/

langid.py is designed by linguistic experts, but it may need to be trained on short queries: https://github.com/saffsd/langid.py

There is also Mike McCandless' port of the Google Compact Language Detector:
http://blog.mikemccandless.com/2013/08/a-new-version-of-compact-language.html
https://code.google.com/p/chromium-compact-language-detector/source/browse/README
However, note this comment there: "Indeed I see the same results as you; I think CLD2 is just not designed for short text." A similar comment was made in this talk: http://videolectures.net/russir2012_grigoriev_language/

If you aren't worried about false drops, your documents are relatively short, and your use case favors recall over precision, you might want to look at McNamee and Mayfield's work on language-independent stemming. I don't know whether their n-gram approach would be feasible for your use case, but they also got good results on TREC/CLEF newswire datasets by simply truncating words. We can't use their approach because we already have a high-recall/low-precision situation and because our documents are several orders of magnitude larger than the TREC/CLEF/FIRE newswire articles they tested with.

Paul McNamee, Charles Nicholas, and James Mayfield. 2009. Addressing morphological variation in alphabetic languages. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09). ACM, New York, NY, USA, 75-82. DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957

Paul McNamee, Charles Nicholas, and James Mayfield. 2008. Don't have a stemmer?: be un+concern+ed. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '08). ACM, New York, NY, USA, 813-814. DOI=10.1145/1390334.1390518 http://doi.acm.org/10.1145/1390334.1390518

I hope this helps.

Tom

On Mon, Sep 8, 2014 at 1:33 AM, Ilia Sretenskii <sreten...@multivi.ru> wrote:
> Thank you for the replies, guys!
>
> Using a field-per-language approach for multilingual content is the last
> thing I would try, since my actual task is to implement search
> functionality that provides relatively the same possibilities for every
> known world language.
> The closest references are the popular web search engines; they seem to
> serve worldwide users with their different languages and even
> cross-language queries as well.
> Thus, a field-per-language approach would be a sure waste of storage
> resources due to the high number of duplicates, since there are over 200
> known languages.
> I really would like to keep a single field for cross-language searchable
> text content, without splitting it into specific language fields or
> specific language cores.
>
> So my current choice will be to stay with just the ICUTokenizer and
> ICUFoldingFilter as they are, without any language-specific
> stemmers/lemmatizers at all yet.
>
> Probably I will put the most popular languages' stop word filters and
> stemmers into the same searchable text field to give it a try and see
> if it works correctly in a stack.
> Does stacking language-specific filters work correctly in one field?
>
> Further development will most likely involve some advanced custom
> analyzers like the "SimplePolyGlotStemmingTokenFilter" to utilize the
> ICU-generated ScriptAttribute.
>
> So I would like to know more about those "academic papers on this issue of
> how best to deal with mixed language/mixed script queries and documents".
> Tom, could you please share them?