Hi Rishi,

As others have indicated, multilingual search is very difficult to do well.
At HathiTrust we've been using the ICUTokenizer and ICUFoldingFilter to deal with having materials in 400 languages (a rough sketch of a field type along these lines is appended below the quoted thread). We also added the CJKBigramFilter to get better precision on CJK queries.

We don't use stop words, because stop words in one language are content words in another. For example, "die" is a stopword in German but a content word in English. Putting multiple languages in one index can also skew word-frequency statistics, which makes relevance ranking less accurate. For the English query "Die Hard", for instance, the word "die" would get a low idf score because it occurs so frequently in German.

We realize that our approach does not produce the best results, but given the 400 languages and limited resources, we do our best to make search "not suck" for non-English languages. When we have the resources, we are thinking about doing special processing for a small fraction of the top 20 languages. We plan to select the languages that most need special processing and that are relatively easy to disambiguate from the others.

If you plan on identifying languages (rather than scripts), you should be aware that most language detection libraries don't work well on short texts such as queries. If you know that you have scripts for which you have content in only one language, you can use script detection instead of language detection.

If you have German, a LengthFilter limit of 25 might be too low because of compounding. You might want to analyze a sample of your German text to find a good length.

Tom
http://www.hathitrust.org/blogs/Large-scale-Search

On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran <rishi.easwa...@aol.com> wrote:
> Hi Alex,
>
> Thanks for the suggestions. These steps will definitely help out with our
> use case.
> Thanks for the idea about the LengthFilter to protect our system.
>
> Thanks,
> Rishi.
>
> -----Original Message-----
> From: Alexandre Rafalovitch <arafa...@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Sent: Tue, Feb 24, 2015 8:50 am
> Subject: Re: Basic Multilingual search capability
>
> Given the limited needs, I would probably do something like this:
>
> 1) Put a language identifier in the UpdateRequestProcessor chain
> during indexing and route at least the known problematic languages,
> such as Chinese, Japanese, and Arabic, into individual fields
> 2) Put everything else together into one field with ICUTokenizer,
> maybe also ICUFoldingFilter
> 3) At the very end of that joint filter chain, stick in a LengthFilter with
> some high number, e.g. 25 characters max. This will ensure that
> super-long words from non-space languages and edge conditions do not
> break the rest of your system.
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
> On 23 February 2015 at 23:14, Walter Underwood <wun...@wunderwood.org>
> wrote:
>> I understand relevancy, stemming, etc. become extremely complicated with
>> multilingual support, but our first goal is to be able to tokenize and
>> provide basic search capability for any language. Ex: When the document
>> contains hello or здравствуйте, the analyzer creates tokens and provides
>> exact match search results.
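P.S. In case it helps, here is a rough sketch of a field type along the lines discussed above: ICUTokenizer plus CJKBigramFilter plus ICUFoldingFilter, with Alex's LengthFilter at the end and no stop words. This is illustrative only, not our actual production schema; the field type name and the length limit are placeholders, and the ICU components need the analysis-extras contrib jars on the classpath.

<fieldType name="text_multilingual" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization across scripts -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Index CJK text as overlapping bigrams for better precision;
         non-CJK tokens pass through unchanged. Kept right after the
         tokenizer so the token types it relies on are still intact. -->
    <filter class="solr.CJKBigramFilterFactory"/>
    <!-- Unicode normalization, case folding, and accent folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Guard against very long tokens from non-space-delimited text.
         25 is a placeholder; it may be too low for compounding
         languages such as German. -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
    <!-- No stop word filter: stop words in one language are content
         words in another (e.g. German "die"). -->
  </analyzer>
</fieldType>

And a similarly rough sketch of the language routing in step 1 of Alex's suggestion, using Solr's langid update processor. The field names, whitelist, and chain name here are made up; keep in mind that language detection works much better on full documents at index time than on short query strings.

<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- Detect the language of the body field -->
    <str name="langid.fl">body</str>
    <str name="langid.langField">language_s</str>
    <!-- Route known problematic languages into their own fields
         (body_zh, body_ja, body_ar); everything else falls back to
         body_general, which would use the field type above -->
    <str name="langid.map">true</str>
    <str name="langid.whitelist">zh,ja,ar</str>
    <str name="langid.fallback">general</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>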