Hi Rishi,

As others have indicated, multilingual search is very difficult to do well.
At HathiTrust we've been using the ICUTokenizer and ICUFoldingFilter to deal with having materials in 400 languages (a rough sketch of a field type along these lines is appended below the quoted thread). We also added the CJKBigramFilter to get better precision on CJK queries.

We don't use stop words, because stop words in one language are content words in another. For example, "die" is a stopword in German but a content word in English. Putting multiple languages in one index can also skew word-frequency statistics, which makes relevance ranking less accurate. For the English query "Die Hard", for instance, the word "die" would get a low idf score because it occurs so frequently in German.

We realize that our approach does not produce the best results, but given the 400 languages and limited resources, we do our best to make search "not suck" for non-English languages. When we have the resources, we are thinking about doing special processing for a small fraction of the top 20 languages. We plan to select the languages that most need special processing and that are relatively easy to disambiguate from the others.

If you plan on identifying languages (rather than scripts), you should be aware that most language detection libraries don't work well on short texts such as queries. If you know that you have scripts for which you have content in only one language, you can use script detection instead of language detection.

If you have German, a LengthFilter limit of 25 might be too low because of compounding. You might want to analyze a sample of your German text to find a good length.

Tom
http://www.hathitrust.org/blogs/Large-scale-Search

On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran <rishi.easwa...@aol.com> wrote:
> Hi Alex,
>
> Thanks for the suggestions. These steps will definitely help out with our
> use case.
> Thanks for the idea about the LengthFilter to protect our system.
>
> Thanks,
> Rishi.
>
> -----Original Message-----
> From: Alexandre Rafalovitch <arafa...@gmail.com>
> To: solr-user <solr-user@lucene.apache.org>
> Sent: Tue, Feb 24, 2015 8:50 am
> Subject: Re: Basic Multilingual search capability
>
> Given the limited needs, I would probably do something like this:
>
> 1) Put a language identifier in the UpdateRequestProcessor chain
> during indexing and route at least the known problematic languages,
> such as Chinese, Japanese, and Arabic, into individual fields
> 2) Put everything else together into one field with ICUTokenizer,
> maybe also ICUFoldingFilter
> 3) At the very end of that joint filter chain, stick in a LengthFilter with
> some high number, e.g. 25 characters max. This will ensure that
> super-long words from non-space languages and edge conditions do not
> break the rest of your system.
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
> On 23 February 2015 at 23:14, Walter Underwood <wun...@wunderwood.org>
> wrote:
>> I understand relevancy, stemming, etc. become extremely complicated with
>> multilingual support, but our first goal is to be able to tokenize and
>> provide basic search capability for any language. Ex: When the document
>> contains hello or здравствуйте, the analyzer creates tokens and provides
>> exact match search results.
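P.S. In case it helps, here is a rough sketch of a field type along the lines discussed above: ICUTokenizer plus CJKBigramFilter plus ICUFoldingFilter, with Alex's LengthFilter at the end and no stop words. This is illustrative only, not our actual production schema; the field type name and the length limit are placeholders, and the ICU components need the analysis-extras contrib jars on the classpath.

<fieldType name="text_multilingual" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization across scripts -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Index CJK text as overlapping bigrams for better precision;
         non-CJK tokens pass through unchanged. Kept right after the
         tokenizer so the token types it relies on are still intact. -->
    <filter class="solr.CJKBigramFilterFactory"/>
    <!-- Unicode normalization, case folding, and accent folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Guard against very long tokens from non-space-delimited text.
         25 is a placeholder; it may be too low for compounding
         languages such as German. -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
    <!-- No stop word filter: stop words in one language are content
         words in another (e.g. German "die"). -->
  </analyzer>
</fieldType>

And a similarly rough sketch of the language routing in step 1 of Alex's suggestion, using Solr's langid update processor. The field names, whitelist, and chain name here are made up; keep in mind that language detection works much better on full documents at index time than on short query strings.

<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- Detect the language of the body field -->
    <str name="langid.fl">body</str>
    <str name="langid.langField">language_s</str>
    <!-- Route known problematic languages into their own fields
         (body_zh, body_ja, body_ar); everything else falls back to
         body_general, which would use the field type above -->
    <str name="langid.map">true</str>
    <str name="langid.whitelist">zh,ja,ar</str>
    <str name="langid.fallback">general</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>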