Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain during
indexing and route at least the known problematic languages, such as
Chinese, Japanese, and Arabic, into individual per-language fields.
2) Put everything else together into one field analyzed with
ICUTokenizer, and probably ICUFoldingFilter as well.
3) At the very end of that shared analyzer chain, add a LengthFilter
with some generous maximum, e.g. 25 characters. This ensures that
super-long tokens from languages without word separators, and other
edge conditions, do not break the rest of your system. A rough
configuration sketch follows below.
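For concreteness, here is roughly what that could look like. This is a
sketch, not drop-in config: the source field "text", the whitelist, and
the "text_icu" type name are illustrative, and both the langid processor
(langid contrib) and the ICU classes (analysis-extras contrib) need
their jars on the classpath.

In solrconfig.xml:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Detect language from the "text" field and store the code. -->
      <str name="langid.fl">text</str>
      <str name="langid.langField">language</str>
      <!-- Map "text" to a per-language field, e.g. text_zh, text_ja,
           text_ar; anything not on the whitelist falls back to
           "general", i.e. the shared text_general field. -->
      <bool name="langid.map">true</bool>
      <str name="langid.whitelist">zh,ja,ar</str>
      <str name="langid.fallback">general</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

And in schema.xml, the shared type for everything that is not routed to
a per-language field (steps 2 and 3):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Drop anything longer than 25 chars so runaway tokens from
         non-space languages cannot blow up the index. -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
  </analyzer>
</fieldType>

<field name="text_general" type="text_icu" indexed="true" stored="true"/>

You would still declare the mapped fields (text_zh, text_ja, text_ar)
with language-specific analyzers; only the catch-all text_general uses
the ICU type above.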
Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 February 2015 at 23:14, Walter Underwood <wun...@wunderwood.org> wrote:
> I understand relevancy, stemming etc becomes extremely complicated with
> multilingual support, but our first goal is to be able to tokenize and
> provide basic search capability for any language. Ex: When the document
> contains hello or здравствуйте, the analyzer creates tokens and provides
> exact match search results.