Re: Implementing custom analyzer for multi-language stemming

TK Mon, 04 Aug 2014 21:11:26 -0700

On 7/30/14, 10:47 AM, Eugene wrote:

     Hello, fellow Solr and Lucene users and developers!


     In our project we receive text from users in different languages. We
detect the language automatically and use Google Translate APIs a lot (so
having an arbitrary number of languages in our system doesn't concern us).
However, we need to be able to search using stemming. Listing nearly a hundred
fields (several fields for each language, with language-specific
stemmers) in our search query is not an option. So we need a way to
have a single index which has stemmed tokens for different languages.


Do you mean a Tokenizer that switches among the supported languages
depending on the value of a "lang" field? I thought about this when I
started working on Solr/Lucene, and soon realized it is not possible because
of the way Lucene is designed: the Tokenizer in an analyzer chain cannot peek
at another field's value, and there is no way to control which field is
processed first.
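
To illustrate the constraint, here is a toy model in Python. It is not
Lucene code; the names analyze and index_document are made up for this
sketch. It only assumes what is described above: analysis is invoked once
per field, with that field's name and text and nothing else (in Lucene the
entry point is Analyzer.tokenStream(fieldName, reader)).

```python
# Toy model, not Lucene code: analyze() and index_document() are
# hypothetical names used only for illustration.

def analyze(field_name, text):
    """Stand-in for an analyzer chain.

    It receives one field's name and text. The rest of the document,
    including any "lang" field, is not passed in, so it cannot pick a
    language-specific stemmer from it.
    """
    return text.lower().split()


def index_document(doc):
    """Analyze each field independently of the others."""
    return {name: analyze(name, value) for name, value in doc.items()}


doc = {"lang": "de", "body": "Die Katzen schlafen"}
tokens = index_document(doc)
```

Inside analyze("body", ...) nothing tells the code that the document is
German, which is the situation a Tokenizer is in.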

If that's not what you are trying to achieve, could you tell us what
it is? If you have text in different languages in a single field, and
someone searches for a word common to many languages,
such as "sports" (or "Lucene" for that matter), Solr will return
documents in different languages, most of which the user
doesn't understand. Would that be useful? If you have
a special use case, would you like to share it?

--
Kuro
