Re: Implementing custom analyzer for multi-language stemming

Rich Cariens Tue, 05 Aug 2014 08:38:49 -0700

I've started a GitHub project to try out some cross-lingual analysis ideas (
https://github.com/whateverdood/cross-lingual-search). I haven't played
over there for about 3 months, but plan on restarting work there shortly.
In a nutshell, the interesting component
("SimplePolyGlotStemmingTokenFilter") relies on ICU4J ScriptAttributes:
each token is inspected for its script, e.g. "latin" or "arabic", and then
a "ScriptStemmer" recruits the appropriate stemmer to handle the token.
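
To make the dispatch idea concrete, here is a minimal standalone sketch. It is
not the code from the GitHub project: it uses the JDK's
Character.UnicodeScript as a stand-in for the ICU4J script information, and the
class name and the two toy "stemmers" are hypothetical placeholders for real
ones.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: pick a stemmer per token based on the token's script.
final class ScriptStemmerSketch {

    // Toy stemmers for illustration only; real ones would be Snowball, etc.
    private static final Map<UnicodeScript, UnaryOperator<String>> STEMMERS =
            new EnumMap<>(UnicodeScript.class);

    static {
        // Strip a trailing "s" from longer Latin-script tokens.
        STEMMERS.put(UnicodeScript.LATIN,
                t -> t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t);
        // Strip a leading definite article (alef + lam) from longer Arabic-script tokens.
        STEMMERS.put(UnicodeScript.ARABIC,
                t -> t.length() > 3 && t.startsWith("\u0627\u0644") ? t.substring(2) : t);
    }

    // Script of the first code point that is not COMMON, INHERITED or UNKNOWN.
    static UnicodeScript scriptOf(String token) {
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                return script;
            }
            i += Character.charCount(cp);
        }
        return UnicodeScript.COMMON;
    }

    // Tokens in a script with no registered stemmer pass through unchanged.
    static String stem(String token) {
        return STEMMERS.getOrDefault(scriptOf(token), UnaryOperator.identity()).apply(token);
    }
}
```

In a real analysis chain the same lookup would sit inside a TokenFilter's
incrementToken(), reading the script from the token's attributes instead of
recomputing it from the characters.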


Of course this is extremely primitive, but I think it would be
possible to write a CharFilter or TokenFilter that inspects the entire
TokenStream to guess the language(s), perhaps even noting where languages
change. Language and position information could be tracked, the TokenStream
rewound and then Tokens emitted with "LanguageAttributes" for downstream
Token stemmers to deal with.
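
A rough sketch of that two-pass idea, outside of Lucene and with plain lists
standing in for the TokenStream. Everything here is an assumption made for
illustration: the class and method names are made up, and guessLanguage() just
returns the script name as a placeholder for a real language detector (script
is of course not the same thing as language).

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: buffer all tokens, guess languages, then re-emit tagged tokens.
final class LanguageTaggingSketch {

    // A token plus the "language attribute" a downstream stemmer would read.
    static final class TaggedToken {
        final String term;
        final int position;
        final String language;

        TaggedToken(String term, int position, String language) {
            this.term = term;
            this.position = position;
            this.language = language;
        }
    }

    // Placeholder detector: lower-cased script name, or null when the token
    // carries no script information (digits, punctuation).
    static String guessLanguage(String token) {
        return token.codePoints()
                .mapToObj(UnicodeScript::of)
                .filter(s -> s != UnicodeScript.COMMON
                        && s != UnicodeScript.INHERITED
                        && s != UnicodeScript.UNKNOWN)
                .findFirst()
                .map(s -> s.name().toLowerCase(Locale.ROOT))
                .orElse(null);
    }

    static List<TaggedToken> tag(List<String> tokens) {
        // Pass 1: inspect every token and remember the first confident guess.
        String[] guesses = new String[tokens.size()];
        String firstKnown = null;
        for (int i = 0; i < guesses.length; i++) {
            guesses[i] = guessLanguage(tokens.get(i));
            if (firstKnown == null && guesses[i] != null) {
                firstKnown = guesses[i];
            }
        }

        // Pass 2 (the "rewind"): emit tokens; a token with no guess inherits
        // the language of the run it sits in.
        List<TaggedToken> out = new ArrayList<>();
        String current = firstKnown == null ? "unknown" : firstKnown;
        for (int i = 0; i < guesses.length; i++) {
            if (guesses[i] != null) {
                current = guesses[i]; // a language change starts at position i
            }
            out.add(new TaggedToken(tokens.get(i), i, current));
        }
        return out;
    }
}
```

For the tokens "2014", "search", an Arabic word, "42" this tags the first two
as "latin" and the last two as "arabic", which is the kind of context a
per-token check cannot provide.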

Or is that a crazy idea?


On Tue, Aug 5, 2014 at 12:10 AM, TK <kuros...@sonic.net> wrote:

> On 7/30/14, 10:47 AM, Eugene wrote:
>
>>      Hello, fellow Solr and Lucene users and developers!
>>
>>      In our project we receive text from users in different languages. We
>> detect language automatically and use Google Translate APIs a lot (so
>> having arbitrary number of languages in our system doesn't concern us).
>> However we need to be able to search using stemming. Having nearly hundred
>> of fields (several fields for each language with language-specific
>> stemmers) listed in our search query is not an option. So we need a way to
>> have a single index which has stemmed tokens for different languages.
>>
>
> Do you mean to have a Tokenizer that switches among supported languages
> depending on the "lang" field? This is something I thought about when I
> started working on Solr/Lucene and soon I realized it is not possible
> because of the way Lucene is designed; the Tokenizer in an analyzer chain
> cannot peek at another field's value, and there is no way to control which
> field is processed first.
>
> If that's not what you are trying to achieve, could you tell us what
> it is? If you have different language text in a single field, and if
> someone search for a word common to many languages,
> such as "sports" (or "Lucene" for that matter), Solr will return
> the documents of different languages, most of which the user
> doesn't understand. Would that be useful? If you have
> a special use case, would you like to share it?
>
> --
> Kuro
>
