Hi Eugene,

In a system we built a couple of years ago, we had a mixed corpus of English and French (with Spanish on the way, though that part was implemented by the client after we handed off). We used different fields for each language: the (title, body) pair for English docs was (title_en, body_en), for French (title_fr, body_fr), and for Spanish (title_es, body_es). Each field set was associated with a different analyzer via the field types in schema.xml (in plain Lucene you can use PerFieldAnalyzerWrapper). Our pipeline used Google Translate to detect the language and write the contents into the appropriate field set for that language. Our analyzers were custom, but Lucene/Solr provides analyzer chains for many major languages. You can find a list here:
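The routing step itself is trivial to do in the indexing client. Here's a minimal sketch in plain Java; the `route` helper and the field names are illustrative, not any Solr API, and the actual analysis is done by whatever analyzer schema.xml (or a PerFieldAnalyzerWrapper, in raw Lucene) ties to each suffixed field:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LanguageFieldRouter {

    // Map a detected ISO 639-1 code plus base fields to per-language
    // field names, e.g. ("fr", {title=..., body=...}) becomes
    // {title_fr=..., body_fr=...}. The suffixed fields are then sent
    // to Solr, where each field type carries its own analyzer chain.
    static Map<String, String> route(String langCode, Map<String, String> fields) {
        Map<String, String> routed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            routed.put(e.getKey() + "_" + langCode, e.getValue());
        }
        return routed;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Bonjour");
        doc.put("body", "Ceci est un test");
        System.out.println(route("fr", doc));
    }
}
```

With fields routed this way, query-time stemming falls out of the per-field analyzers; the only moving part in the client is picking the suffix from the detected language.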
https://wiki.apache.org/solr/LanguageAnalysis

-sujit

On Wed, Jul 30, 2014 at 10:52 AM, Chris Morley <ch...@depahelix.com> wrote:

> I know BasisTech.com has a plugin for Elasticsearch that extends
> stemming/lemmatization to work across 40 natural languages. I'm not sure
> what they have for Solr, but I think something like that may exist as
> well.
>
> Cheers,
> -Chris.
>
> ----------------------------------------
> From: "Eugene" <beyondcomp...@gmail.com>
> Sent: Wednesday, July 30, 2014 1:48 PM
> To: solr-user@lucene.apache.org
> Subject: Implementing custom analyzer for multi-language stemming
>
> Hello, fellow Solr and Lucene users and developers!
>
> In our project we receive text from users in different languages. We
> detect the language automatically and use the Google Translate APIs a
> lot (so having an arbitrary number of languages in our system doesn't
> concern us). However, we need to be able to search using stemming.
> Listing nearly a hundred fields (several fields for each language, with
> language-specific stemmers) in our search query is not an option, so we
> need a way to have a single index that holds stemmed tokens for
> different languages. I have two questions:
>
> 1. Are there already (third-party) custom multi-language stemming
> analyzers? (I doubt that no one else has run into this issue.)
>
> 2. If I'm going to implement such an analyzer myself, could you please
> suggest a good way to 'pass' the detected language value into the
> analyzer? Detecting the language in the analyzer itself is not an
> option, because: a) we already detect it elsewhere; b) we do it based on
> the combined values of many fields ('name', 'topic', 'description',
> etc.), while the current field can be too short for reliable detection;
> c) sometimes we just want to specify the language explicitly. The
> obvious hack would be to prepend the ISO 639-1 code to the field value,
> but I'd like to believe that Solr allows for a cleaner solution.
> I could think of either: a) a custom query parameter (but I guess that
> would require modifying request handlers, etc., which is highly
> undesirable); or b) getting the value from another field (we obviously
> have a 'language' field, and we do not have mixed-language records). If
> this is possible, could you please describe the mechanism for doing so,
> or point me to relevant code examples?
>
> Thank you very much and have a good day!
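For what it's worth, the "prepend the ISO 639-1 code" hack Eugene mentions can be sketched without any Lucene machinery: the indexing client prefixes each field value with the language code, and a custom analyzer would peel the prefix off and delegate the remainder to the matching language-specific chain. The names below (`withPrefix`, `splitLangPrefix`, the `|` separator) are purely illustrative, not part of any Solr or Lucene API:

```java
public class LangPrefixDemo {

    // Indexing side: prepend an ISO 639-1 code, e.g. "fr|ceci est un test".
    static String withPrefix(String lang, String text) {
        return lang + "|" + text;
    }

    // Analyzer side: peel off the 2-letter code; a dispatching analyzer
    // would then hand the remaining text to that language's stemmer chain.
    // Returns {langCode, text}; an empty code means "no prefix found".
    static String[] splitLangPrefix(String value) {
        int sep = value.indexOf('|');
        if (sep != 2) {                       // no 2-letter code present
            return new String[] {"", value};  // fall back to a default chain
        }
        return new String[] {value.substring(0, sep), value.substring(sep + 1)};
    }

    public static void main(String[] args) {
        String stored = withPrefix("fr", "ceci est un test");
        String[] parts = splitLangPrefix(stored);
        System.out.println(parts[0] + " -> " + parts[1]);
    }
}
```

The downside Eugene anticipates is real: the prefix leaks into stored values and highlighting unless it is stripped everywhere, which is why routing by a separate 'language' field (as in the per-language field-set approach above) tends to stay cleaner.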