Also, take a look at the Lucene Revolution talk "Typed Index": https://www.youtube.com/watch?v=X93DaRfi790
*Published on 25 Nov 2013*

Presented by Christoph Goller, Chief Scientist, IntraFind Software AG

If you want to search in a multilingual environment with high-quality, language-specific word normalization; if you want to handle mixed-language documents; if you want to add phonetic search for names; or if you need a semantic search that distinguishes between a search for the color "brown" and a person with the surname "brown", then in all these cases you have to deal with different types of terms. I will show why it makes much more sense to attach types (prefixes) to Lucene terms instead of relying on different fields or even different indexes for different kinds of terms. Furthermore, I will show what queries against such a typed index look like and why, for example, SpanQueries are needed to correctly handle compound words and phrases or to realize a reasonable phonetic search. The analyzers and the QueryParser described are available as plugins for Lucene, Solr, and Elasticsearch.

On 31 July 2014 00:34, Sujit Pal <sujit....@comcast.net> wrote:

> Hi Eugene,
>
> In a system we built a couple of years ago, we had a mixed corpus of English and French (with Spanish on the way, but that was implemented by the client after we handed off). We had different fields for each language: (title, body) for English docs was (title_en, body_en), for French (title_fr, body_fr), and for Spanish (title_es, body_es). Each of these was associated with a different Analyzer (attached to the field types in schema.xml; in the case of plain Lucene you can use PerFieldAnalyzerWrapper). Our pipeline used Google Translate to detect the language and write the contents into the appropriate field set for that language. Our analyzers were custom, but Lucene/Solr provides analyzer chains for many major languages.
> You can find a list here:
>
> https://wiki.apache.org/solr/LanguageAnalysis
>
> -sujit
>
> On Wed, Jul 30, 2014 at 10:52 AM, Chris Morley <ch...@depahelix.com> wrote:
>
> > I know BasisTech.com has a plugin for Elasticsearch that extends stemming/lemmatization to work across 40 natural languages. I'm not sure what they have for Solr, but I think something like that may exist as well.
> >
> > Cheers,
> > -Chris.
> >
> > ----------------------------------------
> > From: "Eugene" <beyondcomp...@gmail.com>
> > Sent: Wednesday, July 30, 2014 1:48 PM
> > To: solr-user@lucene.apache.org
> > Subject: Implementing custom analyzer for multi-language stemming
> >
> > Hello, fellow Solr and Lucene users and developers!
> >
> > In our project we receive text from users in different languages. We detect the language automatically and use the Google Translate APIs a lot (so having an arbitrary number of languages in our system doesn't concern us). However, we need to be able to search using stemming. Having nearly a hundred fields (several fields for each language, with language-specific stemmers) listed in our search query is not an option, so we need a way to have a single index that holds stemmed tokens for different languages. I have two questions:
> >
> > 1. Are there already (third-party) custom multi-language stemming analyzers? (I doubt that no one else has run into this issue.)
> >
> > 2. If I'm going to implement such an analyzer myself, could you please suggest a good way to 'pass' the detected language value into the analyzer? Detecting the language in the analyzer itself is not an option, because: a) we already detect it elsewhere; b) we detect it based on the combined values of many fields ('name', 'topic', 'description', etc.), while the current field can be too short for reliable detection; and c) sometimes we just want to specify the language explicitly.
> > The obvious hack would be to prepend the ISO 639-1 code to the field value, but I'd like to believe that Solr allows for a cleaner solution. I could think of either: a) a custom query parameter (but I guess that would require modifying request handlers, etc., which is highly undesirable); or b) getting the value from another field (we obviously have a 'language' field, and we do not have mixed-language records). If this is possible, could you please describe the mechanism for doing it or point me to relevant code examples?
> >
> > Thank you very much and have a good day!

--
Thanks & Regards
Umesh Prasad
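For what it's worth, here is a minimal plain-Java sketch (deliberately not Lucene analyzer API) of the "prepend the ISO 639-1 code" hack combined with Goller's typed-term idea: the indexing client stores something like "en|Running dogs", and a custom analyzer would split the code off, dispatch to a language-specific stemmer, and emit prefixed terms such as "en:runn" so stems from different languages never collide in a single field. The '|' delimiter, the toy stemming rules, and all class/method names are assumptions for illustration; a real implementation would plug in Lucene's stemmer chains instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the parse/dispatch logic a prefix-aware custom analyzer would
// perform. Names and rules are illustrative, not a real Lucene component.
public class PrefixedFieldDemo {

    // Splits "en|some field value" into the language code and the text.
    // The '|' delimiter after a two-letter code is an assumption of this demo.
    static String[] parse(String raw) {
        if (raw.length() < 3 || raw.charAt(2) != '|') {
            throw new IllegalArgumentException("expected '<iso-639-1>|text': " + raw);
        }
        return new String[] { raw.substring(0, 2).toLowerCase(Locale.ROOT), raw.substring(3) };
    }

    // Toy per-language stemming rules, standing in for the real stemmer
    // chain (e.g. a Snowball stemmer) the analyzer would dispatch to.
    static String stem(String lang, String token) {
        switch (lang) {
            case "en": return token.endsWith("ing") ? token.substring(0, token.length() - 3) : token;
            case "fr": return token.endsWith("s")   ? token.substring(0, token.length() - 1) : token;
            default:   return token;
        }
    }

    // Lowercases, tokenizes on whitespace, stems per language, and emits
    // typed terms ("en:runn") so one index field can hold all languages.
    static List<String> analyze(String raw) {
        String[] langAndText = parse(raw);
        List<String> terms = new ArrayList<>();
        for (String token : langAndText[1].toLowerCase(Locale.ROOT).split("\\s+")) {
            terms.add(langAndText[0] + ":" + stem(langAndText[0], token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("en|Running dogs"));  // [en:runn, en:dogs]
        System.out.println(analyze("fr|les chiens"));    // [fr:le, fr:chien]
    }
}
```

At query time the same transformation would be applied to query terms (which is why the typed-index talk argues for a matching QueryParser), so a search for "running" in English becomes a search for the term "en:runn" and cannot match a French stem that happens to collide.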