RE: Tika0.10 language identifier in Solr3.5.0

nibing Sun, 22 Jan 2012 03:11:02 -0800

Hi, This is exactly what I hope you can elaborate on: an analyzer that detects
the language and then analyzes accordingly. How can that be done? Thank you.


Best Regards
Ni, Bing  

> From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 09:15:30 -0800
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
> 
> I think you misunderstood what I am suggesting.
> 
> I am suggesting an analyzer that detects the language and then "does the
> right thing" according to the language it finds.   As such, it would
> tokenize and stem English according to English rules, German by German
> rules and would probably do a sliding bigram window in Japanese and Chinese.
> 
> On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
> 
> > bq: Why not have a polyglot analyzer
> >
> > That could work, but it makes some compromises and assumes that your
> > languages are "close enough". I have absolutely no clue how that would
> > work for, say, English and Chinese.
> >
> > But it also introduces inconsistencies. Take stemming. Even though you
> > could easily stem in the correct language, throwing all those stems
> > into the same field can produce interesting results at search time since
> > you run the risk of hitting something produced by one of the other
> > analysis chains.
> >
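
The "detect the language, then analyze accordingly" idea discussed above can be sketched as follows. This is only a toy illustration in plain Python, not Solr, Lucene, or Tika code: the stopword lists, the suffix-stripping "stemmers", and the names detect_language, stem_en, stem_de, cjk_bigrams, and analyze are all hypothetical stand-ins for a real language identifier and real per-language analysis chains.

```python
import re

# Hypothetical stopword lists, used only by this toy detector.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}

# Hiragana, katakana, and common CJK ideographs.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")


def detect_language(text):
    """Toy detector: any CJK character wins, otherwise count stopword hits."""
    if CJK_RUN.search(text):
        return "cjk"
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def strip_suffix(word, suffixes):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_en(word):
    # Crude stand-in for an English stemmer.
    return strip_suffix(word, ("ing", "es", "s"))


def stem_de(word):
    # Crude stand-in for a German stemmer.
    return strip_suffix(word, ("en", "er", "e"))


def cjk_bigrams(text):
    """Sliding two-character window over each run of CJK characters."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) < 2:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def analyze(text):
    """Detect the language, then dispatch to the matching analysis chain."""
    lang = detect_language(text)
    if lang == "cjk":
        return lang, cjk_bigrams(text)
    words = re.findall(r"\w+", text.lower())
    if lang == "en":
        return lang, [stem_en(w) for w in words]
    if lang == "de":
        return lang, [stem_de(w) for w in words]
    return lang, words  # unknown language: tokenize only, no stemming


if __name__ == "__main__":
    for sample in ("The dies are in the press",
                   "Die Katze ist nicht im Haus",
                   "東京都に行く"):
        print(analyze(sample))
```

In this sketch the English input "dies" and the German article "die" both end up as the token "die", which is a small instance of the cross-language collision described above when every chain writes into the same field. A common way around it is to route the output of each chain to its own per-language field.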
