Re: Tika0.10 language identifier in Solr3.5.0

Ted Dunning Fri, 20 Jan 2012 09:16:35 -0800

I think you misunderstood what I am suggesting.

I am suggesting an analyzer that detects the language and then "does the
right thing" according to the language it finds. As such, it would
tokenize and stem English according to English rules and German by German
rules, and would probably use a sliding bigram window for Japanese and
Chinese.
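
To make the idea concrete, here is a minimal sketch in Python of an
analyzer that picks its analysis chain per document. Everything in it is
an illustrative assumption rather than Solr, Lucene or Tika code: the
stopword-based detect_language() stands in for a real language
identifier, the suffix strippers stand in for real stemmers, and the
function names are made up for this example.

```python
import re

# Hypothetical stopword hints. A real system would call a trained
# language identifier here instead of counting stopwords.
STOPWORD_HINTS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "das", "und", "ist"},
}

# Crude suffix lists standing in for real English and German stemmers.
SUFFIXES = {
    "en": ("ing", "es", "s"),
    "de": ("en", "er", "e"),
}


def is_cjk(ch):
    """True for CJK ideographs, Hiragana and Katakana."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"


def detect_language(text):
    """Toy detector: script check first, then stopword overlap."""
    if any(is_cjk(ch) for ch in text):
        return "cjk"
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & hints) for lang, hints in STOPWORD_HINTS.items()}
    # On a tie, max() returns the first key, so "en" is the fallback.
    return max(scores, key=scores.get)


def strip_suffix(word, suffixes):
    """Remove the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Detect the language, then run the matching analysis chain."""
    lang = detect_language(text)
    if lang == "cjk":
        # Sliding bigram window over the CJK characters only. This
        # simplification lets bigrams span punctuation and whitespace.
        chars = [ch for ch in text if is_cjk(ch)]
        tokens = ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)] or chars
    else:
        words = re.findall(r"\w+", text.lower())
        tokens = [strip_suffix(w, SUFFIXES[lang]) for w in words]
    return lang, tokens


if __name__ == "__main__":
    for sample in ("The dogs are running", "Die Katzen und der Hund", "東京都"):
        print(analyze(sample))
```

The point of the sketch is only the dispatch: one entry point, one
detection step, and a separate chain per language.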


On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Why not have a polyglot analyzer
>
> That could work, but it makes some compromises and assumes that your
> languages are "close enough". I have absolutely no clue how that would
> work for, say, English and Chinese.
>
> But it also introduces inconsistencies. Take stemming. Even though you
> could easily stem in the correct language, throwing all those stems
> into the same field can produce interesting results at search time since
> you run the risk of hitting something produced by one of the other
> analysis chains.
>
