Hi,

This is exactly what I hoped you could elaborate on: an analyzer that detects the language and then analyzes the text accordingly. How would I do that? Thank you.

Best Regards,
Ni, Bing

> From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 09:15:30 -0800
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
>
> I think you misunderstood what I am suggesting.
>
> I am suggesting an analyzer that detects the language and then "does the
> right thing" according to the language it finds. As such, it would
> tokenize and stem English according to English rules, German by German
> rules and would probably do a sliding bigram window in Japanese and Chinese.
>
> On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
> > bq: Why not have a polyglot analyzer
> >
> > That could work, but it makes some compromises and assumes that your
> > languages are "close enough". I have absolutely no clue how that would
> > work for English and Chinese, say.
> >
> > But it also introduces inconsistencies. Take stemming. Even though you
> > could easily stem in the correct language, throwing all those stems
> > into the same field can produce interesting results at search time since
> > you run the risk of hitting something produced by one of the other
> > analysis chains.
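
For illustration only, here is a rough sketch of the "detect, then dispatch" idea from the quoted message, written in plain Python rather than against the Lucene or Solr APIs. Everything in it is an assumption made for the example: `detect_language` is a crude script check standing in for a real language identifier, `strip_suffix` is a toy stand-in for an English stemmer, and the function names are hypothetical. It returns the detected language together with the tokens, so the caller could route them to a per-language field and sidestep the mixed-stems concern raised above.

```python
import re


def detect_language(text):
    """Toy detector (assumption): any Han or kana character means 'cjk', else 'en'.

    A real setup would call an actual language identifier here.
    """
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff":
            return "cjk"
    return "en"


def strip_suffix(token):
    """Toy stand-in for an English stemmer: drops one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze_english(text):
    """Lowercase, split on non-letters, then apply the toy stemmer."""
    return [strip_suffix(t) for t in re.findall(r"[a-z]+", text.lower())]


def analyze_cjk(text):
    """Sliding bigram window over the non-space characters."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# One analysis chain per detected language.
ANALYZERS = {"en": analyze_english, "cjk": analyze_cjk}


def analyze(text):
    """Detect the language, then run the matching analysis chain."""
    lang = detect_language(text)
    return lang, ANALYZERS[lang](text)
```

Calling `analyze("Jumped dogs")` takes the English chain, while `analyze("東京都")` takes the bigram chain. A German chain would be one more entry in `ANALYZERS`, provided the detector can tell German from English, which this toy one cannot.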