Another benefit of a separate field per language is that the TF/IDF statistics come out correct for each individual language. Also, if you KNOW the query language, you can target THAT field alone; if you don't, you can throw the query at multiple fields, each of which gets proper analysis (at the risk of lower precision).
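As a quick illustration of the fan-out case, here is a hedged SolrJ sketch; the field names title_en and title_de are made up, and an edismax-capable Solr (3.1+) is assumed:

    import org.apache.solr.client.solrj.SolrQuery;

    public class MultiLangQuery {
      public static void main(String[] args) {
        // One field per language; edismax fans the query out across them.
        SolrQuery q = new SolrQuery("staking");
        q.set("defType", "edismax");
        q.set("qf", "title_en title_de");
        System.out.println(q); // roughly: q=staking&defType=edismax&qf=title_en+title_de
        // If you KNOW the query language, target that field alone:
        // q.set("qf", "title_de");
      }
    }

Each field then gets its own analysis chain at query time, which is exactly what a single shared field cannot give you.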
The only case where I would prefer having one single field for all languages is if my search app needs to support a large number of languages, such as a wide web crawl with 100 languages. The way FAST supported this was to do lemmatization by index expansion instead of by reduction or stemming (i.e. indexing a surface form like "ran" together with its lemma "run", rather than collapsing everything to a stem) - then you can easily support full linguistics for 100 languages, indexed in the same field.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 20. jan. 2012, at 18:15, Ted Dunning wrote:

> I think you misunderstood what I am suggesting.
>
> I am suggesting an analyzer that detects the language and then "does the
> right thing" according to the language it finds. As such, it would
> tokenize and stem English according to English rules, German by German
> rules, and would probably do a sliding bigram window for Japanese and Chinese.
>
> On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> bq: Why not have a polyglot analyzer
>>
>> That could work, but it makes some compromises and assumes that your
>> languages are "close enough". I have absolutely no clue how that would
>> work for English and Chinese, say.
>>
>> But it also introduces inconsistencies. Take stemming. Even though you
>> could easily stem in the correct language, throwing all those stems
>> into the same field can produce interesting results at search time, since
>> you run the risk of hitting something produced by one of the other
>> analysis chains.
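For anyone who wants to experiment with the analyzer Ted describes, here is a rough sketch against the Lucene 3.x Analyzer API of the day. The detectLanguage() helper is a placeholder (a real detector such as Tika's LanguageIdentifier could slot in), and the text has to be buffered first because detection would otherwise consume the Reader:

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.util.Version;

    public class LanguageSwitchingAnalyzer extends Analyzer {
      private final Analyzer english = new EnglishAnalyzer(Version.LUCENE_35);
      private final Analyzer german  = new GermanAnalyzer(Version.LUCENE_35);
      private final Analyzer cjk     = new CJKAnalyzer(Version.LUCENE_35); // CJK bigrams

      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        try {
          // Buffer the whole text: one pass for language detection,
          // another for the delegate analyzer.
          StringBuilder sb = new StringBuilder();
          char[] buf = new char[1024];
          int n;
          while ((n = reader.read(buf)) != -1) sb.append(buf, 0, n);
          String text = sb.toString();
          String lang = detectLanguage(text); // hypothetical helper
          Analyzer delegate =
              "de".equals(lang) ? german
            : ("ja".equals(lang) || "zh".equals(lang)) ? cjk
            : english;
          return delegate.tokenStream(fieldName, new StringReader(text));
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      private String detectLanguage(String text) {
        return "en"; // placeholder: plug in a real language detector here
      }
    }

Note that this still leaves Erick's objection standing: whatever the delegate produces lands in one shared field, with the term-statistics and cross-language collision issues discussed above.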