Re: Tika0.10 language identifier in Solr3.5.0

2012-01-23 Thread Ted Dunning
Jan's point that keeping separate fields can make some of the statistics more accurate is sound. The basic idea is that a common word in a rare language should be treated as a common word if you are working in that language. The simplest way to make that happen is by having a different field for
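
To illustrate the statistical point with a rough sketch: the toy corpus, the word choices, and the smoothed IDF formula below are assumptions made for illustration, not necessarily the formula Lucene uses. A word that appears in every document of a small language looks rare, and so gets a high weight, once its documents are mixed into a larger index.

```python
import math

def idf(term, docs):
    """Smoothed inverse document frequency of `term` over a list of token lists."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log((1 + len(docs)) / (1 + df)) + 1

# Hypothetical toy corpus: 8 English documents and 2 Norwegian ones.
english = [["the", "report", str(i)] for i in range(8)]
norwegian = [["og", "rapport"], ["og", "fjord"]]
mixed = english + norwegian

# Within its own language, "og" occurs in every document: lowest possible weight.
idf_own_field = idf("og", norwegian)

# In the mixed index, "og" occurs in only 2 of 10 documents, so it is weighted
# more heavily than "the", although both are equally common in their language.
idf_mixed_og = idf("og", mixed)
idf_mixed_the = idf("the", mixed)

print(idf_own_field, idf_mixed_og, idf_mixed_the)
```

With one field per language, each word is weighted against documents of its own language only, which is the behaviour described above.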

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-22 Thread Erick Erickson
Would "doing the right thing" include firing the results at different fields based on the language detected? Your answer to Jan seems to indicate not, in which case my original comments stand. The main point is that mixing all the *results* of the analysis chains for multiple languages into a singl

RE: Tika0.10 language identifier in Solr3.5.0

2012-01-22 Thread nibing
Hi, This is exactly what I hope you can elaborate on: an analyzer that detects the language and then analyzes accordingly. How would I do that? Thank you. Best Regards Ni, Bing > From: ted.dunn...@gmail.com > Date: Fri, 20 Jan 2012 09:15:30 -0800 > Subject: Re: Tika0.10 language iden

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
The TF-IDF argument is a reasonable one. On Fri, Jan 20, 2012 at 5:33 PM, Jan Høydahl wrote: > Another benefit with separate field per lang is that TF/IDF stats gets > correct for each individual language. > Also if you KNOW the query language, you can target THAT field alone, but > if you don't

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Jan Høydahl
Another benefit of a separate field per language is that the TF/IDF stats are correct for each individual language. Also if you KNOW the query language, you can target THAT field alone, but if you don't know, you can throw the query at multiple fields, which will each get proper analysis (at the risk o
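
A minimal sketch of that routing, under stated assumptions: the field names text_en, text_de, text_no and text_general and the list of supported languages are hypothetical, and the query side assumes the dismax parser with its qf parameter. Only language_s comes from the thread.

```python
SUPPORTED = ("en", "de", "no")  # hypothetical set of languages with their own field

def index_fields(text, lang):
    """Build the fields for one document, routing the text to a per-language field."""
    field = f"text_{lang}" if lang in SUPPORTED else "text_general"
    return {"language_s": lang, field: text}

def query_params(q, lang=None):
    """Target one field when the query language is known, otherwise all of them."""
    if lang in SUPPORTED:
        fields = [f"text_{lang}"]
    else:
        fields = [f"text_{code}" for code in SUPPORTED]
    return {"defType": "dismax", "q": q, "qf": " ".join(fields)}

print(index_fields("Hallo Welt", "de"))
print(query_params("haus", "de"))
print(query_params("haus"))
```

Each per-language field would be configured with its own analysis chain in the schema, so a query sent to several fields is analyzed once per language.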

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
I think you misunderstood what I am suggesting. I am suggesting an analyzer that detects the language and then "does the right thing" according to the language it finds. As such, it would tokenize and stem English according to English rules, German by German rules and would probably do a sliding

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Erick Erickson
forgetting something! :) >> >> Otis >> >> Performance Monitoring SaaS for Solr - >> http://sematext.com/spm/solr-performance-monitoring/index.html >> >> >> - Original Message - >> > From: nibing >> > To: solr-user@lucene.apache.org >

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
> http://sematext.com/spm/solr-performance-monitoring/index.html > > > - Original Message - > > From: nibing > > To: solr-user@lucene.apache.org > > Cc: > > Sent: Friday, January 20, 2012 1:51 AM > > Subject: RE: Tika0.10 language identifier in Solr3.5

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
Write a tokenizer that does language ID and then picks which tokenizer to use. Then record the language in the language id field. What is there to elaborate? On Fri, Jan 20, 2012 at 1:58 AM, nibing wrote: > But then there occurs a problem of using analyzer in indexing. I assume > files encoded
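
The idea can be sketched roughly as follows, in Python rather than as a real Lucene tokenizer. The stopword-overlap detector is a toy stand-in for Tika's language identifier, and all function names and word lists are hypothetical.

```python
import re

# Toy stopword lists used only for this illustration.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "de": {"der", "die", "und", "ist"},
}

def detect_language(text):
    """Pick the language whose stopwords overlap most with the text."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def tokenize_en(text):
    return re.findall(r"[a-z']+", text.lower())

def tokenize_de(text):
    return re.findall(r"[a-zäöüß]+", text.lower())

def tokenize_default(text):
    return text.lower().split()

TOKENIZERS = {"en": tokenize_en, "de": tokenize_de}

def analyze(text):
    """Detect the language, dispatch to its tokenizer, and record the language."""
    lang = detect_language(text)
    tokens = TOKENIZERS.get(lang, tokenize_default)(text)
    return {"language_s": lang, "tokens": tokens}

print(analyze("The cat is here"))
print(analyze("Der Hund ist groß"))
```

Text whose language cannot be determined falls through to a plain whitespace tokenizer and is recorded as "unknown".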

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Otis Gospodnetic
erformance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - > From: nibing > To: solr-user@lucene.apache.org > Cc: > Sent: Friday, January 20, 2012 1:51 AM > Subject: RE: Tika0.10 language identifier in Solr3.5.

RE: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread nibing
...@gmail.com > Date: Fri, 20 Jan 2012 03:55:48 + > Subject: Re: Tika0.10 language identifier in Solr3.5.0 > To: solr-user@lucene.apache.org > > Normally this is done by putting a field on each document rather than > separating the documents into separate corpora. Keeping

RE: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread nibing
s Ni, Bing > From: ted.dunn...@gmail.com > Date: Fri, 20 Jan 2012 03:55:48 + > Subject: Re: Tika0.10 language identifier in Solr3.5.0 > To: solr-user@lucene.apache.org > > Normally this is done by putting a field on each document rather than > separating the documents into

RE: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread nibing
...@gmail.com > Date: Fri, 20 Jan 2012 03:55:48 + > Subject: Re: Tika0.10 language identifier in Solr3.5.0 > To: solr-user@lucene.apache.org > > Normally this is done by putting a field on each document rather than > separating the documents into separate corpora. Keeping

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread Otis Gospodnetic
h > language set of results they want to look into. In general, it is just the > same > with the language supported by google. Do you have some suggestions if I want > to > achieve multilingual search described as above? Thank you. > Best Regards > Ni, Bing > >

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread Ted Dunning
Do you have some suggestions if I want to achieve > multilingual search described as above? Thank you. > Best Regards > Ni, Bing > > > Subject: Re: Tika0.10 language identifier in Solr3.5.0 > > From: jan@cominvent.com > > Date: Thu, 19 Jan 2012 12:31:01 +0100

RE: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread nibing
by google. Do you have some suggestions if I want to achieve multilingual search described as above? Thank you. Best Regards Ni, Bing > Subject: Re: Tika0.10 language identifier in Solr3.5.0 > From: jan@cominvent.com > Date: Thu, 19 Jan 2012 12:31:01 +0100 > T

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread Jan Høydahl
Hi, You may use the string as you choose, for instance filtering (fq=language_s:en) or for faceting (facet.field=language_s). What are you looking to do? What would you like to detect on the query side? The language of the search string? That is very hard since people type very few words into t
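
As a small sketch of the two uses mentioned above, the helper below builds a query string that filters on language_s and facets on it. The helper name and its arguments are hypothetical; fq, facet and facet.field are the standard Solr request parameters.

```python
from urllib.parse import urlencode

def language_query(q, lang=None, facet=True):
    """Build a Solr query string that filters and/or facets on language_s."""
    params = [("q", q)]
    if lang:
        # Restrict results to documents detected as this language.
        params.append(("fq", f"language_s:{lang}"))
    if facet:
        # Ask for document counts per detected language.
        params += [("facet", "true"), ("facet.field", "language_s")]
    return urlencode(params)

print(language_query("solr", lang="en"))
```

The resulting string would be appended to the select handler URL of whatever Solr instance is in use.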

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread Alessio Crisantemi
. --- -Messaggio originale- From: nibing Sent: Thursday, January 19, 2012 9:22 AM To: solr-user@lucene.apache.org Subject: Tika0.10 language identifier in Solr3.5.0 Hi, all, I am using Solr3.5.0 which applies Tika0.10 to do language detection, and

Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread nibing
Hi, all, I am using Solr3.5.0 which applies Tika0.10 to do language detection, and I have a couple of questions about this function. 1. I can see the outcome of the language detection in a field "language_s". But what action will be taken according to the different language code? How to c