Jan's point that keeping separate fields per language makes some statistics
more accurate is sound.
The basic idea is that a common word in a rare language should be treated
as a common word if you are working in that language. The simplest way to
make that happen is by having a different field for
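A minimal sketch of the statistical point above, using the textbook idf = log(N / df) rather than Lucene's exact scoring formula. The corpus sizes and the document frequency are made-up numbers, and the example assumes the Danish word "og" does not occur in the English documents.

```python
import math

def idf(num_docs, doc_freq):
    """Classic inverse document frequency, log(N / df)."""
    return math.log(num_docs / doc_freq)

# Hypothetical corpus sizes, chosen only for illustration.
english_docs = 1_000_000
danish_docs = 1_000
og_doc_freq = 900  # "og" ("and" in Danish) occurs in most Danish documents

# One shared field: "og" looks rare because English documents dominate N.
mixed_idf = idf(english_docs + danish_docs, og_doc_freq)

# A dedicated Danish field: "og" is correctly seen as a common word.
danish_idf = idf(danish_docs, og_doc_freq)

print(f"shared field idf: {mixed_idf:.2f}")
print(f"Danish field idf: {danish_idf:.2f}")
```

In the shared field the common Danish word gets a high weight, as if it were a rare and informative term; in its own field the weight drops close to zero.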
Would "doing the right thing" include firing the results at different
fields based on the language detected? Your answer to Jan
seems to indicate not, in which case my original comments
stand. The main point is that mixing all the *results* of the
analysis chains for multiple languages into a singl
Hi, This is exactly what I hope you can elaborate on - an analyzer that detects
the language and then analyzes accordingly. How would I do that? Thank you.
Best Regards
Ni, Bing
> From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 09:15:30 -0800
> Subject: Re: Tika0.10 language iden
The TF-IDF argument is a reasonable one.
On Fri, Jan 20, 2012 at 5:33 PM, Jan Høydahl wrote:
> Another benefit of a separate field per language is that TF/IDF stats are
> correct for each individual language.
> Also if you KNOW the query language, you can target THAT field alone, but
> if you don't
Another benefit of a separate field per language is that TF/IDF stats are correct
for each individual language.
Also if you KNOW the query language, you can target THAT field alone, but if
you don't know, you can throw the query at multiple fields, which will each get
proper analysis (at the risk o
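The query-side idea above (one field when the language is known, several fields when it is not) could look roughly like this. The `text_<lang>` field names and the `build_query` helper are assumptions for illustration, not names from the thread, and the sketch does no escaping of query syntax characters.

```python
from typing import List, Optional

def build_query(terms: str, languages: List[str],
                known_lang: Optional[str] = None) -> str:
    """Build a Lucene-style query string against per-language fields.

    Field names follow a hypothetical text_<lang> convention.
    """
    if known_lang:
        # Query language is known: target that field alone.
        fields = [f"text_{known_lang}"]
    else:
        # Unknown: throw the query at every language field.
        fields = [f"text_{lang}" for lang in languages]
    return " OR ".join(f"{field}:({terms})" for field in fields)

print(build_query("haus", ["en", "de"], known_lang="de"))
print(build_query("haus", ["en", "de"]))
```

Each field applies its own analysis chain to the query terms, which is what makes the multi-field fallback workable.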
I think you misunderstood what I am suggesting.
I am suggesting an analyzer that detects the language and then "does the
right thing" according to the language it finds. As such, it would
tokenize and stem English according to English rules, German by German
rules and would probably do a sliding
forgetting something! :)
>>
>> Otis
>>
>> Performance Monitoring SaaS for Solr -
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>>
>> - Original Message -
>> > From: nibing
>> > To: solr-user@lucene.apache.org
>
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
> - Original Message -
> > From: nibing
> > To: solr-user@lucene.apache.org
> > Cc:
> > Sent: Friday, January 20, 2012 1:51 AM
> > Subject: RE: Tika0.10 language identifier in Solr3.5
Write a tokenizer that does language ID and then picks which tokenizer to
use. Then record the language in the language id field.
What is there to elaborate?
On Fri, Jan 20, 2012 at 1:58 AM, nibing wrote:
> But then a problem arises with using the analyzer during indexing. I assume
> files encoded
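Ted's suggestion above (identify the language, pick the tokenizer, record the language in a field) can be sketched in plain Python. This is a toy under stated assumptions: the stopword-overlap detector stands in for a real language identifier such as Tika's, the tokenizers only lowercase and drop stopwords, and none of the names below are Lucene or Solr APIs.

```python
import re

# Tiny illustrative stopword lists; a real detector would use far more signal.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
}

def words(text):
    return re.findall(r"\w+", text.lower())

def detect_language(text, default="en"):
    """Guess the language by counting stopword hits (toy heuristic)."""
    scores = {lang: sum(w in sw for w in words(text))
              for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

def make_tokenizer(lang):
    """Return a tokenizer that applies that language's (toy) rules."""
    def tokenize(text):
        return [w for w in words(text) if w not in STOPWORDS[lang]]
    return tokenize

TOKENIZERS = {lang: make_tokenizer(lang) for lang in STOPWORDS}

def analyze(text):
    """Detect the language, dispatch to its tokenizer, record the language."""
    lang = detect_language(text)
    return {"language_s": lang, "tokens": TOKENIZERS[lang](text)}

print(analyze("Das Haus ist alt und leer"))
print(analyze("The house is old"))
```

Note that this sketch returns all tokens in one list; whether they then go into one shared field or a field per language is the separate question debated in this thread.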
Performance Monitoring SaaS for Solr -
http://sematext.com/spm/solr-performance-monitoring/index.html
- Original Message -
> From: nibing
> To: solr-user@lucene.apache.org
> Cc:
> Sent: Friday, January 20, 2012 1:51 AM
> Subject: RE: Tika0.10 language identifier in Solr3.5.
...@gmail.com
> Date: Fri, 20 Jan 2012 03:55:48 +
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
>
> Normally this is done by putting a field on each document rather than
> separating the documents into separate corpora. Keeping
s
Ni, Bing
> From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 03:55:48 +
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
>
> Normally this is done by putting a field on each document rather than
> separating the documents into
h
> language set of results they want to look into. In general, it is just the same
> with the language supported by google. Do you have some suggestions if I want to
> achieve multilingual search described as above? Thank you.
> Best Regards
> Ni, Bing
>
>
Do you have some suggestions if I want to achieve
> multilingual search described as above? Thank you.
> Best Regards
> Ni, Bing
>
> > Subject: Re: Tika0.10 language identifier in Solr3.5.0
> > From: jan@cominvent.com
> > Date: Thu, 19 Jan 2012 12:31:01 +0100
by google. Do you have some suggestions if I want
to achieve multilingual search described as above? Thank you.
Best Regards
Ni, Bing
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> From: jan@cominvent.com
> Date: Thu, 19 Jan 2012 12:31:01 +0100
> T
Hi,
You may use the string as you choose, for instance filtering (fq=language_s:en)
or for faceting (facet.field=language_s). What are you looking to do?
What would you like to detect on the query side? The language of the search
string? That is very hard since people type very few words into t
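The filtering and faceting parameters mentioned above (`fq=language_s:en`, `facet.field=language_s`) can be assembled into a request URL as follows. The host, port and path are a hypothetical local Solr instance, and `language_query_url` is a helper made up for this sketch; only the `q`, `fq`, `facet` and `facet.field` parameter names are Solr's.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # hypothetical host and path

def language_query_url(query, language=None):
    """Build a select URL that facets on language_s and optionally filters by it."""
    params = [
        ("q", query),
        ("facet", "true"),               # faceting must be switched on explicitly
        ("facet.field", "language_s"),   # count documents per detected language
    ]
    if language:
        # Restrict results to documents detected as this language.
        params.append(("fq", f"language_s:{language}"))
    return f"{SOLR_SELECT}?{urlencode(params)}"

print(language_query_url("*:*"))
print(language_query_url("house", language="en"))
```

The facet counts give users the list of languages to choose from, and the chosen language is then sent back as the `fq` filter.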
.
---
-Original Message-
From: nibing
Sent: Thursday, January 19, 2012 9:22 AM
To: solr-user@lucene.apache.org
Subject: Tika0.10 language identifier in Solr3.5.0
Hi, all,
I am using Solr3.5.0 which applies Tika0.10 to do language detection,
and
Hi, all,
I am using Solr3.5.0 which applies Tika0.10 to do language detection,
and I have a couple of questions about this function.
1. I can see the outcome of the language detection in a field
"language_s". But what action will be taken according to the different
language code? How to c