Re: Tika0.10 language identifier in Solr3.5.0

Otis Gospodnetic Fri, 20 Jan 2012 07:30:28 -0800

Ni, Bing

I believe you will need to pre-define a field for each language you want to 
handle and specify an appropriate language-specific analyzer for each of those 
fields.
This also means that if you encounter a new language, you will need to adjust 
your schema to support it.  Of course, your language identifier will 
also need to be trained to recognize that language.
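
As a rough illustration of the per-language-field idea, here is a minimal
Python sketch of how a client could route text into the right field before
sending a document to Solr. Only "language_s" comes from this thread; the
field names text_en, text_fr, text_de and text_general are hypothetical and
would each have to be declared in schema.xml with a suitable analyzer.

```python
# Hypothetical mapping from a detected language code to a schema field.
# Each field is assumed to be declared in schema.xml with its own analyzer.
LANGUAGE_FIELDS = {
    "en": "text_en",
    "fr": "text_fr",
    "de": "text_de",
}

# Assumed catch-all field for languages the schema does not cover yet.
FALLBACK_FIELD = "text_general"


def field_for_language(lang_code):
    """Return the name of the field that should hold text in this language."""
    return LANGUAGE_FIELDS.get(lang_code, FALLBACK_FIELD)


def build_document(doc_id, text, lang_code):
    """Build a dict holding the text in the field chosen for its language."""
    return {
        "id": doc_id,
        "language_s": lang_code,
        field_for_language(lang_code): text,
    }
```

With this layout, supporting a new language would mean adding one field to
the schema and one entry to the mapping.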


I know ElasticSearch allows you to specify a field analyzer at index time, 
which is very handy.  I don't think you can do that with Solr today, but maybe 
I'm forgetting something! :)

Otis 
----
Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html 


----- Original Message -----
> From: nibing <nibing_...@hotmail.com>
> To: solr-user@lucene.apache.org
> Cc: 
> Sent: Friday, January 20, 2012 1:51 AM
> Subject: RE: Tika0.10 language identifier in Solr3.5.0
> 
> 
> Hi, Ted Dunning, 
> 
> Thank you for your reply. I understand your point about adding a 
> "language_s" field and keeping all the files together, which 
> speeds up searching. 
> 
> But this raises a problem with choosing an analyzer during indexing. I assume 
> files in different languages should be handled by different analyzers, i.e. 
> different tokenizers and filters. Could you elaborate a little on the design 
> that you propose, especially on how files in different languages can be 
> handled during indexing?
> Thank you. 
> 
> Best Regards
> 
> Ni, Bing  
> 
> 
>>  From: ted.dunn...@gmail.com
>>  Date: Fri, 20 Jan 2012 03:55:48 +0000
>>  Subject: Re: Tika0.10 language identifier in Solr3.5.0
>>  To: solr-user@lucene.apache.org
>> 
>>  Normally this is done by putting a field on each document rather than
>>  separating the documents into separate corpora.  Keeping them together
>>  makes the final search faster.
>> 
>>  At query time, you can add all of the language keys that you think are
>>  relevant based on your language id applied to the query and then group the
>>  results on those keys so that users can inspect different language results.
>>   If you require the correct language key, you should get pretty good
>>  retrieval speed.
>> 
>>  On Fri, Jan 20, 2012 at 3:35 AM, nibing <nibing_...@hotmail.com> wrote:
>> 
>>  >
>>  > Hi, Jan Høydahl,
>>  >
>>  > You are right. I am hoping to detect the language of a query, so that
>>  > the search can be run according to the detected language. That is hard
>>  > to do because people often type only a few words, which is too little
>>  > text to detect from.
>>  >
>>  > Let me describe the Solr server in my design. It consists of several
>>  > cores, one per language, which are built during indexing. Since language
>>  > detection during indexing can be done with the Tika identifier, we are
>>  > currently OK on that side. The problem is searching. I want to detect
>>  > the language first, before searching the individual cores. If the
>>  > detection result is ambiguous and several languages are returned, we
>>  > would probably return a set of results for each and let the user decide
>>  > which language's results they want to look into. In general, this is the
>>  > same as the language support in Google.
>>  >
>>  > Do you have any suggestions for achieving the multilingual search
>>  > described above?  Thank you.
>>  > Best Regards
>>  > Ni, Bing
>>  >
>>  > > Subject: Re: Tika0.10 language identifier in Solr3.5.0
>>  > > From: jan....@cominvent.com
>>  > > Date: Thu, 19 Jan 2012 12:31:01 +0100
>>  > > To: solr-user@lucene.apache.org
>>  > >
>>  > > Hi,
>>  > >
>>  > > You may use the string as you choose, for instance filtering
>>  > > (fq=language_s:en) or for faceting (facet.field=language_s). What are
>>  > > you looking to do?
>>  > >
>>  > > What would you like to detect on the query side? The language of the
>>  > > search string? That is very hard since people type very few words into
>>  > > the search box.
>>  > >
>>  > > --
>>  > > Jan Høydahl, search solution architect
>>  > > Cominvent AS - www.cominvent.com
>>  > > Solr Training - www.solrtraining.com
>>  > >
>>  > > On 19. jan. 2012, at 09:22, nibing wrote:
>>  > >
>>  > > >
>>  > > > Hi, all,
>>  > > >
>>  > > >
>>  > > >
>>  > > > I am using Solr3.5.0 which applies Tika0.10 to do language detection,
>>  > > > and I have a couple of questions about this function.
>>  > > >
>>  > > >
>>  > > >
>>  > > > 1. I can see the outcome of the language detection in a field
>>  > > > "language_s". But what action will be taken based on the detected
>>  > > > language code? How do I configure this?
>>  > > >
>>  > > >
>>  > > >
>>  > > > 2. Currently the language detection only happens during indexing. Is
>>  > > > it possible to use the function in searching as well? How do I
>>  > > > configure this?
>>  > > >
>>  > > >
>>  > > >
>>  > > > Many thanks.
>>  > > >
>>  > > >
>>  > > > Best Regards
>>  > > >
>>  > > > Ni, Bing
>>  > > >
>>  > >
>>  >
>
