RE: Tika0.10 language identifier in Solr3.5.0

nibing Thu, 19 Jan 2012 22:51:56 -0800

Hi, Ted Dunning, 

Thank you for your reply. I understand your point about adding a "language_s" 
field and keeping all the files together, which speeds up searching.


But this raises a problem with choosing analyzers during indexing. I assume files 
written in different languages should be handled by different analyzers, i.e. 
different tokenizers and filters. Can you elaborate a little on the design 
that you propose, especially on how files in different languages can be 
handled during indexing?
Thank you. 
 
Best Regards

Ni, Bing  


> From: ted.dunn...@gmail.com
> Date: Fri, 20 Jan 2012 03:55:48 +0000
> Subject: Re: Tika0.10 language identifier in Solr3.5.0
> To: solr-user@lucene.apache.org
> 
> Normally this is done by putting a field on each document rather than
> separating the documents into separate corpora.  Keeping them together
> makes the final search faster.
> 
> At query time, you can add all of the language keys that you think are
> relevant based on your language id applied to the query and then group the
> results on those keys so that users can inspect different language results.
>  If you require the correct language key, you should get pretty good
> retrieval speed.
> 
> On Fri, Jan 20, 2012 at 3:35 AM, nibing <nibing_...@hotmail.com> wrote:
> 
> >
> > Hi, Jan Høydahl. You are right. I am hoping to detect the language of a
> > query, so that the searching can be done according to the language
> > detected. Since people often type only a few words, which is too few for
> > reliable detection, that is hard to do. Let me describe the Solr
> > server in my design. It consists of several cores, one per language,
> > which are built during indexing. Since language detection during
> > indexing can be done with the Tika identifier, we are currently OK there.
> > The problem is searching. I want to do language detection first,
> > before searching in the individual cores. In the case that the detection
> > result is ambiguous and several languages are returned, we would probably
> > return a set of results per language, and let the user decide which
> > language's results they want to look into. In general, it is much the
> > same as the language support in Google. Do you have any suggestions on
> > how to achieve the multilingual search described above?  Thank you.
> > Best Regards
> > Ni, Bing
> >
> > > Subject: Re: Tika0.10 language identifier in Solr3.5.0
> > > From: jan....@cominvent.com
> > > Date: Thu, 19 Jan 2012 12:31:01 +0100
> > > To: solr-user@lucene.apache.org
> > >
> > > Hi,
> > >
> > > You may use the string as you choose, for instance filtering
> > > (fq=language_s:en) or for faceting (facet.field=language_s). What are you
> > > looking to do?
> > >
> > > What would you like to detect on the query side? The language of the
> > > search string? That is very hard since people type very few words into the
> > > search box.
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > > Solr Training - www.solrtraining.com
> > >
> > > On 19. jan. 2012, at 09:22, nibing wrote:
> > >
> > > >
> > > > Hi, all,
> > > >
> > > >
> > > >
> > > > I am using Solr3.5.0 which applies Tika0.10 to do language detection,
> > > > and I have a couple of questions about this function.
> > > >
> > > >
> > > >
> > > > 1. I can see the outcome of the language detection in a field
> > > > "language_s". But what action is taken according to the detected
> > > > language code? How can this be configured?
> > > >
> > > >
> > > >
> > > > 2. Currently the language detection only happens in indexing. Is it
> > > > possible to use the function in searching as well? How to configure?
> > > >
> > > >
> > > >
> > > > Many thanks.
> > > >
> > > >
> > > > Best Regards
> > > >
> > > > Ni, Bing
> > > >
> > >
> >
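
The query-time approach Ted Dunning describes above (restrict to the
candidate language keys, then group results by language) could be sketched
roughly as follows in Python. This is only an illustration: the helper name
build_language_query and the example language codes are hypothetical, and
the group / group.field parameters assume a Solr version with result
grouping available. Only the "language_s" field and fq filtering come from
the thread itself.

```python
from urllib.parse import urlencode


def build_language_query(user_query, candidate_langs, field="language_s"):
    """Build Solr request parameters limited to the candidate languages.

    Hypothetical helper: candidate_langs is whatever the language
    identifier returned for the query string (possibly several codes).
    """
    params = [("q", user_query)]
    if candidate_langs:
        # One filter query that accepts any of the candidate language keys.
        langs = " OR ".join(candidate_langs)
        params.append(("fq", "%s:(%s)" % (field, langs)))
    # Group hits by language so each language's results can be shown apart
    # (assumes result grouping is supported by the Solr version in use).
    params.append(("group", "true"))
    params.append(("group.field", field))
    return params


params = build_language_query("apache tika", ["en", "de"])
query_string = urlencode(params)  # append to the core's /select URL
print(query_string)
```

With all documents in one index, an ambiguous detection result only widens
the filter rather than requiring a separate request per core.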
