Re: Tika0.10 language identifier in Solr3.5.0

Ted Dunning Fri, 20 Jan 2012 08:20:14 -0800

Otis,

Can you say why there needs to be a field per language?  Why not have a
polyglot analyzer?


On Fri, Jan 20, 2012 at 7:29 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Ni, Bing
>
> I believe you will need to pre-define fields for all languages you want to
> handle and specify an appropriate language-specific analyzer for each of
> those fields.
> This also means that if you encounter a new language, you will need to
> adjust your schema to support a new language.  Of course, your Lang
> Identifier will also need to be trained to recognize that language.
>
> I know ElasticSearch allows you to specify a field analyzer at index time,
> which is very handy.  I don't think you can do that with Solr today, but
> maybe I'm forgetting something! :)
>
> Otis
> ----
> Performance Monitoring SaaS for Solr -
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
> ----- Original Message -----
> > From: nibing <nibing_...@hotmail.com>
> > To: solr-user@lucene.apache.org
> > Cc:
> > Sent: Friday, January 20, 2012 1:51 AM
> > Subject: RE: Tika0.10 language identifier in Solr3.5.0
> >
> >
> > Hi, Ted Dunning,
> >
> > Thank you for your reply. I understand your point about adding a
> > "language_s" field and keeping all the files together, which
> > speeds up searching.
> >
> > But that raises a problem with the analyzers used in indexing. I assume
> > files encoded in different languages should be handled using different
> > analyzers, i.e. different tokenizers and filters. Can you elaborate a
> > little on the design that you propose, especially on how files encoded in
> > different languages can be handled during indexing?
> > Thank you.
> >
> > Best Regards
> >
> > Ni, Bing
> >
> >
> >>  From: ted.dunn...@gmail.com
> >>  Date: Fri, 20 Jan 2012 03:55:48 +0000
> >>  Subject: Re: Tika0.10 language identifier in Solr3.5.0
> >>  To: solr-user@lucene.apache.org
> >>
> >>  Normally this is done by putting a field on each document rather than
> >>  separating the documents into separate corpora.  Keeping them together
> >>  makes the final search faster.
> >>
> >>  At query time, you can add all of the language keys that you think are
> >>  relevant based on your language id applied to the query and then group the
> >>  results on those keys so that users can inspect different language results.
> >>  If you require the correct language key, you should get pretty good
> >>  retrieval speed.
> >>
> >>  On Fri, Jan 20, 2012 at 3:35 AM, nibing <nibing_...@hotmail.com> wrote:
> >>
> >>  >
> >>  > Hi, Jan Høydahl,
> >>  >
> >>  > You are right. I am hoping to detect the language of a query, so that
> >>  > the searching can be done according to the language detected. Since
> >>  > people often type only a few words, which is too few for detection,
> >>  > that is hard to do.
> >>  >
> >>  > Let me describe the Solr server in my design. It consists of several
> >>  > cores, corresponding to the several languages, which are built during
> >>  > indexing. Since language detection in indexing can be done with the Tika
> >>  > identifier, we are currently OK there. But the problem is about
> >>  > searching. I want to do language detection first, before searching in
> >>  > the individual cores. In the case that the detection result is ambiguous
> >>  > and several languages are returned, we would probably return a set of
> >>  > results and let the user decide which language's result set they want
> >>  > to look into. In general, it is similar to the language support offered
> >>  > by Google. Do you have some suggestions if I want to achieve the
> >>  > multilingual search described above? Thank you.
> >>  >
> >>  > Best Regards
> >>  > Ni, Bing
> >>  >
> >>  > > Subject: Re: Tika0.10 language identifier in Solr3.5.0
> >>  > > From: jan....@cominvent.com
> >>  > > Date: Thu, 19 Jan 2012 12:31:01 +0100
> >>  > > To: solr-user@lucene.apache.org
> >>  > >
> >>  > > Hi,
> >>  > >
> >>  > > You may use the string as you choose, for instance filtering
> >>  > > (fq=language_s:en) or for faceting (facet.field=language_s). What are
> >>  > > you looking to do?
> >>  > >
> >>  > > What would you like to detect on the query side? The language of the
> >>  > > search string? That is very hard since people type very few words into
> >>  > > the search box.
> >>  > >
> >>  > > --
> >>  > > Jan Høydahl, search solution architect
> >>  > > Cominvent AS - www.cominvent.com
> >>  > > Solr Training - www.solrtraining.com
> >>  > >
> >>  > > On 19. jan. 2012, at 09:22, nibing wrote:
> >>  > >
> >>  > > >
> >>  > > > Hi, all,
> >>  > > >
> >>  > > >
> >>  > > >
> >>  > > > I am using Solr3.5.0 which applies Tika0.10 to do language detection,
> >>  > > > and I have a couple of questions about this function.
> >>  > > >
> >>  > > > 1. I can see the outcome of the language detection in a field
> >>  > > > "language_s". But what action will be taken according to the
> >>  > > > different language code? How to configure?
> >>  > > >
> >>  > > > 2. Currently the language detection only happens in indexing. Is it
> >>  > > > possible to use the function in searching as well? How to configure?
> >>  > > >
> >>  > > >
> >>  > > >
> >>  > > > Many thanks.
> >>  > > >
> >>  > > >
> >>  > > > Best Regards
> >>  > > >
> >>  > > > Ni, Bing
> >>  > > >
> >>  > >
> >>  >
> >
>
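
Note: the query-side approach discussed in this thread (filtering on the
"language_s" field with fq, and faceting on it with facet.field) can be
sketched roughly as below. This is only an illustration, not anyone's
actual setup: the function name build_language_query is hypothetical, the
list of candidate languages is assumed to come from whatever language
identifier is applied to the query, and the optional group / group.field
parameters assume that result grouping is available in the Solr version in
use.

```python
from urllib.parse import urlencode


def build_language_query(user_query, candidate_langs,
                         lang_field="language_s", group=False):
    """Return a URL-encoded Solr query string restricted to candidate languages.

    build_language_query is a hypothetical helper; only the Solr parameter
    names (q, fq, facet, facet.field, group, group.field) are Solr's own.
    """
    params = [("q", user_query)]
    if candidate_langs:
        # e.g. fq=language_s:(de OR en) keeps only documents tagged with
        # one of the languages the identifier considered plausible.
        clause = " OR ".join(sorted(candidate_langs))
        params.append(("fq", f"{lang_field}:({clause})"))
    # Facet on the language field so per-language counts can be shown.
    params.append(("facet", "true"))
    params.append(("facet.field", lang_field))
    if group:
        # Assumption: result grouping is supported by the Solr version in use.
        params.append(("group", "true"))
        params.append(("group.field", lang_field))
    return urlencode(params)


if __name__ == "__main__":
    # Ambiguous short query: the identifier could not decide between en and de.
    print(build_language_query("solr tika", ["en", "de"]))
```

The resulting string would be appended to the select handler URL of a
single core that holds all languages, which is the "keep the documents
together" layout suggested earlier in the thread.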
