Re: Tika0.10 language identifier in Solr3.5.0

Erick Erickson Fri, 20 Jan 2012 08:55:00 -0800

bq: Why not have a polyglot analyzer

That could work, but it makes some compromises and assumes that your
languages are "close enough". I have absolutely no clue how that would
work for English and Chinese, say.


But it also introduces inconsistencies. Take stemming. Even though you
could easily stem in the correct language, throwing all those stems
into the same field can produce interesting results at search time, since
you run the risk of matching a stem produced by one of the other
analysis chains.

Which may be OK from your perspective, but some will find the decreased
precision unacceptable.

The other issue is that it's really tough to determine the language of
a query because queries are so very short. You can sometimes have your app
fake it with any language IDs set in the browser, but they aren't reliable. If
you index into language-specific fields, you can fire the same query
at all the different language-specific fields with edismax-style
handlers.
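
As a rough sketch of what "fire the same query at all the fields" could
look like from a client: the field names (text_en, text_de, text_fr) and the
helper function are made up for illustration, only the defType and qf
parameters come from Solr itself, and you'd adapt all of it to your own schema.

```python
from urllib.parse import urlencode

# Hypothetical per-language fields; the thread does not name any.
LANGUAGE_FIELDS = ["text_en", "text_de", "text_fr"]


def build_edismax_params(user_query, fields=LANGUAGE_FIELDS):
    """Return Solr request parameters that search every language field."""
    return {
        "q": user_query,
        "defType": "edismax",     # use the extended dismax query parser
        "qf": " ".join(fields),   # query all language-specific fields at once
    }


params = build_edismax_params("solr language detection")
query_string = urlencode(params)
print(query_string)
```

Each field keeps its own analysis chain, so the query text gets analyzed
per field rather than by one compromise analyzer.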

Your spelling suggestions will be "interesting" in that scenario.

The point is that putting all the text in a single field is reasonable under
some circumstances but totally unacceptable in others. It depends (tm).
If you're using Western languages, you might even be able to get away
with having a single analyzer that normalizes all the accents
(one of the FoldingFilters) and not have to bother with anything else.
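
To make the accent-normalizing idea concrete, here's a small illustration of
the effect. This is NOT how the Solr folding filters are implemented; it is
just a sketch using Python's unicodedata module, and fold_accents is a
made-up name.

```python
import unicodedata


def fold_accents(text):
    """Lowercase text and strip combining accent marks."""
    # NFKD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


for word in ["Café", "Müller", "señor"]:
    print(word, "->", fold_accents(word))
```

Indexed and query terms both go through the same folding, so "cafe" and
"café" end up matching. Characters with no Unicode decomposition, such as
"ø", pass through this sketch unchanged.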



On Fri, Jan 20, 2012 at 8:19 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Otis,
>
> Can you say why there needs to be a field per language?  Why not have a
> polyglot analyzer?
>
> On Fri, Jan 20, 2012 at 7:29 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
>> Ni, Bing
>>
>> I believe you will need to pre-define fields for all languages you want to
>> handle and specify an appropriate language-specific analyzer for each of
>> those fields.
>> This also means that if you encounter a new language, you will need to
>> adjust your schema to support a new language.  Of course, your Lang
>> Identifier will also need to be trained to recognize that language.
>>
>> I know ElasticSearch allows you to specify a field analyzer at index time,
>> which is very handy.  I don't think you can do that with Solr today, but
>> maybe I'm forgetting something! :)
>>
>> Otis
>> ----
>> Performance Monitoring SaaS for Solr -
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>>
>> ----- Original Message -----
>> > From: nibing <nibing_...@hotmail.com>
>> > To: solr-user@lucene.apache.org
>> > Cc:
>> > Sent: Friday, January 20, 2012 1:51 AM
>> > Subject: RE: Tika0.10 language identifier in Solr3.5.0
>> >
>> >
>> > Hi, Ted Dunning,
>> >
>> > Thank you for your reply. I can understand your point on putting a
>> > "language_s" field and then keeping all the files together, which
>> > speeds up searching.
>> >
>> > But that raises the question of which analyzer to use at index time. I
>> > assume files in different languages should be handled using different
>> > analyzers, i.e. different tokenizers and filters. Can you elaborate a
>> > little bit on the design that you propose, especially on how files in
>> > different languages can be handled during indexing?
>> > Thank you.
>> >
>> > Best Regards
>> >
>> > Ni, Bing
>> >
>> >
>> >>  From: ted.dunn...@gmail.com
>> >>  Date: Fri, 20 Jan 2012 03:55:48 +0000
>> >>  Subject: Re: Tika0.10 language identifier in Solr3.5.0
>> >>  To: solr-user@lucene.apache.org
>> >>
>> >>  Normally this is done by putting a field on each document rather than
>> >>  separating the documents into separate corpora.  Keeping them together
>> >>  makes the final search faster.
>> >>
>> >>  At query time, you can add all of the language keys that you think are
>> >>  relevant based on your language id applied to the query and then group
>> >>  the results on those keys so that users can inspect different language
>> >>  results. If you require the correct language key, you should get pretty
>> >>  good retrieval speed.
>> >>
>> >>  On Fri, Jan 20, 2012 at 3:35 AM, nibing <nibing_...@hotmail.com> wrote:
>> >>
>> >>  >
>> >>  > Hi, Jan Høydahl,
>> >>  >
>> >>  > You are right. I am hoping to detect the language of a query, so that
>> >>  > the searching can be done according to the language detected. Since
>> >>  > people often type only a few words, which is too few to detect
>> >>  > reliably, it is hard to do that.
>> >>  >
>> >>  > Let me describe a little bit about the Solr server in my design. It
>> >>  > consists of several cores, corresponding to the several languages,
>> >>  > which are built during indexing. Since language detection in indexing
>> >>  > can be done with the Tika identifier, we are currently OK there. But
>> >>  > the problem is about searching. I want to do language detection first,
>> >>  > before searching in the individual cores. In the case that the
>> >>  > detection result is ambiguous and several languages are returned, we
>> >>  > would probably return a set of results and let the user decide which
>> >>  > language's set of results they want to look into. In general, it is
>> >>  > similar to the language support in Google. Do you have some
>> >>  > suggestions if I want to achieve multilingual search as described
>> >>  > above?  Thank you.
>> >>  > Best Regards
>> >>  > Ni, Bing
>> >>  > Best Regards
>> >>  > Ni, Bing
>> >>  >
>> >>  >  > Subject: Re: Tika0.10 language identifier in Solr3.5.0
>> >>  > > From: jan....@cominvent.com
>> >>  > > Date: Thu, 19 Jan 2012 12:31:01 +0100
>> >>  > > To: solr-user@lucene.apache.org
>> >>  > >
>> >>  > > Hi,
>> >>  > >
>> >>  > > You may use the string as you choose, for instance for filtering
>> >>  > > (fq=language_s:en) or for faceting (facet.field=language_s). What
>> >>  > > are you looking to do?
>> >>  > >
>> >>  > > What would you like to detect on the query side? The language of the
>> >>  > > search string? That is very hard since people type very few words
>> >>  > > into the search box.
>> >>  > >
>> >>  > > --
>> >>  > > Jan Høydahl, search solution architect
>> >>  > > Cominvent AS - www.cominvent.com
>> >>  > > Solr Training - www.solrtraining.com
>> >>  > >
>> >>  > > On 19. jan. 2012, at 09:22, nibing wrote:
>> >>  > >
>> >>  > > >
>> >>  > > > Hi, all,
>> >>  > > >
>> >>  > > >
>> >>  > > >
>> >>  > > > I am using Solr3.5.0 which applies Tika0.10 to do language
>> >>  > > > detection, and I have a couple of questions about this function.
>> >>  > > >
>> >>  > > >
>> >>  > > >
>> >>  > > > 1. I can see the outcome of the language detection in a field
>> >>  > > > "language_s". But what action will be taken according to the
>> >>  > > > different language code? How to configure?
>> >>  > > >
>> >>  > > >
>> >>  > > >
>> >>  > > > 2. Currently the language detection only happens in indexing. Is it
>> >>  > > > possible to use the function in searching as well? How to configure?
>> >>  > > >
>> >>  > > >
>> >>  > > >
>> >>  > > > Many thanks.
>> >>  > > >
>> >>  > > >
>> >>  > > > Best Regards
>> >>  > > >
>> >>  > > > Ni, Bing
>> >>  > > >
>> >>  > >
>> >>  >
>> >
>>
