Re: Preparing the ground for a real multilang index

Paul Libbrecht Thu, 02 Jul 2009 15:49:54 -0700

I believe the proper way is for the server to compute a list of accepted languages in order of preference, drawing on the web-platform language (e.g. the user setting) and the values in the Accept-Language HTTP header (which come from the browser or platform).
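
A minimal sketch of how such a list could be computed, in Python. The function names are made up for illustration, and putting the user setting ahead of the header values is an assumption, not something fixed above. Only the q-values of the header are honoured; everything else in it is ignored.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        tag = tag.strip().lower()
        params = params.strip()
        quality = 1.0  # a tag without a q-value counts as q=1
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unreadable q-value: drop the tag
        if tag and tag != "*" and quality > 0:
            # sort by descending quality, then by position in the header
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def preferred_languages(user_setting, accept_language_header):
    """User setting first (assumption), then header languages, no duplicates."""
    candidates = [user_setting] if user_setting else []
    candidates += parse_accept_language(accept_language_header)
    ordered = []
    for tag in candidates:
        primary = tag.lower().split("-")[0]  # "en-US" -> "en"
        if primary not in ordered:
            ordered.append(primary)
    return ordered


print(preferred_languages("de", "en-US,en;q=0.9,fr;q=0.8,de;q=0.7"))
# ['de', 'en', 'fr']
```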


Then you expand your query for "surfing waves" (say) to:
- phrase query: "surfing waves" exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
  - English: surf wav (^1.0)
  - German: surfing wave (^0.9)
  - ....

- then maybe even try the phonetic analyzer (matched in a separate field probably)
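
The expansion above could be sketched roughly as follows, in Python, producing a string in Lucene query-parser syntax. The field names (text_exact, text_en, text_de) are hypothetical, the stems are handed in rather than computed (a real setup would leave stemming to the per-field analyzers), the 0.1 step between languages is just read off the two boosts in the example, and the terms are assumed to be plain words that need no escaping. The phonetic step is left out.

```python
def expand_query(terms, stems_by_language, exact_field="text_exact"):
    """Build one boosted OR query from the original terms and stemmed variants.

    stems_by_language is a list of (language, stems) pairs, most preferred
    language first. All field names are hypothetical.
    """
    phrase = " ".join(terms)
    clauses = [
        f'{exact_field}:"{phrase}"^2.0',  # exact phrase, highest boost
        f"{exact_field}:({phrase})^1.5",  # both terms, unstemmed
    ]
    for rank, (language, stems) in enumerate(stems_by_language):
        # 1.0 for the first language, 0.9 for the second, and so on,
        # never dropping below 0.1
        boost = max(round(1.0 - 0.1 * rank, 2), 0.1)
        stemmed = " ".join(stems)
        clauses.append(f"text_{language}:({stemmed})^{boost}")
    return " OR ".join(clauses)


print(expand_query(
    ["surfing", "waves"],
    [("en", ["surf", "wav"]), ("de", ["surfing", "wave"])],
))
```

The language order would come from the list of accepted languages, so the user's first language gets the highest stemmed boost.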

I think this is a common pattern on the web where the users, browsers, and servers are all somewhat multilingual.


paul

On 2 Jul 2009, at 22:15, Otis Gospodnetic wrote:

Michael,
I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend... unless your queries are sufficiently long that their language can be identified.
Here is a handy tool for playing with language identification:

 http://www.sematext.com/demo/lid/

You'll see how hard it is to guess the language of very short texts. :)
You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word "die" (English and German, as you know) and what happens when you include that in an OR: it matches the English verb and the German article alike. And does it make sense to include a "very language specific word", say "wunderbar", in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar
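
A small sketch of what passing the language to the backend could look like, in Python. TITLE_ENG and TITLE_GER are the field names from the original question; TITLE_OTHER is a hypothetical catch-all field for everything else, and the language codes are an assumption too.

```python
# Per-language title fields; TITLE_OTHER is a hypothetical unstemmed catch-all.
LANGUAGE_FIELDS = {"en": "TITLE_ENG", "de": "TITLE_GER"}
FALLBACK_FIELD = "TITLE_OTHER"


def title_query(term, language=None):
    """Query only the field that matches the known query language."""
    field = LANGUAGE_FIELDS.get(language, FALLBACK_FIELD)
    return f"{field}:{term}"


print(title_query("die", "en"))  # TITLE_ENG:die
print(title_query("die", "de"))  # TITLE_GER:die
print(title_query("die"))        # TITLE_OTHER:die
```

The same word ends up in exactly one field, so "die" is analyzed either as English or as German, never both.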
Otis--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Michael Lackhoff <[email protected]>
To: [email protected]
Sent: Thursday, July 2, 2009 2:58:41 PM
Subject: Preparing the ground for a real multilang index

As pointed out in the recent thread about stemmers and other language
specifics I should handle them all in their own right. But how?

The first problem is how to know the language. Sometimes I have a
language identifier within the record, sometimes I have more than one,
sometimes I have none. How should I handle the non-obvious cases?

Given I somehow know record1 is English and record2 is German. Then I
need all my (relevant) fields for every language, e.g. I will have
TITLE_ENG and TITLE_GER and both will have their respective stemmer.
But what about exotic languages? Use a catch-all "language" without a
stemmer?
Now a user searches for TITLE:term and I don't know beforehand the
language of "term". Do I have to expand the query to something like
"TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ..." or is there
some sort of copyfield for analyzed fields? Then I could just copy all
the TITLE_* fields to TITLE and not bother with the language of the
query.
Are there any solutions that prevent an index with thousands of fields
and dozens of ORed query terms?
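
For illustration, the expansion described above written out as a small Python sketch. The helper name is made up, and the suffix list stands in for whatever set of language fields the index ends up with.

```python
def expand_across_fields(term, suffixes, base="TITLE"):
    """OR the same term over every per-language variant of one field."""
    return " OR ".join(f"{base}_{suffix}:{term}" for suffix in suffixes)


print(expand_across_fields("term", ["ENG", "GER", "XY"]))
# TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term
```

The query grows by one clause per language for every field searched, which is the growth in question here.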

I know I will have to implement some better multilanguage support but
would also like to keep it as simple as possible.

-Michael
