Extreme, but guaranteed to work, and it avoids bad IDF when there are inter-language collisions. In Ultraseek, we stored only the hash, so the size of the source token didn't matter.
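A minimal sketch of what that can look like, assuming the "de/Boot"-style per-token tags from the quoted message below; the tag scheme and the MD5 choice are illustrative, not Ultraseek's actual code:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class TaggedTokenHasher {
        // The index term is a fixed-size hash of "lang/token", e.g.
        // hash("de/boot") vs. hash("en/boot"), so the same surface form
        // in two languages stays two distinct index terms, and a long
        // source token costs no extra dictionary space.
        public static String hashTaggedToken(String lang, String token)
                throws NoSuchAlgorithmException {
            String tagged = lang + "/" + token.toLowerCase();
            MessageDigest md = MessageDigest.getInstance("MD5"); // illustrative choice
            byte[] digest = md.digest(tagged.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }

Because each language gets its own index term, document frequency (and therefore IDF) is computed per language, which is what fixes the problem described next.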
Trademarks are a bad source of collisions and anomalous IDF. If you have
LaserJet support docs in 20 languages, the term "LaserJet" will have a
document frequency 20X higher than the terms in a single language and
will score too low.

Ultraseek handles macaronic documents when the script makes it possible:
for example, roman text is sent to the English stemmer in a Japanese
document, and Hangul always goes to the Korean segmenter/stemmer.

A simpler approach is to tag each document with a language, like
"lang:de", then use a filter query to restrict the documents to the
query language.

Per-token tagging still strikes me as the "right" approach. It makes all
sorts of things work, like keeping fuzzy matches within the same
language. We didn't do it in Ultraseek because it would have been an
incompatible index change and the benefit didn't justify it.

wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department

On 3/20/08 9:45 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:

> Token-by-token seems a bit extreme. Are you concerned with macaronic
> documents?
>
> On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <[EMAIL PROTECTED]>
> wrote:
>
>> Nice list.
>>
>> You may still need to mark the language of each document. There are
>> plenty of cross-language collisions: "die" and "boot" have different
>> meanings in German and English. Proper nouns ("Laserjet") may be the
>> same in all languages, a different problem if you are trying to get
>> answers in one language.
>>
>> At one point, I considered using Unicode language tagging on each
>> token to keep it all straight. Effectively, index "de/Boot" or
>> "en/Laserjet".
>>
>> wunder
>>
>> On 3/20/08 9:20 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>>
>>> Unless you can come up with language-neutral tokenization and
>>> stemming, you need to:
>>>
>>> a) know the language of each document.
>>> b) run a different analyzer depending on the language.
>>> c) force the user to tell you the language of the query.
>>> d) run the query through the same analyzer.
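For reference, the script-based routing described above might look roughly
like this; stemEnglish and segmentKorean are placeholder names standing in
for real per-language analyzers, not Ultraseek or Lucene APIs:

    public class ScriptRouter {
        // Route each token by the Unicode script of its first code point,
        // independent of the document's declared language.
        public static String route(String token) {
            if (token.isEmpty()) {
                return token;
            }
            switch (Character.UnicodeScript.of(token.codePointAt(0))) {
                case LATIN:
                    return stemEnglish(token);   // roman text, even in a Japanese doc
                case HANGUL:
                    return segmentKorean(token); // Hangul always goes to Korean
                default:
                    return token;                // pass unhandled scripts through
            }
        }

        // Placeholders: a real system would call language-specific
        // stemmers/segmenters here.
        private static String stemEnglish(String token) { return token; }
        private static String segmentKorean(String token) { return token; }
    }

The simpler document-level approach needs no analysis-time routing at all:
in Solr terms it amounts to adding a filter such as fq=lang:de to each
request, at the cost of the cross-language IDF anomalies described above.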