Extreme, but guaranteed to work, and it avoids bad IDF when there are inter-language collisions. In Ultraseek, we stored only the hash, so the size of the source token didn't matter.
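A minimal sketch of what that can look like, assuming the "de/Boot"-style per-token tags from the quoted message below; the tag scheme and the MD5 choice are illustrative, not Ultraseek's actual code:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class TaggedTokenHasher {
        // The index term is a fixed-size hash of "lang/token", e.g.
        // hash("de/boot") vs. hash("en/boot"), so the same surface form
        // in two languages stays two distinct index terms, and a long
        // source token costs no extra dictionary space.
        public static String hashTaggedToken(String lang, String token)
                throws NoSuchAlgorithmException {
            String tagged = lang + "/" + token.toLowerCase();
            MessageDigest md = MessageDigest.getInstance("MD5"); // illustrative choice
            byte[] digest = md.digest(tagged.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }

Because each language gets its own index term, document frequency (and therefore IDF) is computed per language, which is what fixes the problem described next.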
Trademarks are a bad source of collisions and anomalous IDF. If you have
LaserJet support docs in 20 languages, the term "LaserJet" will have a
document frequency 20X higher than the terms in a single language and
will score too low.

Ultraseek handles macaronic documents when the script makes it possible:
for example, roman text is sent to the English stemmer in a Japanese
document, and Hangul always goes to the Korean segmenter/stemmer.

A simpler approach is to tag each document with a language, like
"lang:de", then use a filter query to restrict the documents to the
query language.

Per-token tagging still strikes me as the "right" approach. It makes all
sorts of things work, like keeping fuzzy matches within the same
language. We didn't do it in Ultraseek because it would have been an
incompatible index change and the benefit didn't justify it.

wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department

On 3/20/08 9:45 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:

> Token-by-token seems a bit extreme. Are you concerned with macaronic
> documents?
>
> On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <[EMAIL PROTECTED]>
> wrote:
>
>> Nice list.
>>
>> You may still need to mark the language of each document. There are
>> plenty of cross-language collisions: "die" and "boot" have different
>> meanings in German and English. Proper nouns ("Laserjet") may be the
>> same in all languages, a different problem if you are trying to get
>> answers in one language.
>>
>> At one point, I considered using Unicode language tagging on each
>> token to keep it all straight. Effectively, index "de/Boot" or
>> "en/Laserjet".
>>
>> wunder
>>
>> On 3/20/08 9:20 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>>
>>> Unless you can come up with language-neutral tokenization and
>>> stemming, you need to:
>>>
>>> a) know the language of each document.
>>> b) run a different analyzer depending on the language.
>>> c) force the user to tell you the language of the query.
>>> d) run the query through the same analyzer.
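For reference, the script-based routing described above might look roughly
like this; stemEnglish and segmentKorean are placeholder names standing in
for real per-language analyzers, not Ultraseek or Lucene APIs:

    public class ScriptRouter {
        // Route each token by the Unicode script of its first code point,
        // independent of the document's declared language.
        public static String route(String token) {
            if (token.isEmpty()) {
                return token;
            }
            switch (Character.UnicodeScript.of(token.codePointAt(0))) {
                case LATIN:
                    return stemEnglish(token);   // roman text, even in a Japanese doc
                case HANGUL:
                    return segmentKorean(token); // Hangul always goes to Korean
                default:
                    return token;                // pass unhandled scripts through
            }
        }

        // Placeholders: a real system would call language-specific
        // stemmers/segmenters here.
        private static String stemEnglish(String token) { return token; }
        private static String segmentKorean(String token) { return token; }
    }

The simpler document-level approach needs no analysis-time routing at all:
in Solr terms it amounts to adding a filter such as fq=lang:de to each
request, at the cost of the cross-language IDF anomalies described above.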