Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis. All that makes sense.
On Thu, Mar 20, 2008 at 1:00 PM, Walter Underwood <[EMAIL PROTECTED]> wrote:

> Extreme, but guaranteed to work and it avoids bad IDF when there are
> inter-language collisions. In Ultraseek, we only stored the hash, so
> the size of the source token didn't matter.
>
> Trademarks are a bad source of collisions and anomalous IDF. If you have
> LaserJet support docs in 20 languages, the term "LaserJet" will have
> a document frequency 20X higher than the terms in a single language
> and will score too low.
>
> Ultraseek handles macaronic documents when the script makes it possible,
> for example, roman is sent to the English stemmer in a Japanese document,
> Hangul always goes to the Korean segmenter/stemmer.
>
> A simpler approach is to tag each document with a language, like
> "lang:de", then use a filter query to restrict the documents to the
> query language.
>
> Per-token tagging still strikes me as the "right" approach. It makes
> all sorts of things work, like keeping fuzzy matches within the same
> language. We didn't do it in Ultraseek because it would have been an
> incompatible index change and the benefit didn't justify that.
>
> wunder
> ==
> Walter Underwood
> Former Ultraseek Architect
> Current Entire Netflix Search Department
>
> On 3/20/08 9:45 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>
> > Token/by/token seems a bit extreme. Are you concerned with macaronic
> > documents?
> >
> > On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <[EMAIL PROTECTED]>
> > wrote:
> >
> >> Nice list.
> >>
> >> You may still need to mark the language of each document. There are
> >> plenty of cross-language collisions: "die" and "boot" have different
> >> meanings in German and English. Proper nouns ("Laserjet") may be the
> >> same in all languages, a different problem if you are trying to get
> >> answers in one language.
> >>
> >> At one point, I considered using Unicode language tagging on each
> >> token to keep it all straight. Effectively, index "de/Boot" or
> >> "en/Laserjet".
> >>
> >> wunder
> >>
> >> On 3/20/08 9:20 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Unless you can come up with language-neutral tokenization and
> >>> stemming, you need to:
> >>>
> >>> a) know the language of each document.
> >>> b) run a different analyzer depending on the language.
> >>> c) force the user to tell you the language of the query.
> >>> d) run the query through the same analyzer.
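
For anyone following the thread, here is a minimal Python sketch of the per-token tagging idea ("de/Boot", "en/Laserjet") and its effect on document frequency. It is only an illustration under stated assumptions: the names tag_tokens, build_index, and idf are hypothetical, the three documents are toy data, tokenization is plain whitespace splitting with lowercasing, and IDF is the simple log(N/df) variant. None of this is Ultraseek or Lucene code.

```python
import math
from collections import defaultdict


def tag_tokens(text, lang):
    """Prefix each lowercased whitespace token with its language, e.g. 'de/boot'."""
    return [f"{lang}/{tok.lower()}" for tok in text.split()]


def build_index(docs, per_token_tags):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, (lang, text) in enumerate(docs):
        if per_token_tags:
            terms = tag_tokens(text, lang)
        else:
            terms = [tok.lower() for tok in text.split()]
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


def idf(index, term, n_docs):
    """Simple log(N / df) inverse document frequency (one common variant)."""
    return math.log(n_docs / len(index[term]))


# Toy corpus (an assumption for illustration): (language, text) pairs.
docs = [
    ("en", "LaserJet boot error"),
    ("de", "LaserJet Boot kaufen"),
    ("fr", "LaserJet erreur"),
]

plain = build_index(docs, per_token_tags=False)
tagged = build_index(docs, per_token_tags=True)

# Untagged, English "boot" and German "Boot" collide in one posting list,
# and "laserjet" is counted once per language.
print(sorted(plain["boot"]), len(plain["laserjet"]))

# Tagged, each language keeps its own term and its own document frequency.
print(sorted(tagged["de/boot"]), len(tagged["en/laserjet"]))
print(idf(plain, "laserjet", len(docs)), idf(tagged, "en/laserjet", len(docs)))
```

The query side would need the same treatment: a query known to be German is rewritten to "de/boot" before lookup, which is why the query language still has to be known, as in points c) and d) of the original list.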