Nice list. You may still need to mark the language of each document. There are plenty of cross-language collisions: "die" and "boot" have different meanings in German and English. Proper nouns ("Laserjet") may be the same in all languages, which is a different problem if you are trying to get answers in only one language.
At one point, I considered using Unicode language tagging on each token to keep it all straight. Effectively, index "de/Boot" or "en/Laserjet".

wunder

On 3/20/08 9:20 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:

> Unless you can come up with language-neutral tokenization and stemming,
> you need to:
>
> a) know the language of each document.
> b) run a different analyzer depending on the language.
> c) force the user to tell you the language of the query.
> d) run the query through the same analyzer.
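
For illustration, here is a rough sketch of the language-prefixed token idea ("de/Boot", "en/Laserjet") combined with steps a) through d). It is a toy in-memory index in plain Python, not how any particular search engine does it. The names `ANALYZERS`, `tag_tokens`, `build_index`, and `search` are made up for this example, and the analyzers only lowercase and split on whitespace where real ones would tokenize and stem per language.

```python
# Placeholder analyzers: a real setup would plug in a proper
# language-specific tokenizer and stemmer for each language code.
def analyze_en(text):
    return [tok.lower() for tok in text.split()]

def analyze_de(text):
    return [tok.lower() for tok in text.split()]

# (b) one analyzer per language, selected by language code
ANALYZERS = {"en": analyze_en, "de": analyze_de}

def tag_tokens(text, lang):
    """Analyze text with the analyzer for `lang` and prefix each token."""
    analyzer = ANALYZERS[lang]
    return [f"{lang}/{tok}" for tok in analyzer(text)]

def build_index(docs):
    """docs maps doc_id -> (lang, text); (a) the language must be known."""
    index = {}
    for doc_id, (lang, text) in docs.items():
        for term in tag_tokens(text, lang):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query, lang):
    """(c) the caller supplies the query language; (d) same analyzer is used."""
    terms = tag_tokens(query, lang)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: ("en", "the boot is wet"),
    2: ("de", "das Boot ist nass"),
}
index = build_index(docs)
```

With this, `search(index, "boot", "en")` matches only the English document and `search(index, "Boot", "de")` only the German one, since the two terms are stored under different prefixes. The flip side is the proper-noun case: "en/laserjet" and "de/laserjet" become distinct terms, so a cross-language lookup would have to query every prefix.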