Nice list.

You may still need to mark the language of each document. There are
plenty of cross-language collisions: "die" and "boot" have different
meanings in German and English. Proper nouns ("Laserjet") may be the
same in all languages, which is a different problem if you are trying
to return answers in only one language.

At one point, I considered using Unicode language tagging on each
token to keep it all straight. Effectively, you would index "de/Boot"
or "en/Laserjet" instead of the bare token.
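
A rough sketch of what I mean, in Python rather than any real analyzer
chain. It uses a plain "lang/" prefix rather than actual Unicode tag
characters, and the names tag_tokens and build_index are made up for
illustration. It also lowercases tokens, which is my assumption and not
something the prefix idea requires.

```python
def tag_tokens(text, lang):
    """Prefix each whitespace-separated token with its language code."""
    # Lowercasing is an assumption of this sketch, not part of the idea.
    return [f"{lang}/{token.lower()}" for token in text.split()]


def build_index(docs):
    """Build a toy inverted index: tagged term -> set of document ids.

    docs is an iterable of (doc_id, lang, text) tuples, so the language
    of every document has to be known up front.
    """
    index = {}
    for doc_id, lang, text in docs:
        for term in tag_tokens(text, lang):
            index.setdefault(term, set()).add(doc_id)
    return index


docs = [
    (1, "de", "Das Boot ist voll"),
    (2, "en", "Boot the printer"),
]
index = build_index(docs)
```

With this, "de/boot" and "en/boot" are separate terms, so a German
query never matches the English document by accident. The cost is that
a proper noun like "Laserjet" also gets split per language, so a query
that should find it everywhere has to be expanded across prefixes.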

wunder

On 3/20/08 9:20 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:

> Unless you can come up with language-neutral tokenization and stemming,
> you need to:
>
> a) know the language of each document.
> b) run a different analyzer depending on the language.
> c) force the user to tell you the language of the query.
> d) run the query through the same analyzer.
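
Steps a) through d) above could look roughly like the following Python
sketch. The two "analyzers" are toy stand-ins with crude suffix
stripping, not real stemmers, and every name here (analyze_en,
analyze_de, ANALYZERS, index_document, search) is hypothetical.

```python
import re


def analyze_en(text):
    """Toy English analyzer: lowercase, split on letters, drop a final 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def analyze_de(text):
    """Toy German analyzer: lowercase, split on letters, drop a final 'en'."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [t[:-2] if t.endswith("en") and len(t) > 4 else t for t in tokens]


# b) one analyzer per language
ANALYZERS = {"en": analyze_en, "de": analyze_de}


def analyze(text, lang):
    if lang not in ANALYZERS:
        raise ValueError(f"no analyzer for language {lang!r}")
    return ANALYZERS[lang](text)


def index_document(index, doc_id, lang, text):
    # a) the caller must supply the document language
    for term in analyze(text, lang):
        index.setdefault((lang, term), set()).add(doc_id)


def search(index, lang, query):
    # c) the user supplies the query language
    # d) the query goes through the same analyzer as the documents
    terms = analyze(query, lang)
    if not terms:
        return set()
    return set.intersection(*(index.get((lang, t), set()) for t in terms))


index = {}
index_document(index, 1, "en", "Laser printers")
index_document(index, 2, "de", "Die Drucker drucken")
```

Because the index is keyed on (language, term), an English query for
"die" does not hit the German article in document 2, and a German query
for "drucken" matches document 2 only through the German analyzer.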
