Token-by-token seems a bit extreme. Are you concerned with macaronic
documents?

On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <[EMAIL PROTECTED]>
wrote:

> Nice list.
>
> You may still need to mark the language of each document. There are
> plenty of cross-language collisions: "die" and "boot" have different
> meanings in German and English. Proper nouns ("Laserjet") may be the
> same in all languages, a different problem if you are trying to get
> answers in one language.
>
> At one point, I considered using Unicode language tagging on each
> token to keep it all straight. Effectively, index "de/Boot" or
> "en/Laserjet".
>
> wunder
>
> On 3/20/08 9:20 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>
> > Unless you can come up with language-neutral tokenization and stemming,
> > you need to:
> >
> > a) know the language of each document.
> > b) run a different analyzer depending on the language.
> > c) force the user to tell you the language of the query.
> > d) run the query through the same analyzer.
>
>
>
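For concreteness, here is a rough sketch of what steps a) through d) plus
the "de/Boot" style of tagging might look like. Everything in it is an
assumption made for illustration: the two toy analyzers, the names
tagged_tokens, build_index and search, and the sample documents are all
hypothetical, and the terms are lowercased, so the index holds "de/boot"
rather than "de/Boot". It is not Lucene or Solr code.

```python
import re

# Hypothetical per-language analyzers: toy stand-ins for real
# tokenizers and stemmers, here only to show the dispatch.
def analyze_en(text):
    # Lowercase, keep runs of letters, strip a trailing plural "s".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def analyze_de(text):
    # Lowercase, keep runs of letters, strip a trailing "en".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t[:-2] if len(t) > 4 and t.endswith("en") else t for t in tokens]

ANALYZERS = {"en": analyze_en, "de": analyze_de}

def tagged_tokens(text, lang):
    # (b)/(d): pick the analyzer by language, then prefix every token
    # with its language so "en/boot" and "de/boot" are distinct terms.
    return [f"{lang}/{tok}" for tok in ANALYZERS[lang](text)]

def build_index(docs):
    # (a): each document arrives with its language already known.
    index = {}
    for doc_id, (lang, text) in docs.items():
        for term in tagged_tokens(text, lang):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query, lang):
    # (c): the caller states the query language.
    # (d): the query goes through the same analyzer as the documents.
    terms = tagged_tokens(query, lang)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {
    1: ("en", "The boot is in the closet"),
    2: ("de", "Das Boot liegt im Hafen"),
}
index = build_index(docs)
```

With this sketch, search(index, "boot", "en") matches only document 1 and
search(index, "Boot", "de") matches only document 2. Note that it tags per
document rather than per token, so it does not address documents that mix
languages, and a proper noun like "Laserjet" would have to be queried once
per language to be found everywhere.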
