Token-by-token tagging seems a bit extreme. Are you concerned about macaronic (mixed-language) documents?
On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <[EMAIL PROTECTED]> wrote:
> Nice list.
>
> You may still need to mark the language of each document. There are
> plenty of cross-language collisions: "die" and "boot" have different
> meanings in German and English. Proper nouns ("Laserjet") may be the
> same in all languages, a different problem if you are trying to get
> answers in one language.
>
> At one point, I considered using Unicode language tagging on each
> token to keep it all straight. Effectively, index "de/Boot" or
> "en/Laserjet".
>
> wunder
>
> On 3/20/08 9:20 AM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>
> > Unless you can come up with language-neutral tokenization and stemming,
> > you need to:
> >
> > a) know the language of each document.
> > b) run a different analyzer depending on the language.
> > c) force the user to tell you the language of the query.
> > d) run the query through the same analyzer.
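
For illustration only, here is a minimal Python sketch of the two ideas in the quoted thread: steps (a) through (d), and the "de/Boot" style of language-prefixed index terms. Everything in it is hypothetical: the toy analyzers `analyze_en` and `analyze_de`, the `ANALYZERS` table, and the in-memory index stand in for real per-language analyzers and a real search engine. Unlike the "de/Boot" example, this sketch lowercases tokens before tagging them.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Analyzer = Callable[[str], List[str]]


def analyze_en(text: str) -> List[str]:
    """Toy English analyzer: lowercase, split on whitespace, drop a plural 's'."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t
            for t in text.lower().split()]


def analyze_de(text: str) -> List[str]:
    """Toy German analyzer: lowercase, split on whitespace, drop a trailing 'en'."""
    return [t[:-2] if len(t) > 4 and t.endswith("en") else t
            for t in text.lower().split()]


# (b) a different analyzer per language (hypothetical lookup table)
ANALYZERS: Dict[str, Analyzer] = {"en": analyze_en, "de": analyze_de}


def tagged_tokens(lang: str, text: str) -> List[str]:
    """Analyze text with the analyzer for `lang` and prefix each token with it."""
    analyzer = ANALYZERS[lang]
    return [f"{lang}/{tok}" for tok in analyzer(text)]


def build_index(docs: Iterable[Tuple[int, str, str]]) -> Dict[str, Set[int]]:
    """(a) each document carries its language; index language-tagged terms."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, lang, text in docs:
        for term in tagged_tokens(lang, text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], lang: str, query: str) -> Set[int]:
    """(c) the caller states the query language; (d) same analyzer as indexing."""
    terms = tagged_tokens(lang, query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


DOCS = [
    (1, "en", "The boot is wet"),
    (2, "de", "Das Boot ist nass"),
    (3, "en", "Laserjet printers"),
]
INDEX = build_index(DOCS)

print(search(INDEX, "en", "boot"))      # matches only the English document
print(search(INDEX, "de", "Boot"))      # matches only the German document
print(search(INDEX, "de", "Laserjet"))  # empty: the proper noun is tagged "en"
```

The last query shows the proper-noun problem mentioned above: once every term carries a language prefix, "Laserjet" in an English document is invisible to a German query unless proper nouns are handled separately.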