On 8/5/14, 8:36 AM, Rich Cariens wrote:
Of course this is extremely primitive and basic, but I think it would be possible to write a CharFilter or TokenFilter that inspects the entire TokenStream to guess the language(s), perhaps even noting where languages change. Language and position information could be tracked, the TokenStream rewound and then Tokens emitted with "LanguageAttributes" for downstream Token stemmers to deal with.
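To make the buffer-then-tag idea concrete, here is a minimal sketch in plain Python rather than against the Lucene API. Everything in it is an assumption for illustration: the tiny `STOPWORDS` table, the `guess_language` and `tag_tokens` helpers, the fixed `window` size, and the `"und"` (undetermined) fallback are all hypothetical, and a real filter would use a proper language detector and carry the result in a token attribute.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stopword lists (an assumption for this sketch;
# a real detector would use a trained model, not a lookup table).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def guess_language(tokens: Iterable[str], default: str = "und") -> str:
    """Guess a language by counting stopword hits; return `default` on no hits or a tie."""
    scores = Counter()
    for tok in tokens:
        for lang, words in STOPWORDS.items():
            if tok.lower() in words:
                scores[lang] += 1
    if not scores:
        return default
    ranked = scores.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return default  # ambiguous: two languages scored the same
    return ranked[0][0]


def tag_tokens(tokens: Iterable[str], window: int = 5) -> List[Tuple[str, str]]:
    """Buffer the whole stream, then re-emit each token paired with a language tag.

    The language is guessed per fixed-size window, which is a crude stand-in
    for noting where languages change.
    """
    buffered = list(tokens)  # consume the stream first, the "rewind" step
    tagged: List[Tuple[str, str]] = []
    for start in range(0, len(buffered), window):
        chunk = buffered[start:start + window]
        lang = guess_language(chunk)
        tagged.extend((tok, lang) for tok in chunk)
    return tagged
```

Calling `tag_tokens("the cat is in the house das ist nicht der Hund".split(), window=6)` yields each token paired with a language code that a downstream stemmer could switch on. Fixed windows are only a placeholder; language boundaries rarely fall on window edges.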
I'm curious how you are planning to handle the LanguageAttribute. Would each token carry this attribute, marking it as part of a span of tokens in a given language? But then how would you search English documents that include the term "die" while skipping all the German documents, which are most likely to contain "die" as well? Automatic language detection works reasonably well on long, ordinary running text, but it does not work well on short text. What strategy would you use to deal with short text?
-- TK