On 8/5/14, 8:36 AM, Rich Cariens wrote:
Of course this is extremely primitive and basic, but I think it would be
possible to write a CharFilter or TokenFilter that inspects the entire
TokenStream to guess the language(s), perhaps even noting where languages
change. Language and position information could be tracked, the TokenStream
rewound and then Tokens emitted with "LanguageAttributes" for downstream
Token stemmers to deal with.
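
As a rough illustration of the two-pass idea quoted above, here is a
minimal sketch in plain Python rather than against the Lucene API. The
names STOPWORDS, guess_language and tag_tokens are hypothetical, the
stopword lists are toy data, and the fixed-size window is an assumption
standing in for real detection of where languages change.

```python
from collections import Counter

# Hypothetical toy stopword lists; a real detector would use a trained model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}


def guess_language(tokens, default="unknown"):
    """Guess the language of a token window by counting stopword hits."""
    hits = Counter()
    for tok in tokens:
        for lang, words in STOPWORDS.items():
            if tok.lower() in words:
                hits[lang] += 1
    if not hits:
        return default
    return hits.most_common(1)[0][0]


def tag_tokens(tokens, window=6):
    """Buffer the whole stream, then re-emit (term, position, language)."""
    buffered = list(tokens)  # pass 1: consume the entire stream
    tagged = []
    # pass 2: "rewind" and emit each token with the language of its window
    for start in range(0, len(buffered), window):
        chunk = buffered[start:start + window]
        lang = guess_language(chunk)
        for offset, term in enumerate(chunk):
            tagged.append((term, start + offset, lang))
    return tagged


text = "the cat is in the house und das ist nicht der hund"
tagged = tag_tokens(text.split())
```

A downstream stemmer could then pick its rules from the third element
of each tuple. Buffering the whole stream costs memory proportional to
the document length, which is the price of looking ahead.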

I'm curious how you are planning to handle the LanguageAttribute.
Would each token carry this attribute, denoting a span of tokens
in a given language? But then how would you search
English documents that include the term "die" while skipping
all the German documents, which are most likely to contain "die"?
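
To make the question concrete, here is one possible shape of an answer,
sketched in plain Python and not taken from Lucene or Solr: keep the
language next to each posting and filter on it at query time. The class
LanguageAwareIndex and its methods are hypothetical names, and the
per-token language tags are assumed to come from some earlier detection
step.

```python
from collections import defaultdict


class LanguageAwareIndex:
    """Toy inverted index whose postings remember each token's language."""

    def __init__(self):
        # term -> set of (doc_id, language) pairs
        self.postings = defaultdict(set)

    def add(self, doc_id, tagged_tokens):
        """Index an iterable of (term, language) pairs for one document."""
        for term, lang in tagged_tokens:
            self.postings[term.lower()].add((doc_id, lang))

    def search(self, term, lang=None):
        """Return sorted doc ids containing term, optionally in one language."""
        matches = self.postings.get(term.lower(), set())
        return sorted(
            {doc for doc, tok_lang in matches if lang is None or tok_lang == lang}
        )


index = LanguageAwareIndex()
index.add("en1", [("heroes", "en"), ("never", "en"), ("die", "en")])
index.add("de1", [("die", "de"), ("katze", "de")])

english_only = index.search("die", lang="en")
any_language = index.search("die")
```

This only works if the query side also knows which language the user
means, so the problem moves to the query rather than going away.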

Automatic language detection works OK on long text with
ordinary content, but it doesn't work well on short
text. What strategy would you use to deal with short text?

--
TK
