Yes, each token could have a LanguageAttribute on it, just like ScriptAttributes. I didn't *think* a span would be necessary.
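
To illustrate what I mean by a per-token attribute, here is a rough sketch in plain Python rather than Lucene code. The Token class, the STEMMERS table, and the toy stemming rules are all made up for illustration; a real version would presumably be a Lucene Attribute read by a downstream TokenFilter.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    lang: str  # hypothetical per-token language tag, e.g. "eng" or "deu"


def stem_eng(word):
    # Toy rule, not a real stemmer: strip a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stem_deu(word):
    # Toy rule, not a real stemmer: strip a trailing "en".
    return word[:-2] if word.endswith("en") and len(word) > 4 else word


# Hypothetical lookup from language tag to stemming function.
STEMMERS = {"eng": stem_eng, "deu": stem_deu}


def stem_tokens(tokens):
    """Stem each token with the stemmer selected by its own language tag."""
    out = []
    for tok in tokens:
        # Tokens with an unknown language pass through unchanged.
        stemmer = STEMMERS.get(tok.lang, lambda w: w)
        out.append(Token(stemmer(tok.text), tok.lang))
    return out
```

Because every token carries its own tag, the stemmer never needs to know where one language's run of tokens starts or ends, which is why I didn't think a span was needed.
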
I would also add a multivalued "lang" field to the document. Searching English documents for "die" might look like: "q=die&lang=eng". The "lang" param could tell the RequestHandler to add a filter query "fq=lang:eng" to constrain the search to the English corpus, as well as recruit an English analyzer when tokenizing the "die" query term.

Since I can't control text length, I would just let the language detection tool do its best and not sweat it.

On Wed, Aug 6, 2014 at 12:11 AM, TK <kuros...@sonic.net> wrote:

> On 8/5/14, 8:36 AM, Rich Cariens wrote:
>
>> Of course this is extremely primitive and basic, but I think it would be
>> possible to write a CharFilter or TokenFilter that inspects the entire
>> TokenStream to guess the language(s), perhaps even noting where languages
>> change. Language and position information could be tracked, the
>> TokenStream rewound and then Tokens emitted with "LanguageAttributes" for
>> downstream Token stemmers to deal with.
>
> I'm curious how you are planning to handle the languageAttribute.
> Would each token have this attribute denoting a span of Tokens
> with a language? But then how would you search
> English documents that includes the term "die" while skipping
> all the German documents which most likely to have "die"?
>
> Automatic language detection works OK for long text of
> regular kind of contents. But it doesn't work well with short
> text. What strategy would you use to deal with short text?
>
> --
> TK
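
Coming back to the "lang" parameter idea in my reply: here is a minimal sketch of the request rewriting I have in mind, in plain Python rather than as an actual Solr RequestHandler. The function name, the ANALYZERS table, and the analyzer names are assumptions for illustration only; the "lang" param itself is my proposal, not an existing Solr parameter.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical mapping from language code to an analyzer name.
ANALYZERS = {"eng": "english_analyzer", "deu": "german_analyzer"}
DEFAULT_ANALYZER = "standard_analyzer"


def rewrite_request(query_string):
    """Turn a 'lang' param into a filter query and pick a query analyzer.

    Returns (rewritten_query_string, analyzer_name).
    """
    params = parse_qs(query_string)
    # Remove the custom param so it is not passed on to the search itself.
    lang = params.pop("lang", [None])[0]
    analyzer = DEFAULT_ANALYZER
    if lang is not None:
        # Constrain the search to documents tagged with this language.
        params.setdefault("fq", []).append(f"lang:{lang}")
        # Tokenize the query term with the matching analyzer, if we have one.
        analyzer = ANALYZERS.get(lang, DEFAULT_ANALYZER)
    return urlencode(params, doseq=True), analyzer
```

Calling rewrite_request("q=die&lang=eng") would keep "q=die", add the "lang:eng" filter query (URL-encoded), and select the English analyzer, so German documents containing "die" are filtered out before scoring.
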