+1 That's exactly what we need, too.
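
To make the request concrete, here is a rough sketch in plain Python (not Lucene/Solr code) of the behaviour Shawn describes in the quoted message: tokenize on whitespace first, then run a filter that only touches tokens containing CJK characters and passes everything else through unchanged. The names `cjk_filter`, `bigrams`, and `analyze` are hypothetical, the character ranges are an assumption covering only Hiragana, Katakana, and the main CJK ideograph block, and splitting CJK runs into overlapping bigrams is likewise an assumption about what "handle the CJK characters" should mean.

```python
import re

# Assumption: "CJK" means only these Unicode blocks (Hiragana, Katakana,
# CJK Unified Ideographs). A real filter would likely cover more ranges.
CJK_RUN = re.compile(r"([\u3040-\u30ff\u4e00-\u9fff]+)")


def bigrams(run):
    """Return overlapping character bigrams; a single character stays as-is."""
    if len(run) < 2:
        return [run]
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cjk_filter(tokens):
    """Pass non-CJK tokens through unchanged; split only the CJK runs."""
    for token in tokens:
        if not CJK_RUN.search(token):
            yield token  # no CJK characters: do nothing at all
            continue
        # re.split with a capturing group keeps the CJK runs in the result.
        for part in CJK_RUN.split(token):
            if not part:
                continue
            if CJK_RUN.fullmatch(part):
                yield from bigrams(part)
            else:
                yield part


def analyze(text):
    """Whitespace tokenization followed by the CJK-only filter."""
    return list(cjk_filter(text.split()))
```

For example, `analyze("Hello 東京都 world")` gives `['Hello', '東京', '京都', 'world']`, while a token such as `foo-bar` comes out untouched.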

On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey <elyog...@elyograg.org> wrote:

> On 11/29/2010 3:15 PM, Jacob Elder wrote:
>
>> I am looking for a clear example of using more than one tokenizer for a
>> single source field. My application has a single "body" field which until
>> recently was all Latin characters, but we're now encountering both English
>> and Japanese words in a single message. Obviously, we need to be using CJK
>> in addition to WhitespaceTokenizerFactory.
>
> What I'd like to see is a CJK filter that runs after tokenization
> (whitespace in my case) and doesn't do anything but handle the CJK
> characters. If there are no CJK characters in the token, it should do
> nothing at all. The CJK tokenizer does a whole host of other things that I
> want to handle myself.
>
> Shawn

-- 
Jacob Elder
@jelder
(646) 535-3379