On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars and re-tokenizes
> any CJK chars into one token per char. This custom filter was written by
> someone other than me; it is open source, but I'm not sure whether it's
> actually in a public repo, or how well documented it is. I can put you in
> touch with the author to ask. There may also be a more standard filter
> than the custom one I'm using that does the same thing?
You are describing what StandardTokenizer does: it emits each CJK character as its own token, so you may not need the custom filter at all.
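
For reference, here is a rough sketch (in Python, just to illustrate the behavior, not the actual Lucene filter) of the pipeline Jonathan describes: whitespace tokenization followed by splitting any token containing CJK characters into one token per character. The Unicode ranges checked are an assumption covering only the common CJK ideograph blocks.

```python
def is_cjk(ch):
    # Assumption: only the common CJK ideograph blocks are checked here;
    # a production filter would cover more ranges (kana, Hangul, etc.).
    return any(lo <= ord(ch) <= hi for lo, hi in [
        (0x4E00, 0x9FFF),   # CJK Unified Ideographs
        (0x3400, 0x4DBF),   # CJK Extension A
        (0xF900, 0xFAFF),   # CJK Compatibility Ideographs
    ])

def tokenize(text):
    tokens = []
    for tok in text.split():            # whitespace tokenizer step
        if any(is_cjk(c) for c in tok):
            # re-emit each CJK char as its own token,
            # keeping any non-CJK runs within the token together
            run = ""
            for c in tok:
                if is_cjk(c):
                    if run:
                        tokens.append(run)
                        run = ""
                    tokens.append(c)
                else:
                    run += c
            if run:
                tokens.append(run)
        else:
            tokens.append(tok)
    return tokens

# e.g. tokenize("hello 中文 world") yields each ideograph as its own token
```

This per-character splitting is essentially what StandardTokenizer does for CJK ideographs out of the box, which is the point of the reply above.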