On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars, and re-tokenizes
> any CJK chars into one-token-per-char. This custom filter was written by
> someone other than me; it is open source, but I'm not sure whether it's actually
> in a public repo, or how well documented it is. I can put you in touch with
> the author if you'd like to ask. There may also be a more standard filter that
> does the same thing as the custom one I'm using?
>

You are describing what StandardTokenizer already does: it emits each CJK character as its own token, so you shouldn't need the custom whitespace-then-split setup.
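For anyone following along, the behavior in question (whitespace-delimited tokens, with runs of CJK ideographs re-split into one token per character) can be sketched without Lucene at all. This is a minimal illustration of the tokenization described above, not the actual custom filter or StandardTokenizer source; the class and method names are hypothetical, and the CJK check covers only the main Unicode ideograph blocks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the tokenization described in the thread:
// split on whitespace, but emit each CJK ideograph as its own token.
public class CjkSplitSketch {

    // Rough CJK test: checks the common unified-ideograph blocks only.
    static boolean isCjk(int cp) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(cp);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || b == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                // Whitespace ends the current non-CJK token, if any.
                if (cur.length() > 0) { tokens.add(cur.toString()); cur.setLength(0); }
            } else if (isCjk(cp)) {
                // A CJK ideograph always becomes a single-character token.
                if (cur.length() > 0) { tokens.add(cur.toString()); cur.setLength(0); }
                tokens.add(new String(Character.toChars(cp)));
            } else {
                cur.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // "solr" stays one token; the two kanji are split per character.
        System.out.println(tokenize("solr 検索"));
    }
}
```

Running the sketch on "solr 検索" yields the tokens [solr, 検, 索], which is the same shape of output you'd get from StandardTokenizer on mixed Latin/CJK input.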
