On 11/29/2010 5:43 PM, Robert Muir wrote:
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars and re-tokenizes
any CJK chars into one token per char. This custom filter was written by
someone other than me; it is open source, but I'm not sure whether it's actually
in a public repo, or how well documented it is. I can put you in touch with
the author and ask. There may also be a more standard filter than the
custom one I'm using that does the same thing?

You are describing what StandardTokenizer does.


Wait, StandardTokenizer already handles CJK and will put each CJK char into its own token? Really? I had no idea! Is that documented anywhere, or do you just have to look at the source to see it?

I had assumed that StandardTokenizer didn't have any special handling of characters in the CJK ranges, because that wasn't mentioned in the documentation -- but it does? That would be convenient, and wouldn't require my custom code.
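
In the meantime I suppose I can just check for myself, by running a mixed Latin/CJK string through StandardTokenizer and printing the tokens it emits. Something like this rough sketch (assuming a recent Lucene 3.x-style API; the class name CjkTokenCheck, the sample string, and the Version constant are just illustrative, not from any real project):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class CjkTokenCheck {
        public static void main(String[] args) throws Exception {
            // Run a mixed Latin/CJK string through StandardTokenizer and
            // print each emitted token, to see whether the CJK chars come
            // out one per token.
            StandardTokenizer tok = new StandardTokenizer(
                Version.LUCENE_31, new StringReader("hello 中文测试"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term.toString());
            }
            tok.end();
            tok.close();
        }
    }

If the four CJK chars each come out as their own token there, that would confirm what you're describing, and I can drop the custom filter.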

Jonathan
