On 11/29/2010 5:43 PM, Robert Muir wrote:
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars and re-tokenizes
any CJK chars into one token per char. This custom filter was written by
someone other than me; it is open source, but I'm not sure whether it's actually
in a public repo, or how well documented it is. I can put you in touch with
the author and ask. There may also be a more standard filter than the
custom one I'm using that does the same thing?

You are describing what StandardTokenizer does.


Wait, StandardTokenizer already handles CJK and will put each CJK char into its own token? Really? I had no idea! Is that documented anywhere, or do you just have to look at the source to see it?

I had assumed that StandardTokenizer didn't have any special handling of characters in the CJK ranges, because that wasn't mentioned in the documentation -- but it does? That would be convenient, and wouldn't require my custom code.
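
In the meantime I suppose I can just check for myself, by running a mixed Latin/CJK string through StandardTokenizer and printing the tokens it emits. Something like this rough sketch (assuming a recent Lucene 3.x-style API; the class name CjkTokenCheck, the sample string, and the Version constant are just illustrative, not from any real project):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class CjkTokenCheck {
        public static void main(String[] args) throws Exception {
            // Run a mixed Latin/CJK string through StandardTokenizer and
            // print each emitted token, to see whether the CJK chars come
            // out one per token.
            StandardTokenizer tok = new StandardTokenizer(
                Version.LUCENE_31, new StringReader("hello 中文测试"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term.toString());
            }
            tok.end();
            tok.close();
        }
    }

If the four CJK chars each come out as their own token there, that would confirm what you're describing, and I can drop the custom filter.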

Jonathan
