We had the same problem with our fields, so we wrote a Tokenizer using the icu4j library. It breaks tokens at script changes and handles each run according to its script and the configured BreakIterators. This works out very well, and we also add the "script" information to each token, so later filters can easily process them without re-inspecting the tokens to determine whether they are CJK (or Greek, or Russian, or Hebrew, and so on).
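In case it helps, here is a minimal, self-contained sketch of the script-boundary splitting idea, using only ICU4J's UScript class. This is not our actual Tokenizer (the class name, sample text, and grouping policy are made up for illustration); a real Lucene Tokenizer would implement incrementToken() and carry the script in a custom token attribute:

// Minimal sketch of script-change splitting with ICU4J (com.ibm.icu).
// Walk the code points and cut a token whenever the Unicode script changes.
import com.ibm.icu.lang.UScript;

public class ScriptRunSplitter {

    public static void main(String[] args) {
        split("Solr は Lucene ベースの search server です");
    }

    static void split(String text) {
        int runStart = 0;
        int runScript = UScript.INVALID_CODE;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int script = UScript.getScript(cp);
            // COMMON/INHERITED (spaces, punctuation, digits) never force a
            // cut; they simply stay attached to the current run.
            boolean neutral = (script == UScript.COMMON || script == UScript.INHERITED);
            if (!neutral) {
                if (runScript != UScript.INVALID_CODE && script != runScript) {
                    emit(text.substring(runStart, i), runScript);
                    runStart = i;
                }
                runScript = script;
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            emit(text.substring(runStart), runScript);
        }
    }

    static void emit(String run, int script) {
        // The script name is what we attach to each token, so downstream
        // filters can branch on it without re-inspecting the characters.
        String name = (script == UScript.INVALID_CODE) ? "UNKNOWN" : UScript.getName(script);
        System.out.println("[" + name + "] " + run.trim());
    }
}

Note that this naive version also cuts between Hiragana and Katakana; for Japanese you would probably want to treat Han/Hiragana/Katakana as one group.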
After this you can then put any filter (n-gram, dictionary segmenter, ...) behind it to make your tokens better; see the rough schema sketch after the quoted message below.

Jan

>-----Original Message-----
>From: ext Jacob Elder [mailto:jel...@locamoda.com]
>Sent: Monday, 29. November 2010 23:15
>To: solr-user@lucene.apache.org
>Subject: Good example of multiple tokenizers for a single field
>
>I am looking for a clear example of using more than one tokenizer for a
>single source field. My application has a single "body" field which until
>recently was all Latin characters, but we're now encountering both English
>and Japanese words in a single message. Obviously, we need to be using CJK
>in addition to WhitespaceTokenizerFactory.
>
>I've found some references to using copyFields or NGrams but I can't quite
>grasp what the whole solution would look like.
>
>--
>Jacob Elder
>@jelder
>(646) 535-3379
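P.S. If you would rather not write a custom Tokenizer, a rough idea of how the chain could be wired in schema.xml, assuming a Solr build that ships the ICU analysis module (analysis-extras). The field type name and n-gram sizes are just placeholders, and our custom script-change Tokenizer would take the place of the <tokenizer> element:

<!-- Rough sketch only: assumes the ICU analysis module is on the
     classpath; names and sizes are placeholders. -->
<fieldType name="text_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Script-aware segmentation out of the box. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Optional: bigram the tokens for recall on CJK runs. A plain
         NGramFilter n-grams Latin tokens too; a script-aware filter
         (keyed on the script attribute) would be more selective. -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
</fieldType>

Your existing "body" field would then just point at this type, e.g. <field name="body" type="text_mixed" indexed="true" stored="true"/>.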