We had the same problem with our fields, so we wrote a Tokenizer using the icu4j library. It breaks tokens at script changes and handles each run according to its script and the configured BreakIterators. This works out very well, and we also add the "script" information to each token, so later filters can easily process them without re-inspecting the tokens to determine whether they are CJK (or Greek, or Russian, or Hebrew, and so on).
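In case it helps, here is a minimal, self-contained sketch of the script-boundary splitting idea, using only ICU4J's UScript class. This is not our actual Tokenizer (the class name, sample text, and grouping policy are made up for illustration); a real Lucene Tokenizer would implement incrementToken() and carry the script in a custom token attribute:

// Minimal sketch of script-change splitting with ICU4J (com.ibm.icu).
// Walk the code points and cut a token whenever the Unicode script changes.
import com.ibm.icu.lang.UScript;

public class ScriptRunSplitter {

    public static void main(String[] args) {
        split("Solr は Lucene ベースの search server です");
    }

    static void split(String text) {
        int runStart = 0;
        int runScript = UScript.INVALID_CODE;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int script = UScript.getScript(cp);
            // COMMON/INHERITED (spaces, punctuation, digits) never force a
            // cut; they simply stay attached to the current run.
            boolean neutral = (script == UScript.COMMON || script == UScript.INHERITED);
            if (!neutral) {
                if (runScript != UScript.INVALID_CODE && script != runScript) {
                    emit(text.substring(runStart, i), runScript);
                    runStart = i;
                }
                runScript = script;
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            emit(text.substring(runStart), runScript);
        }
    }

    static void emit(String run, int script) {
        // The script name is what we attach to each token, so downstream
        // filters can branch on it without re-inspecting the characters.
        String name = (script == UScript.INVALID_CODE) ? "UNKNOWN" : UScript.getName(script);
        System.out.println("[" + name + "] " + run.trim());
    }
}

Note that this naive version also cuts between Hiragana and Katakana; for Japanese you would probably want to treat Han/Hiragana/Katakana as one group.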
After this you can then put any filter (n-gram, dictionary segmenter, ...) behind it to make your tokens better; see the rough schema sketch after the quoted message below.

Jan

>-----Original Message-----
>From: ext Jacob Elder [mailto:jel...@locamoda.com]
>Sent: Monday, 29. November 2010 23:15
>To: solr-user@lucene.apache.org
>Subject: Good example of multiple tokenizers for a single field
>
>I am looking for a clear example of using more than one tokenizer for a
>single source field. My application has a single "body" field which until
>recently was all Latin characters, but we're now encountering both English
>and Japanese words in a single message. Obviously, we need to be using CJK
>in addition to WhitespaceTokenizerFactory.
>
>I've found some references to using copyFields or NGrams but I can't quite
>grasp what the whole solution would look like.
>
>--
>Jacob Elder
>@jelder
>(646) 535-3379
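P.S. If you would rather not write a custom Tokenizer, a rough idea of how the chain could be wired in schema.xml, assuming a Solr build that ships the ICU analysis module (analysis-extras). The field type name and n-gram sizes are just placeholders, and our custom script-change Tokenizer would take the place of the <tokenizer> element:

<!-- Rough sketch only: assumes the ICU analysis module is on the
     classpath; names and sizes are placeholders. -->
<fieldType name="text_mixed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Script-aware segmentation out of the box. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Optional: bigram the tokens for recall on CJK runs. A plain
         NGramFilter n-grams Latin tokens too; a script-aware filter
         (keyed on the script attribute) would be more selective. -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
</fieldType>

Your existing "body" field would then just point at this type, e.g. <field name="body" type="text_mixed" indexed="true" stored="true"/>.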