Hi Daniel, As you know, Chinese and Japanese do not use spaces or any other delimiters to separate words. To overcome this problem, CJKTokenizer uses a bi-gram method, where a run of ideographic (i.e. Chinese) characters is made into tokens of two neighboring characters. So a run of five characters ABCDE will result in four tokens: AB, BC, CD, and DE.
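The bi-gram scheme can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual CJKTokenizer code:

```python
def bigrams(text):
    """Break a run of characters into overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# A run of five characters yields four overlapping tokens.
print(bigrams("ABCDE"))  # ['AB', 'BC', 'CD', 'DE']
```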
So a search for "BC" will hit this text, even if AB is one word and CD is another. That is, it increases the noise in the hits. I don't know how much of a real problem this is for Chinese, but for Japanese, my native language, it is a problem: a search for Kyoto will include false hits on documents that include Tokyoto, i.e. Tokyo prefecture. There is another method, called morphological analysis, which uses dictionaries and grammar rules to break text down into real words. You might want to consider this method. -kuro
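The Kyoto/Tokyoto false hit follows directly from the bi-gram scheme: Tokyoto (東京都) produces the bigrams 東京 and 京都, and the second one is exactly the query Kyoto (京都). A small sketch, again just illustrating the idea rather than the tokenizer itself:

```python
def bigrams(text):
    """Break a run of characters into overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# "Tokyoto" (東京都, Tokyo prefecture) yields the bigrams 東京 and 京都,
# so a bigram search for "Kyoto" (京都) falsely matches this document.
print(bigrams("東京都"))        # ['東京', '京都']
print("京都" in bigrams("東京都"))  # True
```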