On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:
> Tom,
> You should be using JapaneseAnalyzer (kuromoji).
> Neither CJK nor ICU tokenize at word boundaries.

Is JapaneseAnalyzer configurable with regard to what it does with non-Japanese text? If it's not, it won't work for me.
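If the kuromoji tokenizer can be used on its own, I would expect to be able to build a chain around it and keep control of the filters, something like this (a rough sketch with the factory names as I remember them, not a tested config):

  <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- kuromoji: segments Japanese at word boundaries; how it treats
           non-Japanese runs is exactly what I'm unsure about -->
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <!-- normalize half-width/full-width variants -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

What I can't tell is how much say I get over the non-Japanese tokens beyond whatever filters I stack after the tokenizer.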

We use a combination of tokenizers and filters because no complete analyzer does everything we require. My analysis chain (for our index, which is primarily English) has evolved over the last few years into its current form:

http://apaste.info/xa5
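That paste has the exact definition. Purely as an illustration of the pattern (the field type name and filter choices here are not the real ones), the general shape is a single tokenizer followed by a stack of individually chosen filters:

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- one tokenizer, then filters picked one by one for our needs -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>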

For our Japanese customer, we have recently changed from ICUFoldingFilter to ASCIIFoldingFilter plus ICUNormalizer2Filter, because they do not want us to fold accent marks on Japanese characters. I do not understand enough about Japanese to have an opinion on this, beyond the general "we should normalize EVERYTHING" approach. The data from this customer is not purely Japanese; there is a lot of English as well, and quite possibly small amounts of other languages.
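In schema terms the change is just swapping one filter line for two (sketched below, not copied from our real schema). As I understand it, ICUFoldingFilter's accent folding also strips the dakuten/handakuten marks from kana (e.g. ガ becomes カ), which is exactly what the customer does not want:

  <!-- before -->
  <filter class="solr.ICUFoldingFilterFactory"/>

  <!-- after: fold Latin accents only, leave Japanese characters alone -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- NFKC case folding: width and case normalization without accent removal -->
  <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>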

Thanks,
Shawn
