On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:
> Tom,
> You should be using JapaneseAnalyzer (kuromoji).
> Neither CJK nor ICU tokenize at word boundaries.
Is JapaneseAnalyzer configurable with regard to what it does with
non-Japanese text? If it's not, it won't work for me.
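
For what it's worth, the quickest way I know to see what each analyzer
actually does with mixed text is to run a sample through both and print
the tokens. This is only a rough sketch - the sample text, field name,
and class name are made up, and it assumes Lucene 4.7, so adjust the
Version constant for whatever release you are on:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CompareJapaneseTokenization {
  // Print every token the analyzer produces for the given text.
  static void dump(String label, Analyzer analyzer, String text)
      throws IOException {
    System.out.print(label + ": ");
    try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
      }
      ts.end();
    }
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    // Made-up sample: a Japanese sentence followed by some English.
    String text = "東京都に住んでいます This is English text";

    // Kuromoji segments the Japanese into dictionary words, while
    // CJKAnalyzer emits overlapping bigrams of adjacent CJK characters.
    // Latin-script words are tokenized as ordinary words by both.
    dump("kuromoji", new JapaneseAnalyzer(Version.LUCENE_47), text);
    dump("cjk     ", new CJKAnalyzer(Version.LUCENE_47), text);
  }
}
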
We use a combination of tokenizers and filters because no single
analyzer does everything we require. My analysis chain (for our index
that's primarily English) has evolved over the last few years into its
current form:
http://apaste.info/xa5
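
The general idea behind that chain: a "full analyzer" is really just one
tokenizer plus a stack of filters, and you can wire the same thing up
directly against the Lucene API. The components in this sketch are
placeholders for illustration, not my actual chain (that's what the
paste is for), and it again assumes Lucene 4.7:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.util.Version;

public class CustomChainSketch {
  // One tokenizer plus however many filters you need, wired in order.
  // The specific components below are placeholders.
  public static Analyzer build() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName,
                                                       Reader reader) {
        Tokenizer source = new ICUTokenizer(reader);
        TokenStream sink = new LowerCaseFilter(Version.LUCENE_47, source);
        sink = new ASCIIFoldingFilter(sink);
        return new TokenStreamComponents(source, sink);
      }
    };
  }
}
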
For our Japanese customer, we have recently changed from
ICUFoldingFilter to ASCIIFoldingFilter and ICUNormalizer2Filter, because
they do not want us to fold accent marks on Japanese characters. I do
not understand enough about Japanese to have an opinion on this, beyond
the general "we should normalize EVERYTHING" approach. The data from
this customer is not purely Japanese - there is a lot of English as
well, and quite possibly a small amount of text in other languages.
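
A quick way to compare the two approaches is to push a term with a
Latin accent and a kana character with a voiced sound mark (dakuten)
through both filter stacks and look at the output. Sketch only: the
sample text and class name are made up, it assumes Lucene 4.7, and my
understanding of what the ICU foldings do to kana voicing marks could
be wrong:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class FoldingComparison {
  static void dump(String label, TokenStream ts) throws IOException {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    System.out.print(label + ": ");
    ts.reset();
    while (ts.incrementToken()) {
      System.out.print("[" + term.toString() + "] ");
    }
    ts.end();
    ts.close();
    System.out.println();
  }

  static Tokenizer tok(String text) {
    return new WhitespaceTokenizer(Version.LUCENE_47, new StringReader(text));
  }

  public static void main(String[] args) throws IOException {
    // Made-up sample: accented Latin plus kana with a dakuten.
    String text = "café ガイド";

    // Old chain: ICUFoldingFilter folds the Latin accent away, but as
    // far as I understand the UTR#30 foldings it also strips the
    // dakuten, e.g. turning ガ into カ.
    dump("icufolding   ", new ICUFoldingFilter(tok(text)));

    // New chain: ASCIIFoldingFilter only touches Latin characters, and
    // ICUNormalizer2Filter (default nfkc_cf) handles case folding and
    // width normalization while leaving the dakuten alone.
    dump("ascii+nfkc_cf",
         new ICUNormalizer2Filter(new ASCIIFoldingFilter(tok(text))));
  }
}

If the old chain really does strip the voicing marks, that would explain
why this customer objected to it.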
Thanks,
Shawn