On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:
> Tom,
> You should be using JapaneseAnalyzer (kuromoji).
> Neither CJK nor ICU tokenize at word boundaries.

Is JapaneseAnalyzer configurable with regard to what it does with non-Japanese text? If it's not, it won't work for me.
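If the kuromoji tokenizer can be used on its own, I would expect to be able to build a chain around it and keep control of the filters, something like this (a rough sketch with the factory names as I remember them, not a tested config):

  <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- kuromoji: segments Japanese at word boundaries; how it treats
           non-Japanese runs is exactly what I'm unsure about -->
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <!-- normalize half-width/full-width variants -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

What I can't tell is how much say I get over the non-Japanese tokens beyond whatever filters I stack after the tokenizer.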

We use a combination of tokenizers and filters because no complete analyzer does everything we require. My analysis chain (for our index, which is primarily English) has evolved over the last few years into its current form:

http://apaste.info/xa5
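That paste has the exact definition. Purely as an illustration of the pattern (the field type name and filter choices here are not the real ones), the general shape is a single tokenizer followed by a stack of individually chosen filters:

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- one tokenizer, then filters picked one by one for our needs -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>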

For our Japanese customer, we have recently changed from ICUFoldingFilter to ASCIIFoldingFilter plus ICUNormalizer2Filter, because they do not want us to fold accent marks on Japanese characters. I do not understand enough about Japanese to have an opinion on this, beyond the general "we should normalize EVERYTHING" approach. The data from this customer is not purely Japanese; there is a lot of English as well, and quite possibly small amounts of other languages.
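In schema terms the change is just swapping one filter line for two (sketched below, not copied from our real schema). As I understand it, ICUFoldingFilter's accent folding also strips the dakuten/handakuten marks from kana (e.g. ガ becomes カ), which is exactly what the customer does not want:

  <!-- before -->
  <filter class="solr.ICUFoldingFilterFactory"/>

  <!-- after: fold Latin accents only, leave Japanese characters alone -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <!-- NFKC case folding: width and case normalization without accent removal -->
  <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>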

Thanks,
Shawn
