My company is setting up a system for a customer from Japan. We have an existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about the problems they are encountering with search, we have determined that some of them happen because ICUTokenizer splits on *any* script change, including changes between the different Japanese scripts (kanji, hiragana, and katakana) -- so a word that mixes kanji and hiragana gets broken at every boundary between them.

Knowing the risk of this being an XY problem, here's my question: Can someone help me develop a rule file for the ICU Tokenizer that will *not* split when the text changes from one Japanese script to another, but will still split on other script changes?
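
Here's roughly what I'm picturing, pieced together from the ICUTokenizerFactory javadoc, which says the rulefiles attribute takes a comma-separated list of code:rulefile pairs keyed by four-letter ISO 15924 script codes. The "Jpan" key and the file name below are guesses on my part, and I'm not sure a per-script rule file can even keep text together across a script boundary, so treat this as a sketch of the intent rather than something I've gotten working:

    <fieldType name="text_icu_ja" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- Hypothetical: attach a custom break-rule file for Japanese text.
             The "Jpan" script key and the file name are my guesses. -->
        <tokenizer class="solr.ICUTokenizerFactory"
                   rulefiles="Jpan:Japanese-keep-scripts-together.rbbi"/>
        <!-- ... rest of the existing filter chain from the paste above ... -->
      </analyzer>
    </fieldType>

The rule file itself would presumably need to treat kanji, hiragana, and katakana as a single letter class, modeled loosely on the Latin-break-only-on-whitespace.rbbi example that ships with the Lucene ICU module:

    # Hypothetical RBBI sketch: keep runs of Japanese text together instead of
    # splitting wherever the script changes between kanji, hiragana, and katakana.
    !!forward;

    $Japanese    = [[:Han:][:Hiragana:][:Katakana:]];
    $OtherLetter = [[:Letter:] - $Japanese];
    $Number      = [[:Number:]];

    # Default rule status {0} = RBBI.WORD_NONE, i.e. not emitted as a token.
    # {200} = RBBI.WORD_LETTER, {100} = RBBI.WORD_NUMBER
    $Japanese+      {200};
    $OtherLetter+   {200};
    $Number+        {100};

Does that look like the right direction, or does the tokenizer split runs by script before a custom rule file ever gets a say?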

Thanks,
Shawn
