My company is setting up a system for a customer from Japan. We have an existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about the problems they are encountering with search, we have determined that some of them happen because ICUTokenizer splits on *any* script change, including changes between the different Japanese scripts (kanji, hiragana, and katakana) -- so a word that mixes kanji and hiragana gets broken at every boundary between them.

Knowing the risk of this being an XY problem, here's my question: Can someone help me develop a rule file for the ICU Tokenizer that will *not* split when the text changes from one Japanese script to another, but will still split on other script changes?
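
Here's roughly what I'm picturing, pieced together from the ICUTokenizerFactory javadoc, which says the rulefiles attribute takes a comma-separated list of code:rulefile pairs keyed by four-letter ISO 15924 script codes. The "Jpan" key and the file name below are guesses on my part, and I'm not sure a per-script rule file can even keep text together across a script boundary, so treat this as a sketch of the intent rather than something I've gotten working:

    <fieldType name="text_icu_ja" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- Hypothetical: attach a custom break-rule file for Japanese text.
             The "Jpan" script key and the file name are my guesses. -->
        <tokenizer class="solr.ICUTokenizerFactory"
                   rulefiles="Jpan:Japanese-keep-scripts-together.rbbi"/>
        <!-- ... rest of the existing filter chain from the paste above ... -->
      </analyzer>
    </fieldType>

The rule file itself would presumably need to treat kanji, hiragana, and katakana as a single letter class, modeled loosely on the Latin-break-only-on-whitespace.rbbi example that ships with the Lucene ICU module:

    # Hypothetical RBBI sketch: keep runs of Japanese text together instead of
    # splitting wherever the script changes between kanji, hiragana, and katakana.
    !!forward;

    $Japanese    = [[:Han:][:Hiragana:][:Katakana:]];
    $OtherLetter = [[:Letter:] - $Japanese];
    $Number      = [[:Number:]];

    # Default rule status {0} = RBBI.WORD_NONE, i.e. not emitted as a token.
    # {200} = RBBI.WORD_LETTER, {100} = RBBI.WORD_NUMBER
    $Japanese+      {200};
    $OtherLetter+   {200};
    $Number+        {100};

Does that look like the right direction, or does the tokenizer split runs by script before a custom rule file ever gets a say?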

Thanks,
Shawn
