My company is setting up a system for a customer from Japan. We have an
existing system that primarily handles English.
Here's my general text analysis chain:
http://apaste.info/xa5
After talking to the customer about problems they are encountering with
search, we have determined that some of the problems are caused because
ICUTokenizer splits on *any* script change, including changes between
the different Japanese scripts (Kanji, Hiragana, and Katakana).
Knowing the risk of this being an XY problem, here's my question: Can
someone help me develop a rule file for the ICU Tokenizer that will
*not* split when the text changes from one Japanese script to another,
but will still split on other script changes?
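In case it helps frame the question, here is roughly how I expect the
finished setup to be wired in, based on the rulefiles attribute
described for ICUTokenizerFactory. The rule file name and the "Jpan"
script key are placeholders I made up; the contents of the .rbbi file
are exactly the part I need help writing:

    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"
                 rulefiles="Jpan:Japanese.rbbi"/>
    </analyzer>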
Thanks,
Shawn