No specific answers, but have you read the detailed CJK article collection: http://discovery-grindstone.blogspot.ca/ . There is a lot of information there.
Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Thu, Apr 3, 2014 at 12:19 AM, Shawn Heisey <s...@elyograg.org> wrote:
> My company is setting up a system for a customer from Japan. We have an
> existing system that handles primarily English.
>
> Here's my general text analysis chain:
>
> http://apaste.info/xa5
>
> After talking to the customer about problems they are encountering with
> search, we have determined that some of the problems are caused because
> ICUTokenizer splits on *any* character set change, including changes
> between different Japanese character sets.
>
> Knowing the risk of this being an XY problem, here's my question: Can
> someone help me develop a rule file for the ICU Tokenizer that will *not*
> split when the character set changes from one Japanese character set to
> another Japanese character set, but still split on other character set
> changes?
>
> Thanks,
> Shawn
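
One pointer that may help: Solr's ICUTokenizerFactory does accept custom RBBI (rule-based break iterator) files via its `rulefiles` attribute, keyed by ISO 15924 script code. A minimal sketch, assuming the field type name and the .rbbi filename are placeholders you'd supply yourself:

```xml
<!-- Sketch only: "text_ja" and "japanese.rbbi" are hypothetical names.
     Each entry maps a four-letter script code to a rule file in conf/. -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Hira:japanese.rbbi,Kana:japanese.rbbi,Hani:japanese.rbbi"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One caveat: per-script rule files govern segmentation *within* a script run, while the split-on-script-change behavior you describe happens when the tokenizer divides the input into script runs beforehand, so whether this alone keeps mixed Hiragana/Katakana/Kanji text together depends on your Lucene version and may need testing against your data.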