No specific answers, but have you read the detailed CJK article
collection at http://discovery-grindstone.blogspot.ca/ ? There is a
lot of information there.
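For reference, Solr's ICUTokenizerFactory does accept a rulefiles attribute that maps ISO 15924 script codes to custom ICU break-iterator (RBBI) rule files. An untested sketch of that mechanism (the file name, script code, and rules below are illustrative assumptions, not a verified fix):

```
<!-- schema.xml: attach a custom RBBI rule file to the ICU tokenizer -->
<tokenizer class="solr.ICUTokenizerFactory"
           rulefiles="Jpan:Japanese.rbbi"/>
```

```
# Japanese.rbbi (sketch): treat Han, Hiragana, and Katakana as one class
$Japanese = [[:Han:][:Hiragana:][:Katakana:]];
# keep runs of mixed Japanese scripts together as a single token
$Japanese+;
```

One caveat: the ICU tokenizer first divides input into same-script runs before applying per-script rules, so per-script tailoring alone may not prevent splits exactly at script boundaries. Worth testing against real queries.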

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Thu, Apr 3, 2014 at 12:19 AM, Shawn Heisey <s...@elyograg.org> wrote:
> My company is setting up a system for a customer from Japan.  We have an
> existing system that handles primarily English.
>
> Here's my general text analysis chain:
>
> http://apaste.info/xa5
>
> After talking to the customer about problems they are encountering with
> search, we have determined that some of the problems are caused because
> ICUTokenizer splits on *any* character set change, including changes between
> different Japanese character sets.
>
> Knowing the risk of this being an XY problem, here's my question: Can
> someone help me develop a rule file for the ICU Tokenizer that will *not*
> split when the character set changes from one of the Japanese character sets
> to another Japanese character set, but still split on other character set
> changes?
>
> Thanks,
> Shawn
>
