Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither the CJK nor the ICU tokenizer segments Japanese at word boundaries.
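If it helps, a minimal fieldType along those lines, loosely based on the
text_ja example that ships with the stock Solr example schema, looks roughly
like this (the stoptags path and the exact set of filters are just a starting
point, adjust to taste):

  <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="false">
    <analyzer>
      <!-- kuromoji morphological tokenizer; "search" mode also splits compounds -->
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <!-- reduce inflected verbs/adjectives to their base form -->
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <!-- drop tokens by part of speech (particles, etc.) -->
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
      <!-- normalize full-width/half-width variants -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <!-- normalize katakana spelling variants ending in a long sound mark -->
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>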

On 04/02/2014 10:33 AM, Tom Burton-West wrote:
Hi Shawn,

I'm not sure I understand the problem, or why you need to solve it at the
ICUTokenizer level rather than at the CJKBigramFilter level.
Can you perhaps give a few examples of the problem?

Have you looked at the flags for the CJKBigramFilter?
You can tell it to make bigrams from the different Japanese character sets.  For
example, the config given in the Javadocs tells it to make bigrams across three
of the different Japanese character sets.  (Is the issue related to Romaji?)

  <filter class="solr.CJKBigramFilterFactory"
        han="true" hiragana="true"
        katakana="true" hangul="true" outputUnigrams="false" />



http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html
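For context, the filter usually sits at the end of an analyzer chain, something
like the sketch below (the field type name and surrounding filters are only
illustrative; since your chain uses ICUTokenizer, I've used that here):

  <fieldType name="text_cjk_bigram" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- normalize full-width/half-width variants before bigramming -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"
              han="true" hiragana="true"
              katakana="true" hangul="true"
              outputUnigrams="false"/>
    </analyzer>
  </fieldType>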

Tom


On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey <s...@elyograg.org> wrote:

My company is setting up a system for a customer from Japan.  We have an
existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about the problems they are encountering with
search, we have determined that some of the problems occur because
ICUTokenizer splits on *any* character set change, including changes
between the different Japanese character sets.

Knowing the risk that this is an XY problem, here's my question: Can
someone help me develop a rule file for the ICU Tokenizer that will *not*
split when the character set changes from one of the Japanese character
sets to another, but will still split on other character set changes?
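
For reference, the hook I'm looking at is ICUTokenizerFactory's rulefiles
attribute, which maps a four-letter ISO 15924 script code to a custom RBBI
break-rules file, something like this (the file names are just placeholders,
not rules I actually have working):

  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>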

Thanks,
Shawn


