Japanese Query Unexpectedly Misses

Stephen Lewis Bianamara Thu, 17 Oct 2019 10:45:12 -0700

Hi SOLR Community,

I have an example of a basic Japanese indexing/recall scenario which I am
trying to support, but cannot get to work.


The scenario is: I would like for 日本人 (Japanese Person) to be matched by
either 日本 (Japan) or 人 (Person). Currently, I am not seeing this work. My
Japanese text field currently has the tokenizer

> <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>
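For context, a minimal sketch of a field type built around that tokenizer might look like the following. The field type name text_ja_example and the single lowercase filter are placeholders for illustration only and are not copied from my actual schema.

```xml
<!-- Hypothetical field type; name and filter chain are illustrative assumptions -->
<fieldType name="text_ja_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Same tokenizer line as quoted above -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```
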
What surprises me most is that I thought this was what mode="search"
was made for. From the docs, I see

> Use search mode to get a noun-decompounding effect useful for search.
> search mode improves segmentation for search at the expense of
> part-of-speech accuracy
>

I ran the text through analysis, and I can see that the tokenizer is not
generating three tokens (one for Japan, one for Person, and one for
Japanese Person) as I would have expected. Interestingly, the tokenizer
does recognize that 日本人 is a compound noun, so it seems that it should
decompound it (see image below).

Can you help me figure out if my configuration is incorrect, or if there is
some way to fix this scenario?

Thanks!
Stephen


[image: image.png]
