Re: Stemming for Chinese and Japanese

T. Kuro Kurosaka Tue, 03 Jun 2014 12:29:02 -0700

On 05/20/2014 11:31 AM, Geepalem wrote:

Hi,


Which filter should be used to implement stemming for Chinese and Japanese
language field types?
For English, I have used <filter class="solr.SnowballPorterFilterFactory"
language="English" /> and it's working fine.

What do you mean by "working fine"?
Try analyzing this with the text_en field type:
単語は何個ありますか？
This is a Japanese sentence meaning "How many tokens are there?", and the
correct token count is 5, 6 or 7, depending on how you count some compound words.
With text_en, you will see 10 instead.

Try using text_ja. You will see 7.
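
For reference, here is a sketch of what a text_ja field type typically looks
like. It is modeled on the example schema.xml shipped with Solr 4.x, so treat
the attribute values and the resource paths (lang/stoptags_ja.txt,
lang/stopwords_ja.txt) as assumptions and compare them with your own schema.
The closest thing to "stemming" here is JapaneseBaseFormFilterFactory, which
reduces inflected verbs and adjectives to their base form, and
JapaneseKatakanaStemFilterFactory, which normalizes a trailing long-vowel mark
on katakana words.

```xml
<!-- Sketch of a Japanese field type; values are assumptions, check your schema.xml -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Morphological tokenizer (Kuromoji); does not rely on spaces -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <!-- Replace inflected verbs/adjectives with their base form -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- Drop tokens by part-of-speech tag, e.g. particles -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
            tags="lang/stoptags_ja.txt"/>
    <!-- Normalize full-width Latin and half-width katakana -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_ja.txt"/>
    <!-- Strip a trailing long-vowel mark from katakana words -->
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The exact token count you see in the analysis screen depends on which of these
filters are enabled, since the stop filters remove some tokens.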

I don't recommend using text_cjk for Chinese, Japanese and Korean.
They are *very* different languages, and you should use a different
analyzer for each.
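
As an illustration of a separate analyzer for Chinese, here is a minimal
sketch of a Simplified Chinese field type. The name text_zh is hypothetical,
and the sketch assumes the smartcn analyzer jars from Solr's analysis-extras
contrib are on the classpath and that your Solr version provides
solr.HMMChineseTokenizerFactory. Note that Chinese has essentially no
inflection, so there is no stemming filter here; word segmentation is the part
that matters.

```xml
<!-- Hypothetical field type name; assumes analysis-extras (smartcn) is installed -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Statistical (HMM-based) word segmentation for Simplified Chinese -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- Normalize full-width Latin characters -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lowercase any embedded Latin-script words -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```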

StandardTokenizer just doesn't work for Chinese and Japanese at all since
there are no spaces between words in these languages.

Kuro
