On 05/20/2014 11:31 AM, Geepalem wrote:
Hi,
What is the filter to be used to implement stemming for Chinese and Japanese
language field types?
For English, I have used <filter class="solr.SnowballPorterFilterFactory"
language="English" /> and it's working fine.
What do you mean by "working fine"?
Try analyzing this with text_en field type:
単語は何個ありますか?
This is a Japanese sentence meaning "How many tokens are there?", and the
correct answer is 5, 6 or 7, depending on how you count some compound words.
With text_en, you should be seeing 10 instead.
Try using text_ja. You will see 7.
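
For reference, the text_ja field type in the example schema that ships with
Solr 4.x looks roughly like the sketch below. Treat it as an illustration
rather than a definitive configuration: the stop word and stop tag file paths
are the ones used by the example config and are an assumption about your
setup, so compare against your own schema.xml. The closest things to stemming
here are JapaneseBaseFormFilterFactory, which reduces inflected verbs and
adjectives to their base form, and JapaneseKatakanaStemFilterFactory, which
strips the trailing long-vowel mark from katakana words.

```xml
<!-- Sketch of a Japanese field type, modeled on the Solr example schema.
     The lang/*.txt paths are assumptions; adjust them to your conf directory. -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Morphological tokenizer (Kuromoji); no whitespace needed to find words -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <!-- Reduce inflected verbs and adjectives to their base (dictionary) form -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- Drop tokens by part of speech, e.g. particles -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
            tags="lang/stoptags_ja.txt"/>
    <!-- Normalize full-width Latin and half-width katakana variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.JapaneseStopFilterFactory" ignoreCase="true"
            words="lang/stopwords_ja.txt"/>
    <!-- Strip the trailing long-vowel mark from katakana words of 4+ characters -->
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

You can paste the sentence above into the Analysis screen of the Solr admin UI
with this field type selected to see what each stage produces.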
I don't recommend using text_cjk for Chinese, Japanese and Korean.
They are *very* different languages, and you should be using a different
analyzer for each.
StandardTokenizer just doesn't work for Chinese and Japanese at all since
there are no spaces between words in these languages.
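
For Chinese, one option is the smartcn word segmenter from the
analysis-extras contrib. The following is only a minimal sketch, assuming
Solr 4.8 or later with the lucene-analyzers-smartcn jar on the classpath; the
field type name text_zh is hypothetical and not part of the example schema.
Chinese words are not inflected the way English words are, so this sketch has
no stemming filter; word segmentation is the step that matters.

```xml
<!-- Sketch of a Chinese field type. "text_zh" is a made-up name, and this
     assumes the smartcn analyzer jar from contrib/analysis-extras is loaded. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Statistical (HMM-based) word segmentation for Simplified Chinese -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- Normalize full-width Latin characters and digits -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```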
Kuro