I would like to check: is it possible to use JiebaTokenizerFactory to index
multilingual documents in Solr?

I have found that JiebaTokenizerFactory handles Chinese characters better
than HMMChineseTokenizerFactory.

However, for English text, JiebaTokenizerFactory splits words in the wrong
places. For example, it cuts the word "water" as follows:
*w|at|er*

This means that Solr will search for the three separate tokens "w", "at" and
"er" instead of the whole word "water".
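
To illustrate the setup, the field type in question is along these lines. This is only a minimal sketch: the tokenizer class path and the segMode attribute are placeholders for whatever the third-party Jieba/Solr integration in use actually provides.

  <!-- Minimal sketch; the tokenizer class path and segMode value are
       placeholders for the third-party Jieba integration in use -->
  <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="analyzer.solr.JiebaTokenizerFactory" segMode="SEARCH"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>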

Is there any way to solve this problem, other than using separate fields for
English and Chinese text?
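
For comparison, the separate-field workaround I would prefer to avoid would look roughly like the sketch below: an English field type plus a copyField, so English text is tokenized by StandardTokenizerFactory while the main field keeps Jieba (field and type names here are purely illustrative).

  <!-- Illustrative only: separate English and Chinese analysis chains -->
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="content"    type="text_zh" indexed="true" stored="true"/>
  <field name="content_en" type="text_en" indexed="true" stored="false"/>
  <copyField source="content" dest="content_en"/>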

Regards,
Edwin
