In a word, no. The CJK languages generally don't delimit words
with whitespace, so a tokenizer that treats whitespace as its
default word boundary simply won't work well for Chinese.
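
If you do want Chinese-aware analysis, something along these
lines in schema.xml is the usual starting point (just a sketch,
and the field type name is made up; HMMChineseTokenizerFactory
lives in the analysis-extras contrib, so the smartcn jars need
to be on your classpath):

  <!-- Assumed fieldType for Chinese text; adjust filters to taste. -->
  <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- Segments Han text into words using a hidden Markov model -->
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      <!-- Lowercases any embedded Latin-script tokens -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

For the bilingual case, many people index the same content into
two fields (one analyzed for English, one for Chinese) and search
across both, rather than hoping a single tokenizer handles both
languages well.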

Have you tried it? It seems a simple test would get you
an answer faster.

Best,
Erick

On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi,
>
> Would like to check: will StandardTokenizerFactory work well for indexing
> both English and Chinese (bilingual) documents, or do we need a tokenizer
> that is customised for Chinese (e.g. HMMChineseTokenizerFactory)?
>
>
> Regards,
> Edwin
>
