For what it's worth, we've had good luck using the ICUTokenizer and associated filters. A native Chinese speaker here at the office gave us an enthusiastic thumbs up on our Chinese search results. Your mileage may vary of course.
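In case it helps, here's roughly the shape of the field type we use. This is a sketch from memory rather than our exact config: the field type name is made up and the filter choice is just one common option, so treat it as a starting point. Note that the ICU classes ship in the analysis-extras contrib, so those jars need to be on your classpath.

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- ICUTokenizerFactory segments on Unicode word-break rules,
           which handles CJK text that has no whitespace between words -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- ICUFoldingFilterFactory normalizes case and full-width/half-width
           forms, useful when English and Chinese are mixed -->
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

One nice property is that the same analyzer handles both languages, so English and Chinese can live in the same field.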
On Wed, Sep 23, 2015 at 11:04 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> In a word, no. The CJK languages in general don't
> necessarily tokenize on whitespace, so using a tokenizer
> that uses whitespace as its default simply won't
> work.
>
> Have you tried it? It seems a simple test would get you
> an answer faster.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Would like to check, will StandardTokenizerFactory work well for indexing
> > both English and Chinese (bilingual) documents, or do we need tokenizers
> > that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> >
> > Regards,
> > Edwin