For what it's worth, we've had good luck using the ICUTokenizer and associated filters. A native Chinese speaker here at the office gave us an enthusiastic thumbs up on our Chinese search results. Your mileage may vary of course.
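In case it helps, here's roughly the shape of the field type we use. This is a sketch from memory rather than our exact config: the field type name is made up and the filter choice is just one common option, so treat it as a starting point. Note that the ICU classes ship in the analysis-extras contrib, so those jars need to be on your classpath.

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- ICUTokenizerFactory segments on Unicode word-break rules,
           which handles CJK text that has no whitespace between words -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- ICUFoldingFilterFactory normalizes case and full-width/half-width
           forms, useful when English and Chinese are mixed -->
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

One nice property is that the same analyzer handles both languages, so English and Chinese can live in the same field.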
On Wed, Sep 23, 2015 at 11:04 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> In a word, no. The CJK languages in general don't
> necessarily tokenize on whitespace, so using a tokenizer
> that uses whitespace as its default simply won't
> work.
>
> Have you tried it? It seems a simple test would get you
> an answer faster.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Would like to check, will StandardTokenizerFactory work well for indexing
> > both English and Chinese (bilingual) documents, or do we need tokenizers
> > that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> >
> > Regards,
> > Edwin