Hi Erick,

Yes, I did try the StandardTokenizer, and it seems to work well for both English and Chinese words. It also gives faster indexing and faster response times during queries. The only issue is that the StandardTokenizer, which tokenizes on whitespace, cuts the Chinese text into individual characters, one character per token, instead of into phrases, which only a custom Chinese tokenizer can produce.
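Something like the following standalone Lucene snippet shows the difference (a minimal sketch, not my actual index setup; it assumes the lucene-analyzers-common and lucene-analyzers-smartcn jars are on the classpath, and the class name and sample text are just for illustration):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical comparison class, only to show the token output of the two analyzers.
public class TokenizerComparison {

    // Print every token the analyzer produces for the given text.
    static void printTokens(Analyzer analyzer, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream("content", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "Solr search engine 搜索引擎";

        // StandardAnalyzer (same tokenizer as StandardTokenizerFactory):
        // English words come out whole, but each Chinese character becomes its own token.
        printTokens(new StandardAnalyzer(), text);

        // SmartChineseAnalyzer is built on HMMChineseTokenizer:
        // the Chinese characters are grouped into word/phrase tokens instead.
        printTokens(new SmartChineseAnalyzer(), text);
    }
}

The per-character output from the first analyzer is what I meant above; the smartcn output keeps the Chinese words together.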
I tried the HMMChineseTokenizer too, and its indexing and querying times are slower than the StandardTokenizer's. It works well for the Chinese words, but there are a lot of mismatches for the English words.

Regards,
Edwin

On 23 September 2015 at 23:04, Erick Erickson <erickerick...@gmail.com> wrote:

> In a word, no. The CJK languages in general don't
> necessarily tokenize on whitespace, so using a tokenizer
> that splits on whitespace by default simply won't work.
>
> Have you tried it? It seems a simple test would get you
> an answer faster.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Would like to check: will the StandardTokenizerFactory work well for
> > indexing both English and Chinese (bilingual) documents, or do we need
> > tokenizers that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> >
> > Regards,
> > Edwin
> >