Hi Erick,

Yes, I did try the StandardTokenizer, and it seems to work well for both English and Chinese words. It also gives faster indexing and faster response times during queries. The only issue is that the StandardTokenizer, which tokenizes on whitespace, cuts the Chinese text into individual characters, one character per token, instead of into phrases, which only a custom Chinese tokenizer can produce.
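Something like the following standalone Lucene snippet shows the difference (a minimal sketch, not my actual index setup; it assumes the lucene-analyzers-common and lucene-analyzers-smartcn jars are on the classpath, and the class name and sample text are just for illustration):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical comparison class, only to show the token output of the two analyzers.
public class TokenizerComparison {

    // Print every token the analyzer produces for the given text.
    static void printTokens(Analyzer analyzer, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream("content", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "Solr search engine 搜索引擎";

        // StandardAnalyzer (same tokenizer as StandardTokenizerFactory):
        // English words come out whole, but each Chinese character becomes its own token.
        printTokens(new StandardAnalyzer(), text);

        // SmartChineseAnalyzer is built on HMMChineseTokenizer:
        // the Chinese characters are grouped into word/phrase tokens instead.
        printTokens(new SmartChineseAnalyzer(), text);
    }
}

The per-character output from the first analyzer is what I meant above; the smartcn output keeps the Chinese words together.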
I tried the HMMChineseTokenizer too, and its indexing and querying times are slower than the StandardTokenizer's. It works well for the Chinese words, but there are a lot of mismatches for the English words.

Regards,
Edwin

On 23 September 2015 at 23:04, Erick Erickson <erickerick...@gmail.com> wrote:

> In a word, no. The CJK languages in general don't
> necessarily tokenize on whitespace, so using a tokenizer
> that splits on whitespace by default simply won't work.
>
> Have you tried it? It seems a simple test would get you
> an answer faster.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Would like to check: will the StandardTokenizerFactory work well for
> > indexing both English and Chinese (bilingual) documents, or do we need
> > tokenizers that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> >
> > Regards,
> > Edwin
> >