Hi Tomoko,
Thank you for your advice. Will look into the Java source code of the Token
Filters.
Regards,
Edwin
On 26 October 2015 at 13:16, Tomoko Uchida wrote:
> Will try to see if there is any way to manage it with only a single field?
Of course you can try to create a custom Tokenizer or TokenFilter that
perfectly meets your needs.
I would copy the source code of EdgeNGramTokenFilter and modify the
incrementToken() method. That seems like a reasonable approach to me.
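As a rough, untested sketch (the class name, the ASCII check and the gram-size
handling below are just made up for illustration), the filter could emit edge
n-grams only for ASCII terms and pass everything else, e.g. Chinese terms,
through unchanged:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Untested sketch: emits edge n-grams (minGram..maxGram) for ASCII terms
 * only, and passes all other terms (e.g. Chinese) through unchanged.
 */
public final class AsciiOnlyEdgeNGramFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  private final int minGram;
  private final int maxGram;

  private char[] curTerm;  // term currently being expanded into n-grams
  private int curLen;
  private int curGram;     // length of the next gram to emit

  public AsciiOnlyEdgeNGramFilter(TokenStream input, int minGram, int maxGram) {
    super(input);
    this.minGram = minGram;
    this.maxGram = maxGram;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (true) {
      if (curTerm == null) {
        if (!input.incrementToken()) {
          return false;
        }
        // Non-ASCII terms (e.g. Chinese) and terms shorter than minGram
        // are passed through as-is.
        if (!isAscii(termAtt.buffer(), termAtt.length())
            || termAtt.length() < minGram) {
          return true;
        }
        curLen = termAtt.length();
        curTerm = new char[curLen];
        System.arraycopy(termAtt.buffer(), 0, curTerm, 0, curLen);
        curGram = minGram;
      }
      if (curGram <= maxGram && curGram <= curLen) {
        termAtt.copyBuffer(curTerm, 0, curGram);
        if (curGram > minGram) {
          // stack all grams of one term at the same position
          posIncAtt.setPositionIncrement(0);
        }
        curGram++;
        return true;
      }
      curTerm = null;  // done with this term, fetch the next one
    }
  }

  private static boolean isAscii(char[] buf, int len) {
    for (int i = 0; i < len; i++) {
      if (buf[i] > 127) {
        return false;
      }
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    curTerm = null;
  }
}

To use it from schema.xml you would also need a small TokenFilterFactory
subclass that instantiates this filter.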
Hi Tomoko,
Thank you for your recommendation.
I wasn't in favour of using copyField at first to have 2 separate fields
for English and Chinese tokens, as it not only increases the index size,
but also slows down the performance of both indexing and querying.
Will try to see if there is any way to manage it with only a single field?
Hi, Edwin,
> This means it is better to have 2 separate fields for English and Chinese
> words?
Yes. I mean,
1. Define FIELD_1 that uses HMMChineseTokenizerFactory to extract English
and Chinese tokens.
2. Define FIELD_2 that uses PatternTokenizerFactory to extract English
tokens and EdgeNGramFilter to generate prefix tokens for partial matching.
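For example, something like this in schema.xml (an untested sketch; the
field and type names and the gram sizes are just placeholders, and
HMMChineseTokenizerFactory needs the analysis-extras contrib on the
classpath):

<fieldType name="text_cjk_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- keep only runs of Latin letters and digits as tokens -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="[A-Za-z0-9]+" group="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-grams at query time, so query terms match the indexed prefixes -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="[A-Za-z0-9]+" group="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="content"    type="text_cjk_en"  indexed="true" stored="true"/>
<field name="content_en" type="text_en_edge" indexed="true" stored="false"/>
<copyField source="content" dest="content_en"/>

At query time you would then search both fields, e.g. with edismax and
qf=content content_en.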
Hi Tomoko,
Thank you for your reply.
> If you need to perform partial (prefix) match for **only English words**,
> you can create a separate field that keeps only English words (I've never
> tried that, but it might be possible with PatternTokenizerFactory or other
> tokenizer/filter chains...) and a
>> I have rich-text documents that are in both English and Chinese, and
>> currently I have EdgeNGramFilterFactory enabled during indexing, as I need
>> it for partial matching for English words. But this means it will also
>> break up each of the Chinese characters into different tokens.