Hi Tomoko,

Thank you for your advice. Will look into the Java source code of the token filters.

Regards,
Edwin
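For reference, here is a minimal sketch of the kind of custom filter and factory Tomoko describes below. The class names are invented and the script check is simplified; the filter emits edge n-grams for Latin-script tokens and passes every other token (e.g. Chinese) through unchanged, which is one way the single-field requirement might be met.

    // Hypothetical sketch, modeled on EdgeNGramTokenFilter: emit edge n-grams
    // for Latin-script tokens only; all other tokens (e.g. Chinese) pass
    // through unchanged. File: LatinOnlyEdgeNGramFilter.java
    package com.example.analysis;

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public final class LatinOnlyEdgeNGramFilter extends TokenFilter {
      private final int minGram;
      private final int maxGram;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);

      private char[] curTerm;    // Latin token currently being expanded, or null
      private int curGramSize;
      private State savedState;  // captured attributes of the buffered token

      public LatinOnlyEdgeNGramFilter(TokenStream input, int minGram, int maxGram) {
        super(input);
        this.minGram = minGram;
        this.maxGram = maxGram;
      }

      @Override
      public boolean incrementToken() throws IOException {
        while (true) {
          if (curTerm == null) {
            if (!input.incrementToken()) {
              return false;
            }
            if (!isLatin(termAtt.buffer(), termAtt.length())) {
              return true;  // e.g. a Chinese token: emit it as-is
            }
            curTerm = termAtt.toString().toCharArray();
            curGramSize = minGram;
            savedState = captureState();
          }
          if (curGramSize <= maxGram && curGramSize <= curTerm.length) {
            restoreState(savedState);
            if (curGramSize > minGram) {
              posIncAtt.setPositionIncrement(0);  // stack grams on one position
            }
            termAtt.copyBuffer(curTerm, 0, curGramSize);
            curGramSize++;
            return true;
          }
          curTerm = null;  // done with this token, pull the next one
        }
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        curTerm = null;
        savedState = null;
      }

      // Simplified: a real implementation must also decide how to treat
      // digits and punctuation (script COMMON) and handle surrogate pairs.
      private static boolean isLatin(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
          if (Character.UnicodeScript.of(buf[i]) != Character.UnicodeScript.LATIN) {
            return false;
          }
        }
        return true;
      }
    }

    // File: LatinOnlyEdgeNGramFilterFactory.java -- the factory is what
    // schema.xml points at, e.g.
    // <filter class="com.example.analysis.LatinOnlyEdgeNGramFilterFactory"
    //         minGramSize="1" maxGramSize="15"/>
    package com.example.analysis;

    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.util.TokenFilterFactory;  // Lucene/Solr 5.x package

    public class LatinOnlyEdgeNGramFilterFactory extends TokenFilterFactory {
      private final int minGram;
      private final int maxGram;

      public LatinOnlyEdgeNGramFilterFactory(Map<String, String> args) {
        super(args);
        minGram = getInt(args, "minGramSize", 1);
        maxGram = getInt(args, "maxGramSize", 15);
        if (!args.isEmpty()) {
          throw new IllegalArgumentException("Unknown parameters: " + args);
        }
      }

      @Override
      public TokenStream create(TokenStream input) {
        return new LatinOnlyEdgeNGramFilter(input, minGram, maxGram);
      }
    }

The buffering in incrementToken() mirrors the pattern used by EdgeNGramTokenFilter itself: capture the token's attribute state once, then restore it for each gram so offsets survive, with a zero position increment for every gram after the first.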
On 26 October 2015 at 13:16, Tomoko Uchida <tomoko.uchida.1...@gmail.com>
wrote:

> > Will try to see if there is any way to manage it with only a single
> > field?
>
> Of course, you can try to create a custom Tokenizer or TokenFilter that
> perfectly meets your needs.
> I would copy the source code of EdgeNGramTokenFilter and modify the
> incrementToken() method. That seems a reasonable way to me.
> incrementToken() of EdgeNGramTokenFilter cannot be overridden; it is
> declared "final" in Solr 5, so subclassing will not work.
> A corresponding custom TokenFilterFactory class is also needed. (See
> EdgeNGramFilterFactory.)
>
> If you are not familiar with both Java and the internal architecture of
> Lucene/Solr, custom classes can bring intricate bugs/problems into your
> system. Be sure to keep them under control.
>
> Anyway, check out and look into the Java sources of the TokenFilters
> included in Solr if you have not yet.
>
> Thanks,
> Tomoko
>
> 2015-10-26 11:19 GMT+09:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
>
> > Hi Tomoko,
> >
> > Thank you for your recommendation.
> >
> > I wasn't in favour of using copyField at first to have 2 separate
> > fields for English and Chinese tokens, as it not only increases the
> > index size, but also slows down the performance of both indexing and
> > querying.
> >
> > Will try to see if there is any way to manage it with only a single
> > field?
> >
> > Regards,
> > Edwin
> >
> >
> > On 25 October 2015 at 22:59, Tomoko Uchida
> > <tomoko.uchida.1...@gmail.com> wrote:
> >
> > > Hi, Edwin,
> > >
> > > > This means it is better to have 2 separate fields for English and
> > > > Chinese words?
> > >
> > > Yes. I mean:
> > > 1. Define FIELD_1, which uses HMMChineseTokenizerFactory to extract
> > > English and Chinese tokens.
> > > 2. Define FIELD_2, which uses PatternTokenizerFactory to extract
> > > English tokens and EdgeNGramFilter to break the tokens up into
> > > sub-strings. There might be several possible tokenizer/filter chains
> > > for extracting English tokens; please try them and find the best
> > > way ;)
> > > 3. Index the original text to FIELD_1 to search tokens as they are
> > > (for both English and Chinese words).
> > > 4. Index the original text to FIELD_2 to perform prefix match (for
> > > English words).
> > > 5. Search FIELD_1 and FIELD_2 by using the edismax query parser, etc.
> > >
> > > You can use copyField to index the original text data to FIELD_1 and
> > > FIELD_2.
> > > The downside of this method is that it increases the index size, as
> > > you know.
> > >
> > > If you want to manage that *by one field*, I think you can create a
> > > custom token filter on your own... but that may be slightly advanced.
> > >
> > > Thanks,
> > > Tomoko
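A rough schema.xml sketch of steps 1-5 above, for readers following along. The field and type names are invented, and the Latin-letters pattern is only one way to extract the English tokens:

    <!-- FIELD_1: English and Chinese tokens as they are (steps 1 and 3) -->
    <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- FIELD_2: English-only tokens, edge-n-grammed at index time so that
         prefixes match (steps 2 and 4) -->
    <fieldType name="text_en_prefix" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <!-- keep only runs of Latin letters; everything else is discarded -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Za-z]+" group="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
      </analyzer>
      <!-- no n-gramming at query time: the typed prefix should match as-is -->
      <analyzer type="query">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Za-z]+" group="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="content"    type="text_cjk"       indexed="true" stored="true"/>
    <field name="content_en" type="text_en_prefix" indexed="true" stored="false"/>
    <copyField source="content" dest="content_en"/>

Step 5 would then be a request along the lines of defType=edismax&qf=content content_en.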
> > > 2015-10-25 22:48 GMT+09:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> > >
> > > > Hi Tomoko,
> > > >
> > > > Thank you for your reply.
> > > >
> > > > > If you need to perform partial (prefix) match for **only English
> > > > > words**, you can create a separate field that keeps only English
> > > > > words (I've never tried that, but it might be possible with
> > > > > PatternTokenizerFactory or other tokenizer/filter chains...) and
> > > > > apply EdgeNGramFilterFactory to that field.
> > > >
> > > > This means it is better to have 2 separate fields for English and
> > > > Chinese words?
> > > > Not quite sure what you mean by that.
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 25 October 2015 at 11:42, Tomoko Uchida
> > > > <tomoko.uchida.1...@gmail.com> wrote:
> > > >
> > > > > > I have rich-text documents that are in both English and
> > > > > > Chinese, and currently I have EdgeNGramFilterFactory enabled
> > > > > > during indexing, as I need it for partial matching of English
> > > > > > words. But this means it will also break up each of the Chinese
> > > > > > characters into different tokens.
> > > > >
> > > > > EdgeNGramFilterFactory creates sub-strings (prefixes) from each
> > > > > token. Its behavior is independent of language.
> > > > > If you need to perform partial (prefix) match for **only English
> > > > > words**, you can create a separate field that keeps only English
> > > > > words (I've never tried that, but it might be possible with
> > > > > PatternTokenizerFactory or other tokenizer/filter chains...) and
> > > > > apply EdgeNGramFilterFactory to that field.
> > > > >
> > > > > Hope it helps,
> > > > > Tomoko
> > > > >
> > > > > 2015-10-23 13:04 GMT+09:00 Zheng Lin Edwin Yeo
> > > > > <edwinye...@gmail.com>:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to check: is it good to use EdgeNGramFilterFactory
> > > > > > for indexes that contain Chinese characters?
> > > > > > Will it affect the accuracy of the search for Chinese words?
> > > > > >
> > > > > > I have rich-text documents that are in both English and
> > > > > > Chinese, and currently I have EdgeNGramFilterFactory enabled
> > > > > > during indexing, as I need it for partial matching of English
> > > > > > words. But this means it will also break up each of the Chinese
> > > > > > characters into different tokens.
> > > > > >
> > > > > > I'm using the HMMChineseTokenizerFactory for my tokenizer.
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Regards,
> > > > > > Edwin
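To make the concern in the original question concrete: EdgeNGramFilterFactory n-grams every token regardless of script. A sketch of the analysis chain being described, with assumed gram sizes (the exact segmentation of the Chinese text depends on the tokenizer's model):

    <analyzer>
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
    </analyzer>

    <!-- Input: "solar 太阳能"
         Assuming the tokenizer emits the two tokens "solar" and "太阳能":
           solar  -> s, so, sol, sola, solar  (the wanted English prefixes)
           太阳能 -> 太, 太阳, 太阳能           (single-character grams that can
                                              match many unrelated documents)
         Raising minGramSize trims the shortest grams for both scripts alike. -->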