Hi Tomoko,

Thank you for your advice. Will look into the Java source code of the token filters.

Regards,
Edwin
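For reference, here is a minimal sketch of the kind of custom filter and factory Tomoko describes below. The class names are invented and the script check is simplified; the filter emits edge n-grams for Latin-script tokens and passes every other token (e.g. Chinese) through unchanged, which is one way the single-field requirement might be met.

    // Hypothetical sketch, modeled on EdgeNGramTokenFilter: emit edge n-grams
    // for Latin-script tokens only; all other tokens (e.g. Chinese) pass
    // through unchanged. File: LatinOnlyEdgeNGramFilter.java
    package com.example.analysis;

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public final class LatinOnlyEdgeNGramFilter extends TokenFilter {
      private final int minGram;
      private final int maxGram;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);

      private char[] curTerm;    // Latin token currently being expanded, or null
      private int curGramSize;
      private State savedState;  // captured attributes of the buffered token

      public LatinOnlyEdgeNGramFilter(TokenStream input, int minGram, int maxGram) {
        super(input);
        this.minGram = minGram;
        this.maxGram = maxGram;
      }

      @Override
      public boolean incrementToken() throws IOException {
        while (true) {
          if (curTerm == null) {
            if (!input.incrementToken()) {
              return false;
            }
            if (!isLatin(termAtt.buffer(), termAtt.length())) {
              return true;  // e.g. a Chinese token: emit it as-is
            }
            curTerm = termAtt.toString().toCharArray();
            curGramSize = minGram;
            savedState = captureState();
          }
          if (curGramSize <= maxGram && curGramSize <= curTerm.length) {
            restoreState(savedState);
            if (curGramSize > minGram) {
              posIncAtt.setPositionIncrement(0);  // stack grams on one position
            }
            termAtt.copyBuffer(curTerm, 0, curGramSize);
            curGramSize++;
            return true;
          }
          curTerm = null;  // done with this token, pull the next one
        }
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        curTerm = null;
        savedState = null;
      }

      // Simplified: a real implementation must also decide how to treat
      // digits and punctuation (script COMMON) and handle surrogate pairs.
      private static boolean isLatin(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
          if (Character.UnicodeScript.of(buf[i]) != Character.UnicodeScript.LATIN) {
            return false;
          }
        }
        return true;
      }
    }

    // File: LatinOnlyEdgeNGramFilterFactory.java -- the factory is what
    // schema.xml points at, e.g.
    // <filter class="com.example.analysis.LatinOnlyEdgeNGramFilterFactory"
    //         minGramSize="1" maxGramSize="15"/>
    package com.example.analysis;

    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.util.TokenFilterFactory;  // Lucene/Solr 5.x package

    public class LatinOnlyEdgeNGramFilterFactory extends TokenFilterFactory {
      private final int minGram;
      private final int maxGram;

      public LatinOnlyEdgeNGramFilterFactory(Map<String, String> args) {
        super(args);
        minGram = getInt(args, "minGramSize", 1);
        maxGram = getInt(args, "maxGramSize", 15);
        if (!args.isEmpty()) {
          throw new IllegalArgumentException("Unknown parameters: " + args);
        }
      }

      @Override
      public TokenStream create(TokenStream input) {
        return new LatinOnlyEdgeNGramFilter(input, minGram, maxGram);
      }
    }

The buffering in incrementToken() mirrors the pattern used by EdgeNGramTokenFilter itself: capture the token's attribute state once, then restore it for each gram so offsets survive, with a zero position increment for every gram after the first.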
On 26 October 2015 at 13:16, Tomoko Uchida <tomoko.uchida.1...@gmail.com>
wrote:

> > Will try to see if there is any way to manage it with only a single
> > field?
>
> Of course, you can try to create a custom Tokenizer or TokenFilter that
> perfectly meets your needs.
> I would copy the source code of EdgeNGramTokenFilter and modify the
> incrementToken() method. That seems a reasonable way to me.
> incrementToken() of EdgeNGramTokenFilter cannot be overridden; it is
> declared "final" in Solr 5, so subclassing will not work.
> A corresponding custom TokenFilterFactory class is also needed. (See
> EdgeNGramFilterFactory.)
>
> If you are not familiar with both Java and the internal architecture of
> Lucene/Solr, custom classes can bring intricate bugs/problems into your
> system. Be sure to keep them under control.
>
> Anyway, check out and look into the Java sources of the TokenFilters
> included in Solr if you have not yet.
>
> Thanks,
> Tomoko
>
> 2015-10-26 11:19 GMT+09:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
>
> > Hi Tomoko,
> >
> > Thank you for your recommendation.
> >
> > I wasn't in favour of using copyField at first to have 2 separate
> > fields for English and Chinese tokens, as it not only increases the
> > index size, but also slows down the performance of both indexing and
> > querying.
> >
> > Will try to see if there is any way to manage it with only a single
> > field?
> >
> > Regards,
> > Edwin
> >
> >
> > On 25 October 2015 at 22:59, Tomoko Uchida
> > <tomoko.uchida.1...@gmail.com> wrote:
> >
> > > Hi, Edwin,
> > >
> > > > This means it is better to have 2 separate fields for English and
> > > > Chinese words?
> > >
> > > Yes. I mean:
> > > 1. Define FIELD_1, which uses HMMChineseTokenizerFactory to extract
> > > English and Chinese tokens.
> > > 2. Define FIELD_2, which uses PatternTokenizerFactory to extract
> > > English tokens and EdgeNGramFilter to break the tokens up into
> > > sub-strings. There might be several possible tokenizer/filter chains
> > > for extracting English tokens; please try them and find the best
> > > way ;)
> > > 3. Index the original text to FIELD_1 to search tokens as they are
> > > (for both English and Chinese words).
> > > 4. Index the original text to FIELD_2 to perform prefix match (for
> > > English words).
> > > 5. Search FIELD_1 and FIELD_2 by using the edismax query parser, etc.
> > >
> > > You can use copyField to index the original text data to FIELD_1 and
> > > FIELD_2.
> > > The downside of this method is that it increases the index size, as
> > > you know.
> > >
> > > If you want to manage that *by one field*, I think you can create a
> > > custom token filter on your own... but that may be slightly advanced.
> > >
> > > Thanks,
> > > Tomoko
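A rough schema.xml sketch of steps 1-5 above, for readers following along. The field and type names are invented, and the Latin-letters pattern is only one way to extract the English tokens:

    <!-- FIELD_1: English and Chinese tokens as they are (steps 1 and 3) -->
    <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- FIELD_2: English-only tokens, edge-n-grammed at index time so that
         prefixes match (steps 2 and 4) -->
    <fieldType name="text_en_prefix" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <!-- keep only runs of Latin letters; everything else is discarded -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Za-z]+" group="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
      </analyzer>
      <!-- no n-gramming at query time: the typed prefix should match as-is -->
      <analyzer type="query">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Za-z]+" group="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="content"    type="text_cjk"       indexed="true" stored="true"/>
    <field name="content_en" type="text_en_prefix" indexed="true" stored="false"/>
    <copyField source="content" dest="content_en"/>

Step 5 would then be a request along the lines of defType=edismax&qf=content content_en.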
> > > 2015-10-25 22:48 GMT+09:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> > >
> > > > Hi Tomoko,
> > > >
> > > > Thank you for your reply.
> > > >
> > > > > If you need to perform partial (prefix) match for **only English
> > > > > words**, you can create a separate field that keeps only English
> > > > > words (I've never tried that, but it might be possible with
> > > > > PatternTokenizerFactory or other tokenizer/filter chains...) and
> > > > > apply EdgeNGramFilterFactory to that field.
> > > >
> > > > This means it is better to have 2 separate fields for English and
> > > > Chinese words?
> > > > Not quite sure what you mean by that.
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 25 October 2015 at 11:42, Tomoko Uchida
> > > > <tomoko.uchida.1...@gmail.com> wrote:
> > > >
> > > > > > I have rich-text documents that are in both English and
> > > > > > Chinese, and currently I have EdgeNGramFilterFactory enabled
> > > > > > during indexing, as I need it for partial matching of English
> > > > > > words. But this means it will also break up each of the Chinese
> > > > > > characters into different tokens.
> > > > >
> > > > > EdgeNGramFilterFactory creates sub-strings (prefixes) from each
> > > > > token. Its behavior is independent of language.
> > > > > If you need to perform partial (prefix) match for **only English
> > > > > words**, you can create a separate field that keeps only English
> > > > > words (I've never tried that, but it might be possible with
> > > > > PatternTokenizerFactory or other tokenizer/filter chains...) and
> > > > > apply EdgeNGramFilterFactory to that field.
> > > > >
> > > > > Hope it helps,
> > > > > Tomoko
> > > > >
> > > > > 2015-10-23 13:04 GMT+09:00 Zheng Lin Edwin Yeo
> > > > > <edwinye...@gmail.com>:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to check: is it good to use EdgeNGramFilterFactory
> > > > > > for indexes that contain Chinese characters?
> > > > > > Will it affect the accuracy of the search for Chinese words?
> > > > > >
> > > > > > I have rich-text documents that are in both English and
> > > > > > Chinese, and currently I have EdgeNGramFilterFactory enabled
> > > > > > during indexing, as I need it for partial matching of English
> > > > > > words. But this means it will also break up each of the Chinese
> > > > > > characters into different tokens.
> > > > > >
> > > > > > I'm using the HMMChineseTokenizerFactory for my tokenizer.
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Regards,
> > > > > > Edwin
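To make the concern in the original question concrete: EdgeNGramFilterFactory n-grams every token regardless of script. A sketch of the analysis chain being described, with assumed gram sizes (the exact segmentation of the Chinese text depends on the tokenizer's model):

    <analyzer>
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
    </analyzer>

    <!-- Input: "solar 太阳能"
         Assuming the tokenizer emits the two tokens "solar" and "太阳能":
           solar  -> s, so, sol, sola, solar  (the wanted English prefixes)
           太阳能 -> 太, 太阳, 太阳能           (single-character grams that can
                                              match many unrelated documents)
         Raising minGramSize trims the shortest grams for both scripts alike. -->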