On 8/3/2018 1:10 AM, Nitesh Kumar wrote:
As I discussed above, in some special cases these fields (field1, field2, etc.) may contain *CJK* text.  That is, field1, field2, and so on can store either plain *English* text or *CJK* text.  With *StandardTokenizer*, indexing and querying work fine for plain *English* text, but not for *CJK* text.

We have one index where fields can contain both English and CJK.  The customer is in Japan.  I designed it to work properly with all CJK characters, not just Japanese.

This is the fieldType I came up with after a LOT of research.  Most of the information that was useful came from a series of blog posts:

https://apaste.info/Vfwf

I used a paste website because line wrapping within an email would have made it difficult to copy.  The paste expires in one month.
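Since the paste will expire, here is a rough sketch of the kind of fieldType that research leads to.  This is NOT a copy of the paste -- the field name, filter selection, and attribute values below are my illustrative assumptions, based on the publicly documented Stanford CJK work that the blog series describes:

```xml
<!-- Illustrative sketch only, not the actual expired paste.
     Assumes the ICU analysis contrib module and the custom
     CJKFoldingFilter jar are both on Solr's classpath. -->
<fieldType name="text_cjk_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICU tokenizer handles CJK scripts far better than StandardTokenizer -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Normalize fullwidth/halfwidth forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Custom jar: maps traditional Han variants for cross-script matching -->
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <!-- Aggressive Unicode folding (see the caveat below about alternatives) -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Index CJK as overlapping bigrams plus unigrams -->
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true" katakana="true" hangul="true"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
```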

This analysis chain uses the ICU classes that are included as a contrib module with Solr, as well as one custom jar:

https://github.com/sul-dlss/CJKFoldingFilter/blob/master/src/edu/stanford/lucene/analysis/CJKFoldingFilterFactory.java

The blog posts I used to create my schema can be found here:

http://discovery-grindstone.blogspot.com/2014/

Some people might find the ICUFoldingFilterFactory too aggressive.  If so, replace it with ASCIIFoldingFilterFactory and ICUNormalizer2FilterFactory.  This is what we're actually using -- the customer didn't want the kinds of matches that the ICU class allowed.
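As a concrete sketch of that substitution (the normalizer attributes shown are the Solr defaults, included here for clarity):

```xml
<!-- Less aggressive alternative to a single ICUFoldingFilterFactory -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
```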

Using edismax with an unusual value for the "mm" parameter might solve some of your other issues.  This is discussed in parts 8 and 12 of the blog series.
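For illustration only (the qf fields and the mm value here are placeholders, not a recommendation from the blog), such a request might look like:

```
q=some query text
&defType=edismax
&qf=field1 field2
&mm=2<-1 5<80%
```

The mm syntax above means: require all terms for queries of 2 or fewer terms, all but one for 3-5 terms, and 80% for longer queries.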

One note about your analysis chain: you have a filter listed before the tokenizer.  The listing order of different element types doesn't change execution order -- Solr always runs CharFilter entries first, then the tokenizer, then Filter entries.  So the ASCIIFoldingFilterFactory that you listed first is in fact being run second.
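A minimal sketch of that fixed ordering (the specific tokenizer and the other filters here are placeholders -- only the ASCIIFoldingFilterFactory is taken from your mail):

```xml
<analyzer>
  <!-- CharFilters run first, on the raw character stream -->
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <!-- The tokenizer always runs next, regardless of listing order -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Filters run last, in the order listed -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```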

Thanks,
Shawn
