> So, if you are trying to make sure your index breaks words properly on
> eastern languages, just use ICU Tokenizer.

I defer to the expertise on this list, but last I checked ICUTokenizer uses dictionary lookup to tokenize CJK. This may work well for some tasks, but I haven't evaluated whether it performs better than smartcn or even just cjkbigramfilter on actual retrieval tasks, and I'd be hesitant to state "just use" and imply the problem is solved.

I thought I remembered ICUTokenizer not playing well with the CJKBigramFilter, but it appears to be working in 6.6.

> use the ICUNormalizer

I could not agree with this more.

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov]
Sent: Tuesday, June 20, 2017 12:02 PM
To: solr-user@lucene.apache.org
Subject: RE: How are people using the ICUTokenizer?

Joel,

I think the issue is doing word-breaking according to ICU rules. So, if you are trying to make sure your index breaks words properly on eastern languages, just use ICU Tokenizer. Unless your text is already in an ICU normal form, you should always use the ICUNormalizer character filter along with this:

https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory

I think that this would be good with Shingles when you are not removing stop words, maybe in an alternate analysis of the same content.

I'm using it in this way, with shingles for phrase recognition and only doc freq and term freq - my possibly naïve idea is that I do not need positions and offsets if I'm using shingles, and my main goal is to do a MoreLikeThis query using the shingled versions of fields.

-----Original Message-----
From: Joel Bernstein [mailto:joels...@gmail.com]
Sent: Tuesday, June 20, 2017 11:52 AM
To: solr-user@lucene.apache.org
Subject: How are people using the ICUTokenizer?

It seems that there are some powerful capabilities in the ICUTokenizer. I was wondering how the community is making use of it. Does anyone have experience working with the ICUTokenizer that they can share?

Joel Bernstein http://joelsolr.blogspot.com/
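
For anyone wanting to try the setup Daniel describes above (ICUNormalizer2 char filter, then ICUTokenizer, then shingles with no stop word removal, indexed without positions), a schema fragment might look roughly like the one below. This is only a sketch under assumptions: the names text_icu_shingle, body and body_shingles are made up for illustration, the shingle sizes are arbitrary, and it assumes the ICU jars from the analysis-extras contrib are on the classpath. It does not settle whether ICUTokenizer beats smartcn or CJKBigramFilter on CJK retrieval.

```xml
<!-- Hypothetical type and field names; adjust to your own schema. -->
<fieldType name="text_icu_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Normalize text before it reaches the tokenizer.
         name="nfkc_cf" and mode="compose" are the documented defaults,
         spelled out here only for clarity. -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf" mode="compose"/>
    <!-- Word-breaking according to ICU rules. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- No stop filter on purpose, so shingles keep the original word order.
         Shingle sizes below are an arbitrary choice for this sketch. -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

<!-- Alternate analysis of the same content: term freq and doc freq are kept,
     positions are dropped. -->
<field name="body_shingles" type="text_icu_shingle"
       indexed="true" stored="false" omitPositions="true"/>
<copyField source="body" dest="body_shingles"/>
```

With omitPositions="true" the field cannot serve phrase queries, which fits the idea that the shingles themselves stand in for phrases. A MoreLikeThis request would then point its mlt.fl parameter at body_shingles.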