> So, if you are trying to make sure your index breaks words properly on 
> eastern languages, just use ICU Tokenizer.   

I defer to the expertise on this list, but last I checked, ICUTokenizer uses 
dictionary lookup to tokenize CJK.  This may work well for some tasks, but I 
haven't evaluated whether it performs better than smartcn or even just 
CJKBigramFilter on actual retrieval tasks, and I'd be hesitant to say "just 
use" it and imply the problem is solved.  

I thought I remembered ICUTokenizer not playing well with the CJKBigramFilter, 
but it appears to be working in 6.6.
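
For anyone unfamiliar with the bigram approach, here is a rough sketch of the 
idea in plain Java. It is a simplified illustration and not Lucene's 
CJKBigramFilter: the class and method names are made up, it only looks at Han 
characters, and it ignores token types, offsets, and the filter's options. 
Each run of Han characters becomes overlapping two-character terms, so 中华人民 
yields 中华, 华人, 人民 with no dictionary involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene's CJKBigramFilter.
class CjkBigramSketch {

    // Emit overlapping two-character terms from each run of Han characters.
    // Any non-Han character ends the current run and is otherwise ignored.
    static List<String> hanBigrams(String text) {
        List<String> out = new ArrayList<>();
        List<Integer> run = new ArrayList<>();
        for (int cp : text.codePoints().toArray()) {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                run.add(cp);
            } else {
                flush(run, out);
            }
        }
        flush(run, out);
        return out;
    }

    // Turn one run into bigrams; a lone character is kept as a unigram.
    static void flush(List<Integer> run, List<String> out) {
        if (run.size() == 1) {
            out.add(new StringBuilder().appendCodePoint(run.get(0)).toString());
        }
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(new StringBuilder()
                    .appendCodePoint(run.get(i))
                    .appendCodePoint(run.get(i + 1))
                    .toString());
        }
        run.clear();
    }

    public static void main(String[] args) {
        // Unicode escapes for the four Han characters in the example above,
        // used so the source compiles regardless of file encoding.
        System.out.println(hanBigrams("\u4e2d\u534e\u4eba\u6c11"));
    }
}
```

Whether terms like these retrieve better than dictionary-segmented words is 
exactly the part I have not measured.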

> use the ICUNormalizer

I could not agree with this more.  
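
To show why normalizing before tokenizing matters, here is a small sketch 
using only the JDK. It is not the ICU char filter: it assumes NFKC followed by 
lowercasing is a close enough stand-in for the nfkc_cf form that, to my 
understanding, ICUNormalizer2CharFilterFactory applies by default, and 
lowercasing is not the same thing as full Unicode case folding. The class and 
method names are made up.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration; approximates NFKC + case folding with the JDK.
class NormalizeSketch {

    // NFKC maps compatibility variants (full-width Latin, half-width
    // katakana, ligatures) to their ordinary forms. Lowercasing stands in
    // for case folding here, which is an assumption of this sketch.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC)
                         .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Full-width "Solr" written with Unicode escapes.
        String fullWidth = "\uFF33\uFF4F\uFF4C\uFF52";
        System.out.println(normalize(fullWidth).equals("solr"));
    }
}
```

Without a step like this, a full-width query term and its ordinary-width 
spelling in the index are different terms and will not match.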

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, June 20, 2017 12:02 PM
To: solr-user@lucene.apache.org
Subject: RE: How are people using the ICUTokenizer?

Joel,

I think the issue is doing word-breaking according to ICU rules.   So, if you 
are trying to make sure your index breaks words properly on eastern languages, 
just use ICU Tokenizer.   Unless your text is already in an ICU normal form, 
you should always use the ICUNormalizer character filter along with this:

https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory

I think that this would be good with Shingles when you are not removing stop 
words, maybe in an alternate analysis of the same content.

I'm using it in this way, with shingles for phrase recognition and with only 
doc freq and term freq indexed. My possibly naïve idea is that I do not need 
positions and offsets if I'm using shingles, and my main goal is to do a 
MoreLikeThis query using the shingled versions of fields.
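
The shingle idea can be sketched in plain Java as follows. This is a 
simplified illustration, not Lucene's ShingleFilter: the names are made up, 
and it emits only the shingles of one fixed size, whereas the real filter, as 
far as I know, also emits unigrams by default. Adjacent tokens are joined 
into single terms, so word order is captured in the term itself rather than 
in stored positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene's ShingleFilter.
class ShingleSketch {

    // Join every window of `size` adjacent tokens into one space-separated term.
    static List<String> shingles(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stop words are kept, so "to be" and "or not" survive as terms.
        System.out.println(shingles(Arrays.asList("to", "be", "or", "not"), 2));
    }
}
```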

-----Original Message-----
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: Tuesday, June 20, 2017 11:52 AM
To: solr-user@lucene.apache.org
Subject: How are people using the ICUTokenizer?

It seems that there are some powerful capabilities in the ICUTokenizer. I was 
wondering how the community is making use of it.

Does anyone have experience working with the ICUTokenizer that they can share?


Joel Bernstein
http://joelsolr.blogspot.com/
