Joel,

I think the main draw is word-breaking according to ICU rules. So, if you are 
trying to make sure your index breaks words properly in East Asian languages, 
just use the ICUTokenizer. Unless your text is already in an ICU normal form, 
you should always use the ICUNormalizer character filter along with it:

https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory
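
Something like this in schema.xml, for example (the field type name is just my 
own label; note that both factories ship in the analysis-extras contrib, so 
the ICU jars need to be on the classpath):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- normalize (NFKC plus case folding by default) before tokenizing -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- word-break per ICU's UAX #29 rules, with per-script tailoring -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>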

I think this combines well with shingles when you are not removing stop words, 
maybe in an alternate analysis of the same content.

I'm using it that way myself: shingles for phrase recognition, indexing only 
doc freq and term freq. My possibly naïve idea is that I do not need positions 
and offsets if I'm using shingles, since my main goal is to run MoreLikeThis 
queries against the shingled versions of the fields.
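
Roughly like this, for example (the names are mine, and omitPositions="true" 
is what drops the field down to just doc freqs and term freqs):

<fieldType name="text_icu_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- 2-3 word shingles; keep unigrams so single terms still match -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

<!-- the alternate analysis: copy the plain field into the shingled one -->
<field name="body" type="text_icu" indexed="true" stored="true"/>
<field name="body_shingles" type="text_icu_shingle" indexed="true"
       stored="false" termVectors="true" omitPositions="true"/>
<copyField source="body" dest="body_shingles"/>

Then MoreLikeThis can be pointed at the shingled field, e.g. with 
mlt.fl=body_shingles (term vectors are enabled above so MLT can read the 
terms back without the field being stored).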

-----Original Message-----
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: Tuesday, June 20, 2017 11:52 AM
To: solr-user@lucene.apache.org
Subject: How are people using the ICUTokenizer?

It seems that there are some powerful capabilities in the ICUTokenizer. I was 
wondering how the community is making use of it.

Does anyone have experience working with the ICUTokenizer that they can share?


Joel Bernstein
http://joelsolr.blogspot.com/
