What got me interested was that under the covers the ICUTokenizer is using ICU's BreakIterator: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html
Looks like we can get sentences and titles fairly easily and paragraphs with some extra work.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Jun 20, 2017 at 1:54 PM, Allison, Timothy B. <talli...@mitre.org> wrote:

> > So, if you are trying to make sure your index breaks words properly on
> > eastern languages, just use ICU Tokenizer.
>
> I defer to the expertise on this list, but last I checked ICUTokenizer
> uses dictionary lookup to tokenize CJK. This may work well for some tasks,
> but I haven't evaluated whether it performs better than smartcn or even
> just cjkbigramfilter on actual retrieval tasks, and I'd be hesitant to
> state "just use" and imply the problem is solved.
>
> I thought I remembered ICUTokenizer not playing well with the
> CJKBigramFilter, but it appears to be working in 6.6.
>
> > use the ICUNormalizer
> I could not agree with this more.
>
> -----Original Message-----
> From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov]
> Sent: Tuesday, June 20, 2017 12:02 PM
> To: solr-user@lucene.apache.org
> Subject: RE: How are people using the ICUTokenizer?
>
> Joel,
>
> I think the issue is doing word-breaking according to ICU rules. So, if
> you are trying to make sure your index breaks words properly on eastern
> languages, just use ICU Tokenizer. Unless your text is already in an ICU
> normal form, you should always use the ICUNormalizer character filter along
> with this:
>
> https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory
>
> I think that this would be good with Shingles when you are not removing
> stop words, maybe in an alternate analysis of the same content.
>
> I'm using it in this way, with shingles for phrase recognition and only
> doc freq and term freq - my possibly naïve idea is that I do not need
> positions and offsets if I'm using shingles, and my main goal is to do a
> MoreLikeThis query using the shingled versions of fields.
>
> -----Original Message-----
> From: Joel Bernstein [mailto:joels...@gmail.com]
> Sent: Tuesday, June 20, 2017 11:52 AM
> To: solr-user@lucene.apache.org
> Subject: How are people using the ICUTokenizer?
>
> It seems that there are some powerful capabilities in the ICUTokenizer. I
> was wondering how the community is making use of it.
>
> Does anyone have experience working with the ICUTokenizer that they can
> share?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
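
For anyone following the BreakIterator pointer at the top of this thread, here is a minimal sketch of sentence splitting. It uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that ICU4J's com.ibm.icu.text.BreakIterator offers the same getSentenceInstance / setText / first / next calls. The class name SentenceSplitter and its sentences method are hypothetical, and this is not a description of how ICUTokenizer itself is wired up inside Lucene.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Solr, Lucene, or ICU.
final class SentenceSplitter {

    // Returns the sentences found in `text`, trimmed, in document order.
    static List<String> sentences(String text, Locale locale) {
        // JDK stand-in; the ICU4J class is assumed to be a drop-in here.
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ICU splits text on boundaries. Sentences are one kind.";
        for (String s : sentences(text, Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

Changing the import to com.ibm.icu.text.BreakIterator should apply ICU's rules instead of the JDK's. Paragraphs would still need extra logic on top, as the reply above notes.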