Have you successfully used shingles with the MoreLikeThis query? I'm really curious whether it would return the "interesting phrases".

On Tue, Jun 20, 2017 at 12:01 PM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote:

> Joel,
>
> I think the issue is doing word-breaking according to ICU rules. So, if
> you are trying to make sure your index breaks words properly on eastern
> languages, just use ICU Tokenizer. Unless your text is already in an ICU
> normal form, you should always use the ICUNormalizer character filter along
> with this:
>
> https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory
>
> I think that this would be good with Shingles when you are not removing
> stop words, maybe in an alternate analysis of the same content.
>
> I'm using it in this way, with shingles for phrase recognition and only
> doc freq and term freq - my possibly naïve idea is that I do not need
> positions and offsets if I'm using shingles, and my main goal is to do a
> MoreLikeThis query using the shingled versions of fields.
>
> -----Original Message-----
> From: Joel Bernstein [mailto:joels...@gmail.com]
> Sent: Tuesday, June 20, 2017 11:52 AM
> To: solr-user@lucene.apache.org
> Subject: How are people using the ICUTokenizer?
>
> It seems that there are some powerful capabilities in the ICUTokenizer. I
> was wondering how the community is making use of it.
>
> Does anyone have experience working with the ICUTokenizer that they can
> share?
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
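
For anyone following the thread, here is a rough Python sketch of the idea in the quoted message: once adjacent words are joined into shingles, each phrase becomes a single term, so plain term frequencies can surface repeated phrases without stored positions or offsets. This is only an illustration, not Solr or Lucene code. The function names `word_shingles` and `shingle_term_freqs` are made up for this example, and a naive whitespace split stands in for the ICU tokenizer.

```python
from collections import Counter


def word_shingles(tokens, min_size=2, max_size=3):
    """Return every run of min_size..max_size adjacent tokens, joined by a space."""
    shingles = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            shingles.append(" ".join(tokens[start:start + size]))
    return shingles


def shingle_term_freqs(text, min_size=2, max_size=3):
    """Count how often each shingle occurs in the text (term frequency only)."""
    # Assumption: lowercase + whitespace split is a stand-in for real tokenization.
    tokens = text.lower().split()
    return Counter(word_shingles(tokens, min_size, max_size))


freqs = shingle_term_freqs("the quick brown fox and the quick brown dog")
# Shingles that repeat are candidate phrases; no positions were kept to find them.
repeated = [shingle for shingle, count in freqs.items() if count > 1]
print(repeated)
```

Whether MoreLikeThis actually picks such shingles as its "interesting" terms would still depend on its own term selection settings, which this sketch does not model.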