Re: How are people using the ICUTokenizer?

David Hastings Tue, 20 Jun 2017 09:14:00 -0700

Have you successfully used the shingles with the MoreLikeThis query?
Really curious whether this would return the "interesting phrases".


On Tue, Jun 20, 2017 at 12:01 PM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> Joel,
>
> I think the issue is doing word-breaking according to ICU rules. So, if
> you are trying to make sure your index breaks words properly on eastern
> languages, just use the ICU Tokenizer. Unless your text is already in a
> Unicode normalization form, you should always use the ICUNormalizer2
> character filter along with this:
>
> https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory
>
> I think that this would be good with Shingles when you are not removing
> stop words, maybe in an alternate analysis of the same content.
>
> I'm using it this way: shingles for phrase recognition, indexing only
> doc freq and term freq. My possibly naïve idea is that I do not need
> positions and offsets if I'm using shingles, and my main goal is to run a
> MoreLikeThis query against the shingled versions of fields.
>
> -----Original Message-----
> From: Joel Bernstein [mailto:joels...@gmail.com]
> Sent: Tuesday, June 20, 2017 11:52 AM
> To: solr-user@lucene.apache.org
> Subject: How are people using the ICUTokenizer?
>
> It seems that there are some powerful capabilities in the ICUTokenizer. I
> was wondering how the community is making use of it.
>
> Does anyone have experience working with the ICUTokenizer that they can
> share?
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
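
For readers following the thread, here is a rough Python sketch of the idea
quoted above: normalize the text, split it into words, and emit word shingles,
so that each two-word phrase becomes a single term whose term frequency can be
counted without positions or offsets. It is not Solr code and not anyone's
actual configuration. The function names (normalize, shingles, analyze) are
made up for illustration, whitespace splitting merely stands in for ICU
word-breaking (it would not segment Thai or Chinese text), and NFKC plus case
folding only approximates what the ICU normalizer does. The shingle size of 2
and the emission of unigrams are assumptions chosen for this sketch.

```python
import unicodedata
from collections import Counter


def normalize(text):
    # Assumption: NFKC + case folding as a rough stand-in for ICU normalization.
    return unicodedata.normalize("NFKC", text).casefold()


def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    # Emit word n-grams ("shingles") starting at each token position.
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text):
    # Hypothetical analysis chain: normalize, split on whitespace, shingle.
    # Whitespace splitting is NOT equivalent to ICU word-breaking.
    return shingles(normalize(text).split())


if __name__ == "__main__":
    terms = analyze("More Like This and more like this")
    # Term frequencies per shingle; no positions or offsets are kept.
    for term, freq in Counter(terms).most_common():
        print(freq, term)
```

Because each phrase is already a term of its own, plain term and document
frequencies are enough to rank phrases in this toy version. Whether the same
holds for MoreLikeThis on a real index is the open question in this thread.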
