Re: How are people using the ICUTokenizer?

Joel Bernstein Tue, 20 Jun 2017 15:23:29 -0700

What got me interested was that under the covers the ICUTokenizer uses the
ICU4J BreakIterator:
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/BreakIterator.html


Looks like we can get sentences and titles fairly easily and paragraphs
with some extra work.
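
To make the sentence case concrete, here is a minimal sketch of walking
sentence boundaries with a BreakIterator. It is an illustration only: to stay
dependency-free it uses the JDK's java.text.BreakIterator as a stand-in, on
the assumption that com.ibm.icu.text.BreakIterator (which exposes the same
getSentenceInstance/setText/first/next/DONE calls) can be swapped in via the
import. The class name SentenceSplitter and the method splitSentences are
hypothetical names chosen for this example, and nothing here shows how the
ICUTokenizer itself is wired up inside Lucene.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Solr, Lucene, or ICU.
final class SentenceSplitter {
    private SentenceSplitter() {}

    // Splits text into trimmed sentences using the locale's sentence rules.
    static List<String> splitSentences(String text, Locale locale) {
        // Assumption: java.text.BreakIterator stands in for the ICU4J class.
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the following boundary, or DONE at the end.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : splitSentences("Hello world. How are you?", Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

Paragraphs are the "extra work" part: as far as I can tell there is no
paragraph instance to ask for, so that would need custom handling on top of a
loop like this one.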


Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Jun 20, 2017 at 1:54 PM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> > So, if you are trying to make sure your index breaks words properly on
> eastern languages, just use ICU Tokenizer.
>
> I defer to the expertise on this list, but last I checked ICUTokenizer
> uses dictionary lookup to tokenize CJK.  This may work well for some tasks,
> but I haven't evaluated whether it performs better than smartcn or even
> just CJKBigramFilter on actual retrieval tasks, and I'd be hesitant to
> state "just use" and imply the problem is solved.
>
> I thought I remembered ICUTokenizer not playing well with the
> CJKBigramFilter, but it appears to be working in 6.6.
>
> > use the ICUNormalizer
> I could not agree with this more.
>
> -----Original Message-----
> From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov]
> Sent: Tuesday, June 20, 2017 12:02 PM
> To: solr-user@lucene.apache.org
> Subject: RE: How are people using the ICUTokenizer?
>
> Joel,
>
> I think the issue is doing word-breaking according to ICU rules. So, if
> you are trying to make sure your index breaks words properly on eastern
> languages, just use ICU Tokenizer. Unless your text is already in an ICU
> normal form, you should always use the ICUNormalizer character filter along
> with this:
>
> https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory
>
> I think that this would be good with Shingles when you are not removing
> stop words, maybe in an alternate analysis of the same content.
>
> I'm using it in this way, with shingles for phrase recognition and only
> doc freq and term freq - my possibly naïve idea is that I do not need
> positions and offsets if I'm using shingles, and my main goal is to do a
> MoreLikeThis query using the shingled versions of fields.
>
> -----Original Message-----
> From: Joel Bernstein [mailto:joels...@gmail.com]
> Sent: Tuesday, June 20, 2017 11:52 AM
> To: solr-user@lucene.apache.org
> Subject: How are people using the ICUTokenizer?
>
> It seems that there are some powerful capabilities in the ICUTokenizer. I
> was wondering how the community is making use of it.
>
> Does anyone have experience working with the ICUTokenizer that they can
> share?
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
