Re: ICUTokenizer acting very strangely with oriental characters

Steve Rowe Tue, 12 Aug 2014 17:30:28 -0700

Shawn,

ICUTokenizer is operating as designed here.

The key to understanding this is 
o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from 
ScriptIterator.next() with the scripts of two consecutive characters; these 
methods together find script boundaries.  Here’s ScriptIterator.isSameScript():

  /** Determine if two scripts are compatible. */
  private static boolean isSameScript(int scriptOne, int scriptTwo) {
    return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
        || scriptOne == scriptTwo;
  }

ASCII digits are in the Unicode script named “Common” (see 
<http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt>), and UScript.COMMON (0) 
is less than UScript.INHERITED (1) (see 
<http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html>),
so isSameScript() returns true whenever either of the two scripts is Common.  
As a result, no script boundary is detected between a character from an 
oriental script and an ASCII digit that follows it, or vice versa: the ASCII 
digit is assigned the same script as the neighboring character.
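
To make the effect concrete, here is a rough, self-contained sketch of the script-run logic described above.  It is not the Lucene code: it uses the JDK's java.lang.Character.UnicodeScript instead of ICU's UScript (so there are no integer script codes to compare), and the class name ScriptRunSketch and the method scriptRuns() are made up for illustration.  It only finds script runs; it does not do the word-break step that ICUTokenizer applies afterwards, and it ignores any special handling Lucene has for combining marks or Japanese scripts.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or ICU.
class ScriptRunSketch {

  // Common and Inherited are "neutral": they never force a boundary.
  static boolean isNeutral(UnicodeScript s) {
    return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
  }

  // Same idea as ScriptIterator.isSameScript(), expressed on the JDK enum.
  static boolean isSameScript(UnicodeScript one, UnicodeScript two) {
    return isNeutral(one) || isNeutral(two) || one == two;
  }

  // Split text into runs wherever two consecutive scripts are incompatible.
  static List<String> scriptRuns(String text) {
    List<String> runs = new ArrayList<>();
    if (text.isEmpty()) {
      return runs;
    }
    int start = 0;
    UnicodeScript runScript = UnicodeScript.COMMON;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      UnicodeScript sc = UnicodeScript.of(cp);
      if (!isSameScript(runScript, sc)) {
        // Incompatible script: close the current run and start a new one.
        runs.add(text.substring(start, i));
        start = i;
        runScript = sc;
      } else if (isNeutral(runScript)) {
        // The run so far was only neutral characters: adopt the real script.
        runScript = sc;
      }
      i += Character.charCount(cp);
    }
    runs.add(text.substring(start));
    return runs;
  }

  public static void main(String[] args) {
    System.out.println(scriptRuns("世紀100abc")); // digits stay with the Han run
    System.out.println(scriptRuns("100abc"));     // digits join the Latin run
  }
}
```

With this sketch, "世紀100abc" comes out as two runs, "世紀100" and "abc": the digits are Common, so they stay attached to the preceding Han characters, and the boundary only appears at the first Latin letter.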

See UAX#24 for more info: <http://www.unicode.org/reports/tr24/tr24-21.html> 
(that’s the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).

Steve

On Aug 12, 2014, at 7:00 PM, Shawn Heisey <s...@elyograg.org> wrote:

> See the original message on this thread for full details.  Some
> additional information:
> 
> This happens on version 4.6.1, 4.7.2, and 4.9.0.  Here is a screenshot
> showing the analysis problem in more detail.  The first line you can see
> is the ICUTokenizer.
> 
> https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png
> 
> The original field value was:
> 
> ２０世紀の１００人;ポートレートアーカイブス;政治家・軍人;政治家・指導
> 者・軍人;[政 治],100peopeof20century,pploftwentycentury,pploftwentycentury
> 
> Thanks,
> Shawn
>
