Shawn,
ICUTokenizer is operating as designed here.
The key to understanding this is
o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
ScriptIterator.next() with the scripts of two consecutive characters; these
methods together find script boundaries. Here’s ScriptIterator.isSameScript():

  /** Determine if two scripts are compatible. */
  private static boolean isSameScript(int scriptOne, int scriptTwo) {
    return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
        || scriptOne == scriptTwo;
  }

ASCII digits are in the Unicode script named “Common” (see
<http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt>), and UScript.COMMON (0)
is less than UScript.INHERITED (1) (see
<http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html>),
so no script boundary will be detected between a character from an East
Asian script and an adjacent ASCII digit, in either order: the ASCII digit
is assigned the same script as the neighboring character.
See UAX#24 for more info: <http://www.unicode.org/reports/tr24/tr24-21.html>
(that’s the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).
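
To make the effect concrete, here is a rough, self-contained sketch of the run-splitting rule described above. It is not the Lucene code: it assumes the JDK's Character.UnicodeScript in place of ICU's UScript, the class and method names (ScriptRunSketch, scriptRuns, isWeak) are made up for illustration, and it leaves out anything else the real ScriptIterator does.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

class ScriptRunSketch {

    /** Common and Inherited are treated as compatible with any script. */
    static boolean isWeak(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    /** Same idea as isSameScript() above, using JDK enums (an assumption). */
    static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
        return isWeak(a) || isWeak(b) || a == b;
    }

    /** Splits text into runs of compatible scripts, formatted as "SCRIPT:text". */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            // Every run starts out as Common, so its first character always fits.
            UnicodeScript runScript = UnicodeScript.COMMON;
            int index = start;
            while (index < text.length()) {
                int cp = text.codePointAt(index);
                UnicodeScript sc = UnicodeScript.of(cp);
                if (!isSameScript(runScript, sc)) {
                    break; // script boundary found
                }
                // The first "strong" script seen becomes the script of the whole run.
                if (isWeak(runScript) && !isWeak(sc)) {
                    runScript = sc;
                }
                index += Character.charCount(cp);
            }
            runs.add(runScript + ":" + text.substring(start, index));
            start = index;
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(scriptRuns("20世紀"));   // [HAN:20世紀]
        System.out.println(scriptRuns("世紀100"));  // [HAN:世紀100]
        System.out.println(scriptRuns("世紀abc"));  // [HAN:世紀, LATIN:abc]
    }
}
```

In this sketch the digits never start a run of their own when they touch Han text; a boundary appears only where two different non-Common scripts meet.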
Steve
On Aug 12, 2014, at 7:00 PM, Shawn Heisey wrote:
> See the original message on this thread for full details. Some
> additional information:
>
> This happens on version 4.6.1, 4.7.2, and 4.9.0. Here is a screenshot
> showing the analysis problem in more detail. The first line you can see
> is the ICUTokenizer.
>
> https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png
>
> The original field value was:
>
> 20世紀の100人;ポートレートアーカイブス;政治家・軍人;政治家・指導
> 者・軍人;[政 治],100peopeof20century,pploftwentycentury,pploftwentycentury
>
> Thanks,
> Shawn
>