Shawn, ICUTokenizer is operating as designed here.
The key to understanding this is o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from ScriptIterator.next() with the scripts of two consecutive characters; these methods together find script boundaries. Here’s ScriptIterator.isSameScript():

    /** Determine if two scripts are compatible. */
    private static boolean isSameScript(int scriptOne, int scriptTwo) {
      return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
          || scriptOne == scriptTwo;
    }

ASCII digits are in the Unicode script named “Common” (see <http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt>), and UScript.COMMON (0) is less than UScript.INHERITED (1) (see <http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html>), so there will be no script boundary detected between a character from an oriental script followed by an ASCII digit, or vice versa; the ASCII digit will be assigned the same script as the preceding character.

See UAX#24 for more info: <http://www.unicode.org/reports/tr24/tr24-21.html> (that’s the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).

Steve

On Aug 12, 2014, at 7:00 PM, Shawn Heisey <s...@elyograg.org> wrote:

> See the original message on this thread for full details. Some
> additional information:
>
> This happens on version 4.6.1, 4.7.2, and 4.9.0. Here is a screenshot
> showing the analysis problem in more detail. The first line you can see
> is the ICUTokenizer.
>
> https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png
>
> The original field value was:
>
> 20世紀の100人;ポートレートアーカイブス;政治家・軍人;政治家・指導
> 者・軍人;[政 治],100peopeof20century,pploftwentycentury,pploftwentycentury
>
> Thanks,
> Shawn
>
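
P.S. For anyone who wants to experiment with the isSameScript() rule quoted above without pulling in ICU, here is a rough, simplified sketch. It is not the Lucene code: it assumes the JDK's java.lang.Character.UnicodeScript as a stand-in for ICU's UScript, the class name ScriptRunSketch and the helper scriptRuns() are made up for illustration, and it ignores the extra handling the real ScriptIterator does (combining marks, for instance). Script assignments can also differ between Unicode versions.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or ICU.
class ScriptRunSketch {

    // COMMON and INHERITED play the role of "scriptX <= UScript.INHERITED".
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    // Same compatibility rule as the quoted isSameScript(), on JDK script values.
    static boolean isSameScript(UnicodeScript one, UnicodeScript two) {
        return isNeutral(one) || isNeutral(two) || one == two;
    }

    // Splits text into runs, starting a new run only at an incompatible script.
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript runScript = UnicodeScript.COMMON;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isSameScript(runScript, s)) {
                // Script boundary: close the current run, start a new one here.
                runs.add(text.substring(start, i));
                start = i;
                runScript = s;
            } else if (isNeutral(runScript)) {
                // The run had no real script yet; adopt this character's script.
                runScript = s;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // U+4E16 U+7D00 (Han), U+306E (Hiragana), ASCII digits, U+4EBA (Han)
        System.out.println(scriptRuns("\u4e16\u7d00\u306e100\u4eba"));
        // Latin letters followed by ASCII digits
        System.out.println(scriptRuns("abc123"));
    }
}
```

In this sketch the digits never open a run of their own: because Common is compatible with everything, "100" simply stays in whichever run is already open, which is the same effect described above for ICUTokenizer.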