In the table below, the "IsSameS?" (is same script) and "SBreak?" (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the "WBreak?" (word break) decision is based on UAX#29 word break rules:

Char    Code Point      Script   IsSameS?       SBreak?    WBreak?
------  --------------  -------  -------------  ---------  -----------
治      U+6CBB          Han      Yes            No         Yes
]       U+005D          Common   Yes            No         Yes
,       U+002C          Common   Yes            No         Yes
1       U+0031          Common   --             --         --

First, script boundaries are found and used as token boundaries - in the above case, no script boundary is found between "治" and "1" - and then UAX#29 word break rules are used to find token boundaries between script boundaries - in the above case, there are word boundaries between each character, but ICUTokenizer throws away punctuation-only sequences between token boundaries.

Steve
www.lucidworks.com

On Tue, Aug 12, 2014 at 9:01 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 8/12/2014 6:29 PM, Steve Rowe wrote:
> > Shawn,
> >
> > ICUTokenizer is operating as designed here.
> >
> > The key to understanding this is
> > o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
> > ScriptIterator.next() with the scripts of two consecutive characters; these
> > methods together find script boundaries. Here’s
> > ScriptIterator.isSameScript():
> >
> >   /** Determine if two scripts are compatible. */
> >   private static boolean isSameScript(int scriptOne, int scriptTwo) {
> >     return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
> >         || scriptOne == scriptTwo;
> >   }
> >
> > ASCII digits are in the Unicode script named “Common” (see
> > <http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt>), and UScript.COMMON
> > (0) is less than UScript.INHERITED (1) (see
> > <http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html>),
> > so there will be no script boundary detected between a character from an
> > oriental script followed by an ASCII digit, or vice versa - the ASCII digit
> > will be assigned the same script as the preceding character.
> >
> > See UAX#24 for more info:
> > <http://www.unicode.org/reports/tr24/tr24-21.html> (that’s the Unicode
> > 6.3.0 version, which is supported by Lucene/Solr 4.9).
>
> So the punctuation isn't considered break-worthy?
>
> This input:
>
> [政 治],100foo
>
> Becomes 政 治, 100, and foo.
>
> Thanks,
> Shawn
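
To make the script-boundary step concrete, here is a rough sketch of the idea applied to the input above. It is not the actual ScriptIterator code: it assumes the JDK's java.lang.Character.UnicodeScript as a stand-in for ICU4J's UScript ints, and the class and method names (ScriptRuns, scriptRuns) are made up for illustration. It only finds script runs; the UAX#29 word-break pass and the discarding of punctuation-only sequences are not shown.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or ICU4J.
final class ScriptRuns {

    /** True for the two "neutral" scripts that never force a script boundary. */
    static boolean isCommonOrInherited(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    /** Same rule as the quoted isSameScript(), expressed with JDK script values. */
    static boolean isSameScript(UnicodeScript one, UnicodeScript two) {
        return isCommonOrInherited(one) || isCommonOrInherited(two) || one == two;
    }

    /** Splits text into runs, starting a new run only at a script boundary (simplified). */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        if (text.isEmpty()) {
            return runs;
        }
        int start = 0;
        UnicodeScript runScript = UnicodeScript.COMMON;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript sc = UnicodeScript.of(cp);
            if (!isSameScript(runScript, sc)) {
                // Two different "real" scripts meet here: close the current run.
                runs.add(text.substring(start, i));
                start = i;
                runScript = sc;
            } else if (isCommonOrInherited(runScript)) {
                // The run has no real script yet, so it adopts this character's script.
                runScript = sc;
            }
            i += Character.charCount(cp);
        }
        runs.add(text.substring(start));
        return runs;
    }

    public static void main(String[] args) {
        for (String run : scriptRuns("[政 治],100foo")) {
            System.out.println(run);
        }
    }
}
```

Under these assumptions the input splits into two script runs, "[政 治],100" and "foo": the brackets, comma, space and digits are all Common, so they stay in the Han run, and the only script boundary falls where Latin "f" follows the digits. That boundary is what separates "100" from "foo" in the output.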