Re: ICUTokenizer acting very strangely with oriental characters

Steve Rowe Tue, 12 Aug 2014 20:15:12 -0700

In the table below, the "IsSameS" (is same script) and "SBreak?" (script
break = not IsSameS) decisions are based on what I mentioned in my previous
message, and the "WBreak" (word break) decision is based on UAX#29 word
break rules:


Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
----  ----------  ------  --------  -------  -------
治    U+6CBB      Han     Yes       No       Yes
]     U+005D      Common  Yes       No       Yes
,     U+002C      Common  Yes       No       Yes
1     U+0031      Common  --        --       --

First, script boundaries are found and used as token boundaries; in the
above case, no script boundary is found between "治" and "1". Then
UAX#29 word break rules are used to find token boundaries between script
boundaries; in the above case, there is a word boundary between each pair
of adjacent characters, but ICUTokenizer throws away punctuation-only
sequences between token boundaries.
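
The script-boundary step can be sketched roughly as follows. This is not
the Lucene code: it is a minimal, self-contained approximation that uses
the JDK's java.lang.Character.UnicodeScript instead of ICU's UScript, and
the names ScriptRunSketch, isNeutral and scriptRuns are made up for
illustration. It only splits text into script runs; the UAX#29 word break
pass and the dropping of punctuation-only sequences are not modeled.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

class ScriptRunSketch {
    // Common and Inherited characters take on the script of their neighbors.
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    // Same idea as ScriptIterator.isSameScript(), expressed with JDK enums
    // rather than ICU's int script codes (an assumption of this sketch).
    static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
        return isNeutral(a) || isNeutral(b) || a == b;
    }

    // Split text into runs; a new run starts only where two incompatible
    // scripts meet.
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript runScript = UnicodeScript.COMMON;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isSameScript(runScript, s)) {
                runs.add(text.substring(start, i)); // script boundary found
                start = i;
                runScript = s;
            } else if (isNeutral(runScript)) {
                runScript = s; // first real script seen decides the run
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(scriptRuns("[政 治],100foo"));
    }
}
```

With Shawn's input, the only boundary this sketch finds is between "100"
and "foo": the brackets, space, comma and digits are all Common, so they
stay in the Han run.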

Steve
www.lucidworks.com


On Tue, Aug 12, 2014 at 9:01 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 8/12/2014 6:29 PM, Steve Rowe wrote:
> > Shawn,
> >
> > ICUTokenizer is operating as designed here.
> >
> > The key to understanding this is
> > o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
> > ScriptIterator.next() with the scripts of two consecutive characters; these
> > methods together find script boundaries.  Here’s
> > ScriptIterator.isSameScript():
> >
> >   /** Determine if two scripts are compatible. */
> >   private static boolean isSameScript(int scriptOne, int scriptTwo) {
> >     return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
> >         || scriptOne == scriptTwo;
> >   }
> >
> > ASCII digits are in the Unicode script named “Common” (see
> > <http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt>), and UScript.COMMON
> > (0) is less than UScript.INHERITED (1) (see
> > <http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html>),
> > so there will be no script boundary detected between a character from an
> > oriental script followed by an ASCII digit, or vice versa - the ASCII digit
> > will be assigned the same script as the preceding character.
> >
> > See UAX#24 for more info:
> > <http://www.unicode.org/reports/tr24/tr24-21.html> (that’s the Unicode
> > 6.3.0 version, which is supported by Lucene/Solr 4.9).
>
> So the punctuation isn't considered break-worthy?
>
> This input:
>
> [政 治],100foo
>
> Becomes 政 治, 100, and foo.
>
> Thanks,
> Shawn
>
>
