On Aug 13, 2014, at 1:53 PM, Shawn Heisey wrote:
> On 8/12/2014 9:13 PM, Steve Rowe wrote:
>> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
>> break = not IsSameS) decisions are based on what I mentioned in my previous
>> message, and the "WBreak" (word break) decision
On 8/12/2014 9:13 PM, Steve Rowe wrote:
> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
> break = not IsSameS) decisions are based on what I mentioned in my previous
> message, and the "WBreak" (word break) decision is based on UAX#29 word
> break rules:
>
> Char  Code
In the table below, the "IsSameS" (is same script) and "SBreak?" (script
break = not IsSameS) decisions are based on what I mentioned in my previous
message, and the "WBreak" (word break) decision is based on UAX#29 word
break rules:
Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
On 8/12/2014 6:29 PM, Steve Rowe wrote:
> Shawn,
>
> ICUTokenizer is operating as designed here.
>
> The key to understanding this is
> o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
> ScriptIterator.next() with the scripts of two consecutive characters; these
> m
Shawn,
ICUTokenizer is operating as designed here.
The key to understanding this is
o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
ScriptIterator.next() with the scripts of two consecutive characters; these
methods together find script boundaries. Here’s ScriptIt
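
To make the idea concrete, here is a minimal, self-contained sketch of script-boundary detection. It is NOT the Lucene source: the class name ScriptBoundarySketch and the method splitByScript are hypothetical, and it uses the JDK's Character.UnicodeScript rather than ICU's script data, so the real ScriptIterator may differ in its details. The assumption illustrated is that COMMON and INHERITED characters (digits, punctuation, combining marks) never force a script break on their own.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Lucene ScriptIterator implementation.
class ScriptBoundarySketch {

    // Two scripts are treated as "the same" if they are equal, or if either
    // one is COMMON or INHERITED (assumed to be compatible with any script).
    static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
        return a == UnicodeScript.COMMON || a == UnicodeScript.INHERITED
            || b == UnicodeScript.COMMON || b == UnicodeScript.INHERITED
            || a == b;
    }

    // Splits text into runs, starting a new run wherever consecutive
    // code points are not in the same script.
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        if (text.isEmpty()) {
            return runs;
        }
        int first = text.codePointAt(0);
        UnicodeScript runScript = UnicodeScript.of(first);
        int start = 0;
        int i = Character.charCount(first);
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (!isSameScript(runScript, script)) {
                // Script boundary: close the current run and start a new one.
                runs.add(text.substring(start, i));
                start = i;
                runScript = script;
            } else if (runScript == UnicodeScript.COMMON
                    || runScript == UnicodeScript.INHERITED) {
                // The run has no concrete script yet; adopt this one.
                runScript = script;
            }
            i += Character.charCount(cp);
        }
        runs.add(text.substring(start));
        return runs;
    }

    public static void main(String[] args) {
        // Latin letters followed by Han characters.
        System.out.println(splitByScript("abc\u4e2d\u6587"));
    }
}
```

In this sketch a Latin run followed by Han characters is split at the script change, while a string such as "abc123" stays in one piece because digits are COMMON. Word breaks within each run are a separate decision.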
- Reply message -
From: "Shawn Heisey"
To: "solr-user@lucene.apache.org"
Subject: ICUTokenizer acting very strangely with oriental characters
Date: Tue, Aug 12, 2014 19:00
See the original message on this thread for full details. Some
additional information:
This happens on version 4.6.1, 4.7.2, and 4.9.0. Here is a screenshot
showing the analysis problem in more detail. The first line you can see
is the ICUTokenizer.
https://www.dropbox.com/s/9wbi7lz77ivya9j/IC