Re: ICUTokenizer acting very strangely with oriental characters

2014-08-14 Thread Steve Rowe
On Aug 13, 2014, at 1:53 PM, Shawn Heisey wrote:
> On 8/12/2014 9:13 PM, Steve Rowe wrote:
>> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
>> break = not IsSameS) decisions are based on what I mentioned in my previous
>> message, and the "WBreak" (word break) decision

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-13 Thread Shawn Heisey
On 8/12/2014 9:13 PM, Steve Rowe wrote:
> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
> break = not IsSameS) decisions are based on what I mentioned in my previous
> message, and the "WBreak" (word break) decision is based on UAX#29 word
> break rules:
>
> Char  Code

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Steve Rowe
In the table below, the "IsSameS" (is same script) and "SBreak?" (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the "WBreak" (word break) decision is based on UAX#29 word break rules:

Char  Code Point  Script  IsSameS?  SBreak?  WBreak?

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Shawn Heisey
On 8/12/2014 6:29 PM, Steve Rowe wrote:
> Shawn,
>
> ICUTokenizer is operating as designed here.
>
> The key to understanding this is
> o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
> ScriptIterator.next() with the scripts of two consecutive characters; these
> m

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Steve Rowe
Shawn,

ICUTokenizer is operating as designed here.

The key to understanding this is o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from ScriptIterator.next() with the scripts of two consecutive characters; these methods together find script boundaries. Here’s ScriptIt
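The script-boundary idea described above can be illustrated with a small sketch. This is not the Lucene source: the class name `ScriptBoundarySketch` and the method `splitByScript` are hypothetical, it uses the JDK's `Character.UnicodeScript` rather than ICU's script codes, and it assumes that characters in the Common and Inherited scripts never force a break on their own. It only finds script runs; the UAX#29 word-break pass mentioned in the thread is not modeled.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of script-run detection; not the Lucene implementation.
class ScriptBoundarySketch {

    // Assumption: COMMON and INHERITED are "neutral" and blend into any run.
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    // Two scripts are treated as the same if either is neutral or they are equal.
    static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
        return isNeutral(a) || isNeutral(b) || a == b;
    }

    // Splits text into runs, starting a new run wherever isSameScript is false.
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        if (text.isEmpty()) {
            return runs;
        }
        int first = text.codePointAt(0);
        UnicodeScript runScript = UnicodeScript.of(first);
        int start = 0;
        int i = Character.charCount(first);
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript cur = UnicodeScript.of(cp);
            if (isSameScript(runScript, cur)) {
                // A neutral run adopts the first concrete script it meets.
                if (isNeutral(runScript) && !isNeutral(cur)) {
                    runScript = cur;
                }
            } else {
                runs.add(text.substring(start, i)); // script break
                start = i;
                runScript = cur;
            }
            i += Character.charCount(cp);
        }
        runs.add(text.substring(start));
        return runs;
    }

    public static void main(String[] args) {
        // Latin letters followed by Cyrillic letters (written as escapes).
        System.out.println(splitByScript("abc\u0430\u0431\u0432"));
    }
}
```

In this sketch the run's script is tracked across neutral characters, so digits or punctuation between two different scripts stay attached to the preceding run; that is a design choice made for the illustration.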

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Rik Tamm-Daniels
mmn jnbbbjb)n9nooon

Sent from my HTC

- Reply message -
From: "Shawn Heisey"
To: "solr-user@lucene.apache.org"
Subject: ICUTokenizer acting very strangely with oriental characters
Date: Tue, Aug 12, 2014 19:00

See the original message on this thread for full details. Some addi

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Shawn Heisey
See the original message on this thread for full details.

Some additional information: This happens on versions 4.6.1, 4.7.2, and 4.9.0.

Here is a screenshot showing the analysis problem in more detail. The first line you can see is the ICUTokenizer.

https://www.dropbox.com/s/9wbi7lz77ivya9j/IC