Re: ICUTokenizer acting very strangely with oriental characters

Steve Rowe Thu, 14 Aug 2014 14:05:14 -0700

On Aug 13, 2014, at 1:53 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 8/12/2014 9:13 PM, Steve Rowe wrote:
>> In the table below, the "IsSameS?" (is same script) and "SBreak?" (script
>> break = not IsSameS) decisions are based on what I mentioned in my previous
>> message, and the "WBreak?" (word break) decision is based on UAX#29 word
>> break rules:
>> 
>> Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
>> ----  ----------  ------  --------  -------  -------
>> 治    U+6CBB      Han     Yes       No       Yes
>> ]     U+005D      Common  Yes       No       Yes
>> ,     U+002C      Common  Yes       No       Yes
>> 1     U+0031      Common  --        --       --
>> 
>> First, script boundaries are found and used as token boundaries - in the
>> above case, no script boundary is found between "治" and "1" - and then
>> UAX#29 word break rules are used to find token boundaries in between script
>> boundaries - in the above case, there are word boundaries between each
>> character, but ICUTokenizer throws away punctuation-only sequences between
>> token boundaries.
> 
> What should we use as a dividing character for situations like this? 
> Should we tell our customer that they can't start keywords like this
> (for searching/filtering) with a number?

Assuming you don’t want to add new features to ICUTokenizer (like maybe 
treating Common script chars in ASCII as if they were in the Latin script):

1. Yes, you could tell the customer to start all Latin-script-containing 
keywords with a Latin script character (which ASCII digits are not; as 
described above, they are in the Common script).

2. You could use a separator that forces the script to become Latin, e.g. 
“;BBllAAhh;” (w/o the quotes), and then use a stop filter to remove the 
resulting token (“BBllAAhh” in this case, w/o the quotes). You’ll want to 
choose something that won’t ever occur as a meaningful token.
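
To make the script-run idea behind both options concrete, here is a minimal
sketch that uses only the JDK's Character.UnicodeScript. It is a simplified
model of the script-boundary step described above, not ICUTokenizer's actual
code: it assumes that Common and Inherited characters simply join whatever run
they appear in, and the names ScriptRuns and scriptRuns are made up for
illustration. It does not apply UAX#29 word breaks or drop punctuation.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or ICU.
class ScriptRuns {

    /**
     * Splits s into runs, each reported as "SCRIPT:text".
     * Simplifying assumption: a COMMON or INHERITED code point never starts
     * a new run; it is appended to the run currently being built.
     */
    static List<String> scriptRuns(String s) {
        List<String> runs = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        UnicodeScript runScript = UnicodeScript.COMMON;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            UnicodeScript sc = UnicodeScript.of(cp);
            boolean neutral = sc == UnicodeScript.COMMON
                    || sc == UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript == UnicodeScript.COMMON) {
                    // First non-neutral script seen in this run.
                    runScript = sc;
                } else if (sc != runScript) {
                    // Script changed: close the current run, start a new one.
                    runs.add(runScript + ":" + text);
                    text.setLength(0);
                    runScript = sc;
                }
            }
            text.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (text.length() > 0) {
            runs.add(runScript + ":" + text);
        }
        return runs;
    }

    public static void main(String[] args) {
        // "\u6CBB" is the Han character U+6CBB from the table above.
        String[] inputs = {"\u6CBB],1", "\u6CBB;BBllAAhh;1"};
        for (String input : inputs) {
            System.out.println(input + " -> " + scriptRuns(input));
        }
    }
}
```

Under this model the first input stays a single Han run that includes the
digit, while in the second input the run breaks before the first "B", so the
digit lands in a Latin run instead.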

That’s all I can think of ATM.

Steve
