On Aug 13, 2014, at 1:53 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 8/12/2014 9:13 PM, Steve Rowe wrote:
>> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
>> break = not IsSameS) decisions are based on what I mentioned in my previous
>> message, and the "WBreak" (word break) decision is based on UAX#29 word
>> break rules:
>>
>> Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
>> ----  ----------  ------  --------  -------  -------
>> 治    U+6CBB      Han     Yes       No       Yes
>> ]     U+005D      Common  Yes       No       Yes
>> ,     U+002C      Common  Yes       No       Yes
>> 1     U+0031      Common  --        --       --
>>
>> First, script boundaries are found and used as token boundaries - in the
>> above case, no script boundary is found between "治" and "1" - and then
>> UAX#29 word break rules are used to find token boundaries in between script
>> boundaries - in the above case, there are word boundaries between each
>> character, but ICUTokenizer throws away punctuation-only sequences between
>> token boundaries.
>
> What should we use as a dividing character for situations like this?
> Should we tell our customer that they can't start keywords like this
> (for searching/filtering) with a number?

Assuming you don’t want to add new features to ICUTokenizer (like maybe treating Common script chars in ASCII as if they were in the Latin script):

1. Yes, you could tell the customer to start all Latin-script-containing keywords with a Latin script character (which ASCII digits are not; as described above, they are in the Common script).

2. You could use a separator that forces the script to become Latin, e.g. “;BBllAAhh;” (w/o the quotes), and then use a stop filter to remove the token it produces (“BBllAAhh” in this case, w/o the quotes). You’ll want to choose something that won’t ever occur as a meaningful token.

That’s all I can think of ATM.

Steve
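
For illustration, the script-run step discussed above can be approximated with the JDK's own `Character.UnicodeScript`. This is a minimal sketch, not ICUTokenizer's actual code: the class name `ScriptRunSketch` and the method `scriptRuns` are hypothetical, and the rule that Common/Inherited characters simply join the surrounding run is a simplifying assumption based on the description in this thread.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class ScriptRunSketch {

    // Assumption: Common and Inherited characters never start a new run on their own.
    static boolean isNeutral(UnicodeScript script) {
        return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
    }

    // Splits text wherever two different non-neutral scripts meet.
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript runScript = null; // script of the current run, once known
        int runStart = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (!isNeutral(script)) {
                if (runScript == null) {
                    runScript = script;          // first "real" script seen in this run
                } else if (script != runScript) {
                    runs.add(text.substring(runStart, i)); // script boundary
                    runStart = i;
                    runScript = script;
                }
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart));
        }
        return runs;
    }

    public static void main(String[] args) {
        // ASCII digits are in the Common script, not Latin.
        System.out.println(UnicodeScript.of('1'));
        // Han followed only by Common characters: a single run, no script break.
        System.out.println(scriptRuns("\u6CBB],1"));
        // A Latin-script separator forces a script break after the Han character.
        System.out.println(scriptRuns("\u6CBB;BBllAAhh;1"));
    }
}
```

In this simplified model, the digit stays in the Han run unless a Latin-script character appears before it, which is the effect option 2 relies on. Word-break rules (UAX#29) and the stop filter are deliberately left out.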