[ https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285443#comment-17285443 ]

Trey Jones commented on LUCENE-9754:
------------------------------------

>  the rules you see don't mean what you might infer. You need to go to the 
> notes section

I have not carefully studied the entire annex, but I have read through the notes 
right under the WB rules and I don't see anything that explains the behavior 
I'm seeing. Can you point to where I am misinterpreting 
[WB10|https://unicode.org/reports/tr29/#WB10] (under the label _Do not break 
within sequences of digits, or digits adjacent to letters (“3a”, or “A3”)_), or 
overlooking an interaction with some other rule or detail in the notes? The ICU 
Tokenizer tokenizes _p 3a π 3a_ as p | 3a | π | 3 | a, so the second instance 
of _3a_ is split apart, but the first is not. As best I can read it, that is a 
direct violation of WB10 and of the example in the label above it.
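For what it's worth, here is a toy sketch (my own illustration, not ICU's 
actual implementation) of the relevant rules, WB5 (ALetter x ALetter), WB8 
(Numeric x Numeric), WB9 (ALetter x Numeric), and WB10 (Numeric x ALetter), 
applied to this string. Read this way, both instances of _3a_ should survive 
intact, regardless of script:

```python
# Toy sketch of the UAX #29 word-break rules at issue (WB5/WB8/WB9/WB10):
# do not break within runs of letters and digits. The property lookup is
# simplified for this example only; script is deliberately irrelevant.

def wb_property(ch):
    # Both 'p' (Latin) and 'π' (Greek) are ALetter under UAX #29.
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    return "Other"  # whitespace, punctuation, etc.

def tokenize(text):
    tokens, current = [], ""
    prev = None
    for ch in text:
        prop = wb_property(ch)
        if prop == "Other":
            if current:
                tokens.append(current)
                current = ""
        elif prev in (None, "Other"):
            current = ch
        else:
            # WB5/WB8/WB9/WB10: no break between letter/digit classes.
            current += ch
        prev = prop
    if current:
        tokens.append(current)
    return tokens

print(tokenize("p 3a \u03c0 3a"))  # → ['p', '3a', 'π', '3a']
```

Under this reading, the expected output is p | 3a | π | 3a, with no split in 
either instance of _3a_.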

> this issue just has to do with stupid low-level text buffering

I respectfully disagree. I tracked down the low-level buffering situation 
because of the unexpected inconsistency (which is definitely surprising to an 
end user), but it is not the point of the ticket. The inconsistency in how 
_3a_ is tokenized following _p_ vs. _π_ is my main concern, so I'd appreciate 
it if you could explain how UAX #29 accounts for that behavior, so I can 
understand it and explain it to other people on my end.


> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9754
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9754
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.5
>         Environment: Tested most recently on Elasticsearch 6.5.4.
>            Reporter: Trey Jones
>            Priority: Major
>         Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  
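The guess at the end of the description can be made concrete with a hedged, 
purely illustrative sketch (not Lucene's actual code): a tokenizer that keeps 
a "previous script" flag, never resets it at whitespace, and treats digits as 
script-neutral reproduces exactly the reported asymmetry:

```python
# Illustrative sketch of the suspected bug (NOT Lucene's actual code):
# a "previous script" flag survives across whitespace, and digits are
# script-neutral, so a letter after digits is compared against the last
# letter before the space.

def script_of(ch):
    # Toy script lookup, sufficient only for the examples in this ticket.
    if "a" <= ch.lower() <= "z":
        return "Latin"
    if "\u30a0" <= ch <= "\u30ff":
        return "Katakana"
    if "\u0370" <= ch <= "\u03ff":
        return "Greek"
    return None  # digits and spaces carry no script

def buggy_tokenize(text):
    tokens, current = [], ""
    prev_script = None  # the bug: never reset at whitespace
    for ch in text:
        if ch.isspace():
            if current:
                tokens.append(current)
                current = ""
            continue  # prev_script deliberately left untouched
        script = script_of(ch)
        if script is not None and prev_script is not None \
                and script != prev_script:
            # split on apparent script change, even across digits
            if current:
                tokens.append(current)
                current = ""
        current += ch
        if script is not None:
            prev_script = script
    if current:
        tokens.append(current)
    return tokens

print(buggy_tokenize("x 14th"))       # → ['x', '14th']
print(buggy_tokenize("\u30a1 14th"))  # → ['ァ', '14', 'th']
```

With _x 14th_, the script before and after the digits is Latin in both cases, 
so no split occurs; with _ァ 14th_, the stale Katakana flag differs from the 
Latin _t_, producing the extra boundary inside _14th_.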



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
