[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

Robert Muir (Jira) Thu, 04 Mar 2021 13:27:07 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295585#comment-17295585
 ]


Robert Muir commented on LUCENE-9754:
-------------------------------------

I explained here what happens in the first comment, but you didn't like my 
answer. Please re-read my answer, especially "That's because this tokenizer 
first divides on scripts". 

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9754
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9754
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.5
>         Environment: Tested most recently on Elasticsearch 6.5.4.
>            Reporter: Trey Jones
>            Priority: Major
>         Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before preceeding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

Reply via email to