[ https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295585#comment-17295585 ]
Robert Muir commented on LUCENE-9754:
-------------------------------------

I explained here what happens in the first comment, but you didn't like my answer. Please re-read my answer, especially "That's because this tokenizer first divides on scripts".

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9754
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9754
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.5
>         Environment: Tested most recently on Elasticsearch 6.5.4.
>            Reporter: Trey Jones
>            Priority: Major
>         Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ |
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system
> before the space is the same as the writing system after the number, then you
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does
> (trying to split on spaces to create <4k chunks) can create an artificial
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the
> unexpected split of the second token (_14th_). Because chunking changes can
> ripple through a long document, editing text or the effects of a character
> filter can cause changes in tokenization thousands of lines later in a
> document.
> My guess is that some "previous character set" flag is not reset at the
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and
> they are not the same, causing a token split at the character set change, but
> I'm not sure.
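The behaviour described in the issue, and the "first divides on scripts" explanation, can be illustrated with a toy model. This is only a sketch and not the ICUTokenizer implementation: `rough_script`, `script_runs`, and `toy_tokenize` are hypothetical names, the script lookup is approximated from the first word of each character's Unicode name (the Python standard library has no Script property), and whitespace splitting stands in for the real word-break rules. The model assumes that spaces and digits carry no script of their own and simply stay in the current run.

```python
import unicodedata


def rough_script(ch):
    """Rough stand-in for the Unicode Script property (an assumption).

    Spaces and digits return None, meaning "no script of their own".
    """
    if ch.isspace() or ch.isdigit():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_runs(text):
    """Split text into runs, starting a new run whenever the script changes."""
    runs = []
    start = 0
    current = None
    for i, ch in enumerate(text):
        script = rough_script(ch)
        if script is None:
            continue  # spaces and digits stay in the current run
        if current is None:
            current = script
        elif script != current:
            runs.append(text[start:i])
            start = i
            current = script
    runs.append(text[start:])
    return runs


def toy_tokenize(text):
    """First divide on scripts, then split each run on whitespace."""
    tokens = []
    for run in script_runs(text):
        tokens.extend(run.split())
    return tokens


print(toy_tokenize("x 14th"))   # one Latin run
print(toy_tokenize("ァ 14th"))  # Katakana run "ァ 14", then Latin run "th"
# Tokenizing the two pieces separately mimics a chunk boundary at the space.
print(toy_tokenize("ァ") + toy_tokenize("14th"))
```

In this model the digits in _ァ 14th_ end up in the Katakana run because nothing resets the current script at the space, so the run boundary falls between _14_ and _th_. Feeding the two pieces in separately, as a chunk boundary at the space would, keeps _14th_ whole.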