[
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283105#comment-17283105
]
Michael Sokolov commented on LUCENE-9754:
-----------------------------------------
Would it make sense to have the ability to treat digits as Latin script? I
think we ended up doing that in order to apply (maybe anglo-euro-centric?)
number constructs that nevertheless do appear in multilingual texts: units
(1", 1ft, 1m., etc.), ranges (1-10), and ordinals (1st, 2nd, etc.).
> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
> Reporter: Trey Jones
> Priority: Major
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ |
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system
> before the space is the same as the writing system after the number, then you
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does
> (trying to split on spaces to create <4k chunks) can create an artificial
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the
> unexpected split of the second token (_14th_). Because chunking changes can
> ripple through a long document, editing text or the effects of a character
> filter can cause changes in tokenization thousands of lines later in a
> document.
> My guess is that some "previous character set" flag is not reset at the
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and
> they are not the same, causing a token split at the character set change, but
> I'm not sure.
>
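
The guess in the report can be illustrated with a toy model. This is only a
sketch and not Lucene or ICU code: the names script_of and toy_tokenize are
hypothetical, the script of a letter is guessed from the first word of its
Unicode character name, and digits and whitespace are assumed to have no
script of their own, so they join whatever script run came before them. The
digits_as_latin flag is likewise an assumption, added to model the suggestion
of treating digits as Latin script.

```python
import unicodedata


def script_of(ch, digits_as_latin=False):
    """Rough script guess from the Unicode character name (toy heuristic)."""
    if ch.isdigit():
        # Assumption: digits either count as Latin or have no script at all.
        return "LATIN" if digits_as_latin else None
    if not ch.isalpha():
        return None  # whitespace and punctuation carry no script of their own
    return unicodedata.name(ch).split()[0]  # e.g. "LATIN", "KATAKANA"


def toy_tokenize(text, digits_as_latin=False):
    # Step 1: cut the text into script runs. Script-less characters are
    # appended to the current run, and the remembered script is NOT reset
    # at whitespace (this is the behavior the report hypothesizes).
    runs, current, current_script = [], "", None
    for ch in text:
        script = script_of(ch, digits_as_latin)
        if script is not None and current_script is not None and script != current_script:
            runs.append(current)
            current = ""
        if script is not None:
            current_script = script
        current += ch
    if current:
        runs.append(current)
    # Step 2: split each script run on whitespace to get tokens.
    return [token for run in runs for token in run.split()]
```

Under these assumptions, toy_tokenize("x 14th") gives ['x', '14th'] and
toy_tokenize("ァ 14th") gives ['ァ', '14', 'th'], matching the two cases in the
report. With digits_as_latin=True both inputs yield two tokens, because the
script boundary then falls before the digits rather than after them.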