[ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283971#comment-17283971
 ] 

Trey Jones commented on LUCENE-9754:
------------------------------------

The inconsistency caused by chunking is a very confusing, albeit rare, 
problem—but I don't think it is what needs to be fixed here. The chunking 
algorithm assumes that whitespace is a reasonable place to split tokens, and 
that should be a valid assumption.

Right now the ICU Tokenizer tokenizes _cat 14th γάτα 1ος cat 1ος γάτα 14th_ as 
_cat | 14th | γάτα | 1οσ | cat | 1 | οσ | γάτα | 14 | th._ Does anyone expect 
the tokenization of _14th_ or _1ος_ (Greek "1st") to depend on the word before 
it? It happens across punctuation too, so a word in a different _sentence_ can 
trigger different tokenization. Take this example: "The top results are: 1st is 
the Greek word for cat, γάτα. 2nd is the French word for cat, chat. 3rd is ..." 
No one would reasonably expect to get the tokens _1st, 2, nd,_ and _3rd_ out of 
this, but that's what happens. (Splitting on sentences wouldn't solve this one 
either; just replace the periods with semicolons and it's one long sentence.)
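
To make this easy to reproduce, here is a minimal sketch (not part of the 
original report) that feeds the example string straight into Lucene's 
ICUTokenizer with its default configuration; it assumes the lucene-analysis-icu 
module is on the classpath.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IcuTokenizerRepro {
  public static void main(String[] args) throws Exception {
    String text = "cat 14th γάτα 1ος cat 1ος γάτα 14th";
    // Default ICUTokenizer configuration: per-script UAX #29 word breaking.
    try (ICUTokenizer tokenizer = new ICUTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // One token per line; per the behavior described above, the second
        // "1ος" and the second "14th" come out split in two.
        System.out.println(term.toString());
      }
      tokenizer.end();
    }
  }
}
{code}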

The Word Boundary Rules that Robert linked to explicitly say _Do not break 
within sequences of digits, or digits adjacent to letters (“3a”, or “A3”)._ The 
[Unicode Segmentation 
Utility|https://util.unicode.org/UnicodeJsps/breaks.jsp?a=The%20top%20results%20are:%201st%20is%20the%20Greek%20word%20for%20cat,%20%CE%B3%CE%AC%CF%84%CE%B1.%202nd%20is%20the%20French%20word%20for%20cat,%20chat.%203rd%20is%20...]
 also doesn't split the tokens this way.
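
As a cross-check (again just a sketch, not from the report), plain ICU4J word 
breaking on the same sentence keeps _1st_, _2nd_, and _3rd_ whole, which matches 
the rule quoted above and the Segmentation Utility output:

{code:java}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class Uax29WordBreakCheck {
  public static void main(String[] args) {
    String text = "The top results are: 1st is the Greek word for cat, γάτα. "
        + "2nd is the French word for cat, chat. 3rd is ...";
    BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
    words.setText(text);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE;
         start = end, end = words.next()) {
      String piece = text.substring(start, end).trim();
      if (!piece.isEmpty()) {
        // "1st", "2nd", and "3rd" each come out as a single segment here.
        System.out.println(piece);
      }
    }
  }
}
{code}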

As I said above, my guess is that there is some sort of "most recent character 
set" (script) flag that should be reset to null or "none" at whitespace, line 
breaks, etc.
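
To make that concrete, here is a purely illustrative sketch (not the actual 
tokenizer code, just the kind of state handling I have in mind), where the 
remembered script is cleared at whitespace so a following digit run can't 
inherit it:

{code:java}
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UScript;

public class ScriptResetSketch {
  /**
   * Tracks the script of the most recent "real" (non-Common, non-Inherited)
   * character, resetting to UNKNOWN at whitespace so that digits after a space
   * are not attributed to whatever script happened to precede the space.
   */
  static int lastScriptWithReset(String text) {
    int prevScript = UScript.UNKNOWN; // hypothetical "most recent script" state
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (UCharacter.isWhitespace(cp)) {
        prevScript = UScript.UNKNOWN; // the reset that seems to be missing
      } else {
        int script = UScript.getScript(cp);
        if (script != UScript.COMMON && script != UScript.INHERITED) {
          prevScript = script; // digits/punctuation don't overwrite the state
        }
      }
      i += Character.charCount(cp);
    }
    return prevScript;
  }
}
{code}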

Other examples taken from English Wikipedia (it does not use the ICU Tokenizer, 
but it's a good place to find natural examples):
- resistor 1.5kΩ 12W (12|w)
- πρώτη 5G πόλη (5|G)
- the σ 2p has (2|p)
- Суворове в 3D (3|D)
- ФИБА 3x3 (3|x3)
- интерконективен 400kV (400|kv)
- collection crosses रु 18cr mark (18|cr)
- 2019 వేడుక 17th Santosham Awards (17|th)
- หลวงพี่แจ๊ส 4G (4|g)
- factor of 2π (2|π)
- 50m-bazen.pdf 50м базен (50|м)
- hydroxyprednisolone 16α,17α-acetonide (16|α|17α)

That last one is particularly egregious, since 16α is separated, but 17α is not.


> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9754
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9754
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.5
>         Environment: Tested most recently on Elasticsearch 6.5.4.
>            Reporter: Trey Jones
>            Priority: Major
>         Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
