[
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283075#comment-17283075
]
Robert Muir commented on LUCENE-9754:
-------------------------------------
Yes, the issue is because of the chunking. Putting aside long documents for a
second: take a short document such as {{ァ 14th}}; it will always first be
split as {{ァ 14|th}}.
That's because this tokenizer first divides on scripts, and lets you use a
different strategy per script. These numbers have a script code of "Common",
and things like accent marks have a script code of "Inherited"; these
"Common"/"Inherited" characters are "sticky". So under normal conditions it
does not break until it hits the 't' ("Latin"). Maybe that is seen as
undesirable in this example, but it is just the tradeoff the tokenizer makes
(splitting on scripts). You can find more discussion of this in the Notes
section of https://unicode.org/reports/tr29/#Word_Boundary_Rules
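To make the "Common"/"Inherited" stickiness concrete, here is a minimal
standalone sketch that just prints the script property for each character,
using ICU4J's {{UScript}} directly (the same property the tokenizer keys off
of). It is illustrative only, and assumes ICU4J is on the classpath; the exact
name strings printed depend on your ICU version.
{code:java}
import com.ibm.icu.lang.UScript;

public class ScriptCodes {
  public static void main(String[] args) {
    String text = "ァ 14th";
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      int script = UScript.getScript(cp);
      System.out.printf("U+%04X '%s' -> %s%n",
          cp, new String(Character.toChars(cp)), UScript.getName(script));
      i += Character.charCount(cp);
    }
    // Roughly expected: ァ -> Katakana, ' ' -> Common, '1'/'4' -> Common,
    // 't'/'h' -> Latin. The Common run attaches to the preceding Katakana,
    // so the first break falls before the 't'.
  }
}
{code}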
But if you feed it a super long document, we can't read the whole document
into RAM at once, so we have to limit to 4k chunks, and the chunking may split
on that space before the script analysis runs: {{ァ|14th}}. This leads to the
inconsistency that you see.
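For reference, the short-document behavior is easy to reproduce directly with
the tokenizer. A minimal sketch, assuming the lucene-analysis-icu module is
available (package names may differ slightly across Lucene versions):
{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Repro {
  public static void main(String[] args) throws Exception {
    ICUTokenizer tok = new ICUTokenizer();
    tok.setReader(new StringReader("ァ 14th"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString());
    }
    tok.end();
    tok.close();
    // On the affected versions this should print three tokens: ァ, 14, th.
    // With a document bigger than the 4k buffer, whether you get two or
    // three tokens depends on where the chunk boundary happens to land.
  }
}
{code}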
For super long documents, the current behavior of this tokenizer will be
annoying. The chunking/4k was written more as a failsafe than anything else:
I don't think a little tweak here or there to this tokenizer will help.
One idea: change the tokenizer to chunk "sentence-at-a-time", splitting on
sentence boundaries first. It might add a little overhead, but then long
documents would behave consistently. The behavior of this chunking would also
be easier for the user to understand: the word segmenter only sees "one
sentence" of context at a time.
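A rough sketch of that idea, purely illustrative and not how the tokenizer
chunks today, using ICU4J's sentence {{BreakIterator}} to pick the chunk
boundaries:
{code:java}
import com.ibm.icu.text.BreakIterator;

public class SentenceChunks {
  public static void main(String[] args) {
    String doc = "First sentence here. ァ 14th appears in the second sentence. A third one.";
    BreakIterator sentences = BreakIterator.getSentenceInstance();
    sentences.setText(doc);
    int start = sentences.first();
    for (int end = sentences.next(); end != BreakIterator.DONE;
         start = end, end = sentences.next()) {
      // Each sentence would be handed to the word segmenter as one chunk,
      // so a chunk boundary could never fall in the middle of "ァ 14th".
      System.out.println("[" + doc.substring(start, end).trim() + "]");
    }
  }
}
{code}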
> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
> Reporter: Trey Jones
> Priority: Major
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ |
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system
> before the space is the same as the writing system after the number, then you
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does
> (trying to split on spaces to create <4k chunks) can create an artificial
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the
> unexpected split of the second token (_14th_). Because chunking changes can
> ripple through a long document, editing text or the effects of a character
> filter can cause changes in tokenization thousands of lines later in a
> document.
> My guess is that some "previous character set" flag is not reset at the
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and
> they are not the same—causing a token split at the character set change—but
> I'm not sure.
>