[ https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295622#comment-17295622 ]
Trey Jones commented on LUCENE-9754:
------------------------------------

I appreciate that this is frustrating, and I’m sorry that we seem to be frustrating each other. You seem to feel that I am not listening to what you have to say, which is no surprise, since I feel that you are not listening to what I have to say. Can we try again to meet somewhere in the middle?

{quote}That's because this tokenizer first divides on scripts{quote}

I’m trying my best to hear what you are saying here. The current behavior is the result of the tokenizer splitting on scripts before splitting on spaces. This does in fact completely explain the output we see in the _p 3a π 3a_ example.

However, what the tokenizer _does_ and what the tokenizer is _supposed to do_ are not necessarily the same thing. I read your comments as offering the Word Boundary Rules and related Notes from Annex 29 as justification for the tokenizer’s behavior. I read over them, and I don’t see a justification there. Rather, I see a specific concrete example of what _*not*_ to do—splitting _3a_—yet the tokenizer seems to do exactly that.

So, I do actually like your answer, but I don’t like the question that goes with it, which seems to be, “Why does the tokenizer do that?” *The question I’m trying to ask is, “Is this what the tokenizer _should_ do?”*

My opinion is obviously that this is not what it should do—but opinions can differ. My reading of the documentation you suggested is _also_ that this is not what the tokenizer should do. I’m willing to accept the possibility that I have read UAX29 and WB10 and the example given there incorrectly, but I’m going to need a little help seeing it. Your previous comments have not provided the elucidation that I seek:

{quote}That's because this tokenizer first divides on scripts{quote}

This explains why it behaves as it does, not why that is the desired behavior.

{quote}You can find more discussions on that in Notes section of [https://unicode.org/reports/tr29/#Word_Boundary_Rules]{quote}

These rules and notes seem to contradict the behavior of the tokenizer.

{quote}I think this tokenizer works behind-the-scenes differently than you imagine{quote}

I believe that I understand what it does—as you said, it divides on scripts—but that doesn’t explain why that is the right thing to do.

{quote}the rules you see don't mean what you might infer{quote}

I infer that _3a_, the example given in the rules, should not be split. If that is the wrong inference, please make some small attempt to explain _why_, rather than implying that I don’t understand, or telling me _what_ the tokenizer does to get this behavior, which seems no less incorrect for being explainable.

I hope we can give this one more go and find a productive consensus on whether the current tokenizer behavior is correct, and if so, some insight into why. Thanks for the time you've put into this discussion.

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
> Reporter: Trey Jones
> Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system before the space is the same as the writing system after the number, then you get two tokens. If the writing systems differ, you get three tokens.
> -If the conditions are just right, the chunking that the ICU tokenizer does (trying to split on spaces to create <4k chunks) can create an artificial boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the unexpected split of the second token (_14th_). Because chunking changes can ripple through a long document, editing text or the effects of a character filter can cause changes in tokenization thousands of lines later in a document.- _(This inconsistency was included as a side issue that I thought might add more weight to the main problem I am concerned with, but it seems to be more of a distraction. Chunking issues should perhaps be addressed in a different ticket, so I'm striking it out.)_
> My guess is that some "previous character set" flag is not reset at the space, and numbers are not in a character set, so _t_ is compared to _ァ_ and they are not the same—causing a token split at the character set change—but I'm not sure.
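For reference, a minimal, hypothetical standalone sketch of how the reported difference can be observed with the ICUTokenizer from Lucene's analysis-icu module (class name and comments are illustrative, not part of the attached patch; exact output may vary with the Lucene and ICU versions, and this assumes the default tokenizer configuration):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IcuTokenizerRepro {

  public static void main(String[] args) throws Exception {
    // Same letter-space-number-letter pattern, different script before the space.
    printTokens("x 14th");   // reported output: x | 14th
    printTokens("ァ 14th");  // reported output: ァ | 14 | th
  }

  static void printTokens(String text) throws Exception {
    // Default configuration: UAX #29 word breaking applied per script run.
    try (ICUTokenizer tokenizer = new ICUTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      StringBuilder tokens = new StringBuilder();
      while (tokenizer.incrementToken()) {
        if (tokens.length() > 0) {
          tokens.append(" | ");
        }
        tokens.append(term);
      }
      tokenizer.end();
      System.out.println(text + "  ->  " + tokens);
    }
  }
}
{code}

Run against an affected version (7.5 in this report); per the description above, the first input is expected to yield two tokens and the second three, which makes the inconsistency easy to compare side by side or to capture in a unit test.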