Hi all,

Following the advice at https://issues.apache.org/jira I'm explaining my
situation here before creating an issue.

The short version is that the ICU tokenizer can split tokens differently
after a space depending on what comes *before* the space. For example, *x
14th* is tokenized as x | 14th; *ァ 14th* is tokenized as ァ | 14 | th. The
generalization is that if the writing system before the space is the same
as the writing system after the number, then you get two tokens. If the
writing systems differ, you get three tokens.
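To make the generalization concrete, here is a toy model of the behavior I'm describing—not ICU's actual algorithm, just a pure-Python sketch that uses the first word of each character's Unicode name as a crude stand-in for its script property, and applies the same-script-keeps-the-suffix rule I observed:

```python
import unicodedata

def script(ch):
    # Crude script bucket: first word of the Unicode character name,
    # e.g. 'LATIN', 'KATAKANA'. A toy stand-in for ICU's script property.
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return 'UNKNOWN'

def toy_tokenize(text):
    """Toy model of the reported behavior: a run like '14th' stays one
    token only if the script before the preceding space matches the
    script of the letters after the digits."""
    tokens = []
    prev_script = None
    for word in text.split():
        if word[0].isdigit():
            i = 0
            while i < len(word) and word[i].isdigit():
                i += 1
            num, tail = word[:i], word[i:]  # e.g. '14', 'th'
            if tail and prev_script is not None and prev_script != script(tail[0]):
                tokens += [num, tail]   # scripts differ: number and suffix split
            else:
                tokens.append(word)     # same script: '14th' stays together
            if tail:
                prev_script = script(tail[-1])
        else:
            tokens.append(word)
            prev_script = script(word[-1])
    return tokens
```

With this model, toy_tokenize("x 14th") gives two tokens (x | 14th) and toy_tokenize("ァ 14th") gives three (ァ | 14 | th), matching what I see from the real tokenizer.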

The twist: chunking in the tokenizer breaks incoming text into ~4k-character
chunks at whitespace, so changing unrelated text thousands of characters away
can shift the chunk boundaries, which in turn can change the tokenization.
I've gotten this to happen both by removing a few characters from the text
itself, and by adding a char filter that deletes characters before
tokenization—either way, the chunking boundary shifts.
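To illustrate why a distant edit matters, here is a simplified sketch of whitespace chunking—the real tokenizer's buffering surely differs in detail, and the 4096 limit here is just illustrative—showing how removing a single character far upstream decides whether "ァ" and "14th" land in the same chunk:

```python
def chunk_at_whitespace(text, limit=4096):
    # Sketch only: cut into pieces of at most `limit` chars, preferring
    # the last space before the limit, like the tokenizer's chunking.
    chunks = []
    while len(text) > limit:
        cut = text.rfind(' ', 0, limit)
        if cut == -1:
            cut = limit
        chunks.append(text[:cut])
        text = text[cut:].lstrip(' ')
    chunks.append(text)
    return chunks

# 'ァ 14th' sits right at the boundary of a ~4k text:
long_text = 'a' * 4090 + ' ァ 14th'
# → two chunks; '14th' starts a new chunk with no katakana before it
shorter   = 'a' * 4089 + ' ァ 14th'   # one unrelated char removed upstream
# → one chunk; 'ァ 14th' is seen together, so tokenization can differ
```

So whether the tokenizer sees "ァ 14th" as one unit depends on a character thousands of positions earlier—which is why editing unrelated text (or adding a character-deleting char filter) changes the output.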

I originally reported this as a bug in Elasticsearch
<https://github.com/elastic/elasticsearch/issues/27290>, where I have
included details on my system and steps to reproduce the problem. That
ticket has been open for a few years; when the problem cropped up again
recently and I looked into it, I realized it is probably an upstream
problem, so I wanted to open an issue for Lucene.

Is this a known issue, or should I create a new ticket?

Thanks!
—Trey

Trey Jones
Sr. Computational Linguist, Search Platform
Wikimedia Foundation
UTC-5 / EST
