Hi all,

Following the advice at https://issues.apache.org/jira, I'm describing my situation here before creating an issue.
The short version is that the ICU tokenizer can split tokens differently after a space depending on what comes *before* the space. For example, *x 14th* is tokenized as x | 14th, while *ァ 14th* is tokenized as ァ | 14 | th. The generalization is that if the writing system before the space is the same as the writing system after the number, you get two tokens; if the writing systems differ, you get three.

The twist: the tokenizer breaks incoming text into ~4k chunks at whitespace, so changing unrelated text thousands of characters away can shift a chunk boundary, which can in turn change the tokenization. I've reproduced this both by editing the text to remove a few characters, and by adding a char filter that can delete characters before tokenization, both of which shift the chunk boundary.

I originally reported this as a bug in Elasticsearch <https://github.com/elastic/elasticsearch/issues/27290>, where I included details on my system and steps to reproduce the problem. That ticket has been open for a few years; when the problem cropped up again recently, I looked into it and realized it is probably an upstream problem, so I wanted to open an issue for Lucene.

Is this a known issue, or should I create a new ticket?

Thanks!
Trey

Trey Jones
Sr. Computational Linguist, Search Platform
Wikimedia Foundation
UTC-5 / EST
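P.S. For anyone who wants to see the chunking interaction concretely, here's a rough Python sketch of the kind of whitespace-bounded chunking described above. This is *not* the actual ICUTokenizer code; the 4096-char chunk size and the "break at the last whitespace before the limit" rule are assumptions standing in for whatever the real implementation does:

```python
# Sketch: greedily split text into chunks of at most CHUNK chars,
# breaking at the last whitespace before the limit when possible
# (an approximation of the chunking behavior described above).
CHUNK = 4096

def chunk_at_whitespace(text, size=CHUNK):
    chunks = []
    i = 0
    while i < len(text):
        if len(text) - i <= size:
            chunks.append(text[i:])
            break
        j = text.rfind(' ', i, i + size)
        if j <= i:            # no whitespace in range: hard break
            j = i + size
        chunks.append(text[i:j])
        i = j + 1             # skip the space we broke at
    return chunks

# An edit thousands of characters upstream changes the chunk that
# contains the target text, so the tokenizer sees it with different
# surrounding context:
filler = 'word ' * 1000                        # ~5000 chars of padding
tail = 'ァ 14th'
a = chunk_at_whitespace(filler + tail)
b = chunk_at_whitespace(filler[:-5] + tail)    # delete 5 chars upstream
# The final chunk still ends with 'ァ 14th' in both versions, but its
# contents differ, even though nothing near the target was edited.
```

Under these assumed rules, `a[-1]` and `b[-1]` both end with `ァ 14th` but are different strings, which is the shape of the problem: a distant edit changes what the tokenizer sees around an unedited span.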