[ https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284030#comment-17284030 ]
Robert Muir commented on LUCENE-9754:
-------------------------------------

Sorry, I think this tokenizer works behind the scenes differently than you imagine: if you want a more pure Unicode-standard tokenizer, then use {{StandardTokenizer}}. {{ICUTokenizer}} differs from {{StandardTokenizer}} in that it tries to track more modern Unicode standards, and as I mentioned in my comment above, it first chunks the input, then divides it on scripts. This lets someone customize how the tokenization works for a particular writing system, and we give options for the tricky ones (e.g. Thai/Lao/Burmese/whatever) that are usable in case the JDK might not be.

With no disrespect intended, the rules you see don't mean what you might infer. You need to go to the notes section :) If we encounter Thai (or similar) text, the ICU dictionary takes care of it. But we also let the end user supply their own rules, in case they want something different.

The differences in this issue just have to do with stupid low-level text buffering, when segmentation usually just needs sentence context, and from the NLP perspective that is typically what it is trained on. So it makes sense to chunk on sentences rather than on fixed-size ranges that devolve to splitting on spaces. That's the issue the base {{SegmentingTokenizerBase}} fixes for its subclasses (e.g. CJK); we should fix it here too. I don't care how good or terrible UAX#29 sentence segmentation is, I want to use it for chunking. If you don't like it, you can optionally provide your own rules that you think are better.

That is how I feel about this from a search engine library perspective.

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9754
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9754
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.5
>        Environment: Tested most recently on Elasticsearch 6.5.4.
>            Reporter: Trey Jones
>            Priority: Major
>        Attachments: LUCENE-9754_prototype.patch
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as
> ァ | 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system
> before the space is the same as the writing system after the number, then you
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does
> (trying to split on spaces to create <4k chunks) can create an artificial
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the
> unexpected split of the second token (_14th_). Because chunking changes can
> ripple through a long document, editing text or the effects of a character
> filter can cause changes in tokenization thousands of lines later in a
> document.
> My guess is that some "previous character set" flag is not reset at the
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and
> they are not the same—causing a token split at the character set change—but
> I'm not sure.
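A quick way to observe the inconsistency described above is to run the two example strings through {{ICUTokenizer}} directly. The sketch below is illustrative only: it assumes the lucene-analyzers-icu module (and its ICU4J dependency) is on the classpath, uses the default tokenizer configuration, and the class name {{IcuTokenizerRepro}} is just a placeholder, not part of the attached patch.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical standalone repro (not part of LUCENE-9754_prototype.patch).
public class IcuTokenizerRepro {

  // Tokenizes the input with a default-configured ICUTokenizer and prints
  // the resulting tokens separated by " | ".
  static void printTokens(String text) throws Exception {
    try (ICUTokenizer tokenizer = new ICUTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      StringBuilder tokens = new StringBuilder();
      while (tokenizer.incrementToken()) {
        if (tokens.length() > 0) {
          tokens.append(" | ");
        }
        tokens.append(term.toString());
      }
      tokenizer.end();
      System.out.println("\"" + text + "\" -> " + tokens);
    }
  }

  public static void main(String[] args) throws Exception {
    // Same trailing "14th"; only the script of the character before the space differs.
    printTokens("x 14th");   // reported as: x | 14th
    printTokens("ァ 14th");  // reported as: ァ | 14 | th
  }
}
{code}

Note that, per the description above, whether the extra split appears can also depend on where the <4k chunk boundary happens to fall, so results on long documents may differ from these short-string examples.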