maomao905 opened a new issue, #11976: URL: https://github.com/apache/lucene/issues/11976
### Description

This issue comes from https://github.com/elastic/elasticsearch/issues/50008.

When tokenizing text that contains a combining character (e.g. `㋀`) after applying the `icu_normalizer` char filter, the end offset of the combining character is not incremented correctly.

The following test, which I added to [TestICUNormalizer2CharFilter](https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUNormalizer2CharFilter.java), fails:

```java
public void testTokenStreamCombiningCharacter() throws IOException {
  String input = "日日㋀日"; // ㋀ is the combining character
  CharFilter reader =
      new ICUNormalizer2CharFilter(
          new StringReader(input),
          Normalizer2.getInstance(null, "nfkc_cf", Normalizer2.Mode.COMPOSE));

  Tokenizer tokenStream =
      new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));
  tokenStream.setReader(reader);

  assertTokenStreamContents(
      tokenStream,
      new String[] {"日", "日", "1", "月", "日"},
      // start offsets: the test passes if changed to {0, 1, 2, 2, 3}
      new int[] {0, 1, 2, 3, 4},
      // end offsets: the test passes if changed to {1, 2, 2, 3, 4}
      // (the end offset for the term `1` is not incremented)
      new int[] {1, 2, 3, 4, 5},
      input.length());
}
```

```
$ ./gradlew test --tests org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter.testTokenStreamCombiningCharacter

org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter > testTokenStreamCombiningCharacter FAILED
    java.lang.AssertionError: endOffset 2 term=1 expected:<3> but was:<2>
```

### Version and environment details

- macOS 12.3.1
- openjdk 17.0.5
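As background for the offsets above, here is a minimal sketch of why the char filter has to correct offsets at all. It assumes only the JDK's `java.text.Normalizer` with plain NFKC, so it omits the case folding that `nfkc_cf` adds and does not exercise `ICUNormalizer2CharFilter` itself. The class name `NfkcExpansionDemo` is hypothetical. The input is written with Unicode escapes: `\u65E5` is `日`, `\u32C0` is `㋀`, and `\u6708` is `月`.

```java
import java.text.Normalizer;

class NfkcExpansionDemo {
  // Plain NFKC from the JDK; an assumption standing in for the ICU "nfkc_cf" normalizer.
  static String nfkc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKC);
  }

  public static void main(String[] args) {
    String input = "\u65E5\u65E5\u32C0\u65E5"; // same text as the test input
    String normalized = nfkc(input);

    // U+32C0 expands to the two characters "1" and U+6708,
    // so the normalized text is one char longer than the original.
    System.out.println(input.length() + " -> " + normalized.length()); // prints 4 -> 5
  }
}
```

Because one input character becomes two output characters, every position at or after the expansion in the normalized text has to be mapped back to a position in the original 4-character input. That mapping is what the reported end offset for the term `1` depends on.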