maomao905 opened a new issue, #11976:
URL: https://github.com/apache/lucene/issues/11976

   ### Description
   
   This issue comes from https://github.com/elastic/elasticsearch/issues/50008.
   When tokenizing combining characters (e.g. `㋀`) after applying the `icu_normalizer` char filter, the end offset for the combining character is not incremented correctly.
   
   The following test, which I added to [TestICUNormalizer2CharFilter](https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUNormalizer2CharFilter.java), fails:
   ```java
   public void testTokenStreamCombiningCharacter() throws IOException {
     String input = "日日㋀日"; // ㋀ is the combining character
     CharFilter reader =
         new ICUNormalizer2CharFilter(
             new StringReader(input),
             Normalizer2.getInstance(null, "nfkc_cf", Normalizer2.Mode.COMPOSE));
   
     Tokenizer tokenStream =
         new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));
     tokenStream.setReader(reader);
   
     assertTokenStreamContents(
         tokenStream,
         new String[] {"日", "日", "1", "月", "日"},
         new int[] {0, 1, 2, 3, 4}, // test passes if changed to {0, 1, 2, 2, 3}
         new int[] {1, 2, 3, 4, 5}, // test passes if changed to {1, 2, 2, 3, 4} (end offset for the term `1` is not incremented)
         input.length());
   }
   ```
   ```
   $ ./gradlew test --tests org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter.testTokenStreamCombiningCharacter
   org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter > testTokenStreamCombiningCharacter FAILED
       java.lang.AssertionError: endOffset 2 term=1 expected:<3> but was:<2>
   ```
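
For context, here is a minimal sketch of why the offsets can drift. It uses only the JDK's `java.text.Normalizer` with plain NFKC, which is an assumption made to keep the example dependency-free; it is not the ICU `nfkc_cf` normalizer used in the test above, and the class name `NfkcExpansionDemo` is hypothetical. It only shows that normalization expands `㋀` (U+32C0) into the two characters `1` and `月`, so positions in the filtered text no longer map one-to-one onto the original input.

```java
import java.text.Normalizer;

public class NfkcExpansionDemo {
    // Applies plain NFKC (no case folding) using the JDK normalizer.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Unicode escapes avoid any dependence on source file encoding.
        String input = "\u65E5\u65E5\u32C0\u65E5"; // 日日㋀日
        String normalized = nfkc(input);           // 日日1月日

        // One input char (U+32C0) becomes two output chars (U+0031, U+6708).
        System.out.println(input.length() + " -> " + normalized.length()); // prints "4 -> 5"
        System.out.println(normalized);
    }
}
```

Because the normalized text is one character longer than the input, any offset reported against the normalized text has to be corrected back to the original text by the char filter.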
   
   ### Version and environment details
   
   - macOS 12.3.1
   - openjdk 17.0.5



