Incheonkirin opened a new pull request, #16242: URL: https://github.com/apache/lucene/pull/16242
Addresses #16241

## Summary

Add an opt-in HangulCompositionCharFilter to analysis-nori. The filter composes modern Hangul conjoining-jamo sequences into precomposed Hangul syllables before KoreanTokenizer, so NFD-form Korean text can analyze like the equivalent NFC text while preserving offset correction back to the original input.

The filter is intentionally narrow: it handles only modern L/V/optional-T conjoining jamo sequences and leaves compatibility jamo, archaic jamo, partial sequences, already-precomposed Korean text, and non-Hangul text unchanged.

For general Unicode normalization the ICU module's ICUNormalizer2CharFilter remains the right tool; this covers the common Korean-only case without adding the ICU dependency to nori deployments.

## Tests

- NFD Korean sentence through HangulCompositionCharFilter + KoreanTokenizer matches NFC KoreanTokenizer terms/POS
- offsets from analyzed NFD text map back to the original NFD input
- randomized modern Hangul NFD composition matches NFC
- non-modern and partial jamo sequences unchanged
- already-NFC and no-op inputs unchanged
- precomposed-LV + trailing jamo passthrough (out-of-scope shape unchanged)
- factory registration
- bogus factory arguments
- random analyzer data

## Verification

- ./gradlew :lucene:analysis:nori:tidy
- ./gradlew :lucene:analysis:nori:check
- ./gradlew check
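
For readers unfamiliar with the L/V/optional-T composition described in the Summary, here is a minimal standalone sketch of the arithmetic, using the Hangul syllable composition constants from the Unicode Standard. It is not the code in this PR: the class name `HangulComposeSketch` and the method `compose` are hypothetical, it works on a whole `String` rather than a streaming `Reader`, and it does not track the offset corrections a real CharFilter needs.

```java
import java.text.Normalizer;

public class HangulComposeSketch {
  // Constants from the Unicode Standard's Hangul syllable composition algorithm.
  static final int S_BASE = 0xAC00; // first precomposed syllable
  static final int L_BASE = 0x1100; // first modern leading consonant
  static final int V_BASE = 0x1161; // first modern vowel
  static final int T_BASE = 0x11A7; // one before the first modern trailing consonant
  static final int L_COUNT = 19;
  static final int V_COUNT = 21;
  static final int T_COUNT = 28; // includes the "no trailing consonant" slot 0

  /** Composes modern L V [T] conjoining-jamo runs; every other char passes through. */
  static String compose(String in) {
    StringBuilder out = new StringBuilder(in.length());
    int i = 0;
    while (i < in.length()) {
      char c = in.charAt(i);
      int l = c - L_BASE;
      if (l >= 0 && l < L_COUNT && i + 1 < in.length()) {
        int v = in.charAt(i + 1) - V_BASE;
        if (v >= 0 && v < V_COUNT) {
          int t = 0;
          int consumed = 2;
          if (i + 2 < in.length()) {
            int candidate = in.charAt(i + 2) - T_BASE;
            // Index 0 means "no trailing consonant", so a real T has index 1..27.
            if (candidate > 0 && candidate < T_COUNT) {
              t = candidate;
              consumed = 3;
            }
          }
          out.append((char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t));
          i += consumed;
          continue;
        }
      }
      // Not the start of a modern L+V pair: lone jamo, compatibility jamo,
      // precomposed syllables, and non-Hangul text are copied unchanged.
      out.append(c);
      i++;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String nfc = "\uD55C\uAD6D\uC5B4"; // the three syllables of "Korean language"
    String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);
    System.out.println(nfd.length());              // 8 conjoining jamo
    System.out.println(compose(nfd).equals(nfc));  // true
  }
}
```

Because this sketch only starts a composition at a leading consonant, a precomposed LV syllable followed by a trailing jamo is left as-is, which matches the passthrough case listed under Tests but differs from full NFC.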
