Hello Lucene Developers, I am writing to propose an enhancement to the nori (Korean analysis) module regarding the handling of decimal points in numeric tokens.
Currently, the *KoreanTokenizer* in Nori splits numbers containing decimal points into multiple tokens. For example, the phrase *"10.1인치 모니터"* ("10.1-inch monitor") is tokenized as:

*["10", ".", "1", "인치", "모니터"]*

This behavior makes it difficult to search for specific numeric values or measurements. While KoreanNumberFilter exists, it has clear limitations in handling these cases, because the split already happens during the initial tokenization phase.

I have developed a patch (attached as keep_decimal_point.patch) that introduces a keepDecimalPoint configuration option to KoreanTokenizer.

Since kuromoji (Japanese analysis) shares a similar Viterbi-based architecture and exhibits the same behavior with decimal points, similar logic could be applied to the kuromoji module to improve consistency across the East Asian language analyzers.

I would greatly appreciate it if the maintainers and developers using Nori or Kuromoji could review this patch. I am open to any feedback or suggestions on the implementation.

Best regards,
SOMANG LEE

--
LEE SOMANG
Senior Assistant, Backend Cell
The Chosunilbo, 30, Sejong-daero 21-gil, Jung-gu, Seoul, Korea (zip code: 04519)
M : +82 10-8940-5081
E : [email protected]
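P.S. To illustrate the intended effect of the option, here is a minimal, self-contained sketch of the kind of merge a keepDecimalPoint mode would perform on the token stream. This is only an illustration of the desired behavior, not code from the attached patch; the class and method names are hypothetical, and the actual patch works inside the tokenizer's Viterbi phase rather than on a token list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: merge [digits, ".", digits] runs into a
// single decimal token, mirroring what keepDecimalPoint aims to produce.
public class DecimalTokenMergeSketch {

    // True if the token consists only of decimal digits.
    private static boolean isDigits(String s) {
        if (s.isEmpty()) return false;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) return false;
        }
        return true;
    }

    // Rewrites ["10", ".", "1", "인치"] as ["10.1", "인치"]; other
    // tokens pass through unchanged.
    public static List<String> mergeDecimalTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 2 < tokens.size()
                    && isDigits(tokens.get(i))
                    && ".".equals(tokens.get(i + 1))
                    && isDigits(tokens.get(i + 2))) {
                out.add(tokens.get(i) + "." + tokens.get(i + 2));
                i += 3;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

With this sketch, the example from above, ["10", ".", "1", "인치", "모니터"], comes back as ["10.1", "인치", "모니터"], which is the behavior the patch enables at tokenization time.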
keep_decimal_point.patch
Description: Binary data
