Hello Lucene Developers,
I am writing to propose an enhancement to the nori (Korean analysis) module
regarding the handling of decimal points in numeric tokens.

Currently, the *KoreanTokenizer* in Nori splits numbers with decimal points
into multiple tokens. For example, the phrase *"10.1인치 모니터"* is tokenized
as:

*["10", ".", "1", "인치", "모니터"]*

This behavior makes it difficult to search for specific numeric values or
measurements. While KoreanNumberFilter can recombine some of these tokens
downstream, it has clear limitations because the split has already occurred
during the initial tokenization phase.
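To illustrate the intended behavior, here is a minimal, self-contained sketch
(plain Java, *not* Nori's actual Viterbi-based tokenizer; the class name and
scanning logic are illustrative only). With the flag off it reproduces the
current split; with the flag on it keeps "10.1" as a single token when the
dot sits between digits:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative scanner only; Nori's real tokenizer is Viterbi-based and
// far more sophisticated. This just demonstrates the proposed grouping.
public class DecimalAwareScanner {

    static List<String> tokenize(String text, boolean keepDecimalPoint) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
                // keepDecimalPoint: absorb ".ddd" only when a digit follows the dot
                if (keepDecimalPoint && i + 1 < text.length()
                        && text.charAt(i) == '.' && Character.isDigit(text.charAt(i + 1))) {
                    i++; // consume '.'
                    while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
                }
                tokens.add(text.substring(start, i));
            } else if (c == '.') {
                tokens.add(".");
                i++;
            } else {
                // lump any other run (e.g. a Hangul word) into one token for this sketch
                int start = i;
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))
                        && !Character.isDigit(text.charAt(i)) && text.charAt(i) != '.') i++;
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }
}
```

With keepDecimalPoint=false the sketch yields ["10", ".", "1", "인치",
"모니터"], matching today's output; with keepDecimalPoint=true it yields
["10.1", "인치", "모니터"], which is what the patch aims for.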


I have developed a patch (attached as keep_decimal_point.patch) that
introduces a keepDecimalPoint configuration option to KoreanTokenizer.

Since kuromoji (Japanese analysis) shares a similar Viterbi-based
architecture and exhibits the same behavior regarding decimal points, I
believe similar logic could be applied to the kuromoji module to improve
consistency across East Asian language analyzers.

I would greatly appreciate it if the maintainers and developers using Nori
or Kuromoji could review this patch. I am open to any feedback or
suggestions on the implementation.

Best regards,

SOMANG LEE


-- 

LEE SOMANG
Senior Assistant, Backend Cell

The Chosunilbo, 30, Sejong-daero 21-gil, Jung-gu, Seoul, Korea (zip code:
04519)
M : +82 10-8940-5081
E : [email protected]

Attachment: keep_decimal_point.patch

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]