[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134428#comment-17134428 ]
Jun Ohtani commented on LUCENE-9390:
------------------------------------

I also checked *UniDic* around punctuation characters, since I was working on [https://github.com/apache/lucene-solr/pull/935]:
# words that start with a punctuation character: 606 (222 of them longer than 1 character)
# words made up entirely of punctuation characters: 111
# words that contain punctuation somewhere after the first character: 1780

> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-9390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9390
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> This issue was first raised in Elasticsearch
> [here|https://github.com/elastic/elasticsearch/issues/57614].
> The UniDic dictionary used by the Kuromoji tokenizer contains entries
> that mix punctuation and other characters. For instance, the following entry:
> _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with punctuation are automatically removed by
> default (discardPunctuation is true). I think the code was written this way
> because we expected punctuation to be separated from normal tokens, but there
> are exceptions in the original dictionary. Maybe we should check the entire
> token when discarding punctuation?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
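A minimal sketch of the proposal ("check the entire token rather than its first character"). This is not the actual Lucene patch: `isPunctuation` here is a simplified stand-in based on Unicode general categories, whereas the real JapaneseTokenizer check also inspects Unicode blocks. Under that assumption, an entry like (株) would no longer be discarded, while a token made only of punctuation still would be:

```java
public class PunctuationCheck {

    // Simplified stand-in for the tokenizer's per-character punctuation test
    // (the real implementation in JapaneseTokenizer is more thorough).
    static boolean isPunctuation(char ch) {
        switch (Character.getType(ch)) {
            case Character.SPACE_SEPARATOR:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Proposed behavior: discard a token only when EVERY character is
    // punctuation, so mixed dictionary entries such as "(株)" survive.
    static boolean isAllPunctuation(CharSequence token) {
        if (token.length() == 0) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (!isPunctuation(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "(株)" starts with punctuation but contains the letter 株,
        // so it is kept under the whole-token rule.
        System.out.println(isAllPunctuation("(株)")); // false
        // A token of only fullwidth parentheses is still discarded.
        System.out.println(isAllPunctuation("（）"));  // true
    }
}
```

The current behavior (discard if the first character is punctuation) would drop all three groups Jun Ohtani counted above; the whole-token rule would drop only the 111 all-punctuation words.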