[ 
https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134428#comment-17134428
 ] 

Jun Ohtani commented on LUCENE-9390:
------------------------------------

I also checked *UniDic* for punctuation characters, because I was working on 
[https://github.com/apache/lucene-solr/pull/935].
 # words that start with a punctuation character: 606 (222 of them are longer than 1 character)
 # words that consist entirely of punctuation characters: 111
 # words that contain punctuation somewhere after the first character: 1780
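The distinction behind these counts can be sketched as follows. This is a hypothetical illustration, not the actual Lucene code: it assumes a Unicode-category-based punctuation test similar in spirit to the tokenizer's, and contrasts discarding on the first character with discarding only fully-punctuation tokens.

```java
public class PunctuationCheck {
    // Assumption: punctuation defined via Unicode general categories,
    // roughly in the spirit of the Kuromoji tokenizer's internal check.
    static boolean isPunctuation(char ch) {
        switch (Character.getType(ch)) {
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
            case Character.SPACE_SEPARATOR:
                return true;
            default:
                return false;
        }
    }

    // Current behavior (roughly): discard if the FIRST character is punctuation,
    // which also drops mixed entries such as (株).
    static boolean discardByFirstChar(String token) {
        return !token.isEmpty() && isPunctuation(token.charAt(0));
    }

    // Proposed behavior: discard only if EVERY character is punctuation,
    // so mixed entries survive.
    static boolean discardWholeToken(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (!isPunctuation(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String mixed = "(株)"; // dictionary entry mixing punctuation and kanji
        System.out.println(discardByFirstChar(mixed)); // discarded today
        System.out.println(discardWholeToken(mixed));  // kept under the proposal
    }
}
```

With this check, a token like "、" (all punctuation) would still be discarded, while "(株)" would be kept.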

> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-9390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9390
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> This issue was first raised in Elasticsearch 
> [here|https://github.com/elastic/elasticsearch/issues/57614]
> The unidic dictionary that is used by the Kuromoji tokenizer contains entries 
> that mix punctuations and other characters. For instance the following entry:
> _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with punctuation are automatically removed by 
> default (discardPunctuation is true). I think the code was written this way 
> because we expect punctuation to be separated from normal tokens, but there 
> are exceptions in the original dictionary. Maybe we should check the entire 
> token when discarding punctuation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
