[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character

Jun Ohtani (Jira) Fri, 12 Jun 2020 07:46:10 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134249#comment-17134249
 ]


Jun Ohtani edited comment on LUCENE-9390 at 6/12/20, 2:45 PM:
--------------------------------------------------------------

I counted three types of words in the ipadic CSV files:
 # Words that start with a punctuation character: 101 words, of which only 4 are longer than 1 character
 # Words that consist entirely of punctuation characters: 3 words
 # Words that contain punctuation somewhere other than the first character: 723 words

I counted no. 3 only out of curiosity.

Reference: word list.

 [https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd]

The 4 words longer than 1 character that start with punctuation are:
（社）
（財）
（有）
（株）

The all-punctuation words are:

——
 −−
 ──
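
The counting described above could be sketched roughly as follows. This is only an illustration, not the script that produced the numbers in this comment: the function names and bucket labels are hypothetical, and "punctuation" is assumed to mean the Unicode general categories P*, S* and Z*, which may not match the tokenizer's own check. Column 0 of each CSV row is taken to be the surface form, as in the entry quoted in the issue description.

```python
import csv
import io
import unicodedata
from collections import Counter


def is_punctuation(ch: str) -> bool:
    # Assumption: "punctuation" means Unicode general categories
    # P* (punctuation), S* (symbols) and Z* (separators).
    return unicodedata.category(ch)[0] in "PSZ"


def classify(surface: str) -> str:
    """Place a non-empty surface form into one of four hypothetical buckets."""
    flags = [is_punctuation(ch) for ch in surface]
    if all(flags):
        return "all_punctuation"
    if flags[0]:
        return "starts_with_punctuation"
    if any(flags):
        return "punctuation_after_first_char"
    return "no_punctuation"


def count_buckets(csv_text: str) -> Counter:
    """Count buckets over dictionary rows; column 0 is the surface form."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0]:
            counts[classify(row[0])] += 1
    return counts


# The first row is the entry quoted in the issue description; the other
# rows are made-up placeholders that only carry a surface form.
sample = (
    "（株）,1285,1285,3690,名詞,一般,*,*,*,*,（株）,カブシキガイシャ,カブシキガイシャ\n"
    "——,0,0,0\n"
    "会社,0,0,0\n"
)
print(count_buckets(sample))
```

In this sketch an all-punctuation word is counted in its own bucket and not also under "starts with punctuation", so the totals it produces on the real files may be split differently from the figures above.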
  


was (Author: jun_o):
I counted three types of words in the ipadic CSV files:
 # Words that start with a punctuation character: 104 words, of which only 7 are longer than 1 character
 # Words that consist entirely of punctuation characters: 0 words
 # Words that contain punctuation somewhere other than the first character: 723 words

Word list.
 [https://gist.github.com/johtani/50aa2776a385c5c8dfa3a0d1e4e268cd]



7 words are below:
——
−−
──
（社）
（財）
（有）
（株）
 

> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-9390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9390
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> This issue was first raised in Elasticsearch 
> [here|https://github.com/elastic/elasticsearch/issues/57614]
> The unidic dictionary that is used by the Kuromoji tokenizer contains entries 
> that mix punctuation and other characters. For instance, the following entry:
> _（株）,1285,1285,3690,名詞,一般,*,*,*,*,（株）,カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with punctuation are removed by default 
> (discardPunctuation is true). I think the code was written this way because 
> we expect punctuation to be separated from normal tokens, but there are 
> exceptions in the original dictionary. Maybe we should check the entire 
> token when discarding punctuation?
>  
>  
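
The difference between the current rule and the suggestion in the description (check the entire token) can be illustrated with a small sketch. This is not Lucene's code: the function names are hypothetical, and punctuation is again assumed to mean the Unicode categories P*, S* and Z*.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    # Assumption: P*/S*/Z* Unicode categories; the tokenizer's real check may differ.
    return unicodedata.category(ch)[0] in "PSZ"


def discard_if_starts_with_punctuation(token: str) -> bool:
    # Behaviour described in the issue: only the first character is checked.
    return is_punctuation(token[0])


def discard_if_all_punctuation(token: str) -> bool:
    # Suggested alternative: discard only if every character is punctuation.
    return all(is_punctuation(ch) for ch in token)


tokens = ["（株）", "会社", "、", "——"]
kept_first_char_rule = [t for t in tokens if not discard_if_starts_with_punctuation(t)]
kept_whole_token_rule = [t for t in tokens if not discard_if_all_punctuation(t)]

print(kept_first_char_rule)
print(kept_whole_token_rule)
```

Under these assumptions the first-character rule drops （株） together with the pure punctuation tokens, while the whole-token rule keeps it and still drops 、 and ——.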



