[
https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125542#comment-17125542
]
Tomoko Uchida commented on LUCENE-9390:
---------------------------------------
Personally, I usually set the "discardPunctuation" flag to false to avoid such
subtle situations.
As a possible solution, instead of the "discardPunctuation" flag we could add a
token filter that removes all tokens composed only of punctuation characters
after tokenization (just like the stop filter)? To me, this is a token
filter's job rather than a tokenizer's...
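Such a filter would need to test whether *every* character of a token is punctuation, not just the first one. A minimal sketch of that check in plain Java (not Lucene's actual implementation; the Unicode categories listed here are an assumption about what the tokenizer treats as punctuation):

```java
public class PunctuationCheck {

    // Returns true only if the whole token consists of punctuation/symbol
    // characters, so a mixed token like "(株)" would be kept.
    static boolean isPunctuationOnly(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (!isPunctuationChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Hypothetical per-character check based on Unicode general categories.
    static boolean isPunctuationChar(int cp) {
        switch (Character.getType(cp)) {
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPunctuationOnly("、"));    // true: punctuation only
        System.out.println(isPunctuationOnly("(株)")); // false: contains 株
        System.out.println(isPunctuationOnly("株"));    // false
    }
}
```

Wrapped in a FilteringTokenFilter-style subclass, this whole-token check would drop pure-punctuation tokens while preserving dictionary entries like "(株)" that merely start with punctuation.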
> Kuromoji tokenizer discards tokens if they start with a punctuation character
> -----------------------------------------------------------------------------
>
> Key: LUCENE-9390
> URL: https://issues.apache.org/jira/browse/LUCENE-9390
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
>
> This issue was first raised in Elasticsearch
> [here|https://github.com/elastic/elasticsearch/issues/57614]
> The unidic dictionary that is used by the Kuromoji tokenizer contains entries
> that mix punctuations and other characters. For instance the following entry:
> _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_
> can be found in the Noun.csv file.
> Today, tokens that start with a punctuation character are automatically
> removed by default (discardPunctuation is true). I think the code was written
> this way because we expect punctuation to be separated from normal tokens,
> but there are exceptions in the original dictionary. Maybe we should check
> the entire token when discarding punctuation?
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)