[ https://issues.apache.org/jira/browse/LUCENE-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237176#comment-17237176 ]
ASF subversion and git services commented on LUCENE-9581:
---------------------------------------------------------
Commit a5d0654a2469c92bf02497e8fd18587058cb1a96 in lucene-solr's branch
refs/heads/master from jimczi
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a5d0654 ]
LUCENE-9581: Japanese tokenizer should discard the compound token instead of
disabling the decomposition of long tokens when discardCompoundToken is
activated.
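
The behavior described in this commit message can be illustrated with a small toy model. This is only a sketch and not Lucene code: the class CompoundTokenSketch, its Mode enum, and the emit method are hypothetical names invented for illustration, and the model ignores positions, position lengths, and the order in which the real tokenizer emits tokens. It assumes a compound such as 関西国際空港 with the decomposition 関西 / 国際 / 空港.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the outputs discussed in LUCENE-9581. Hypothetical, not Lucene code. */
final class CompoundTokenSketch {

    /** Simplified stand-in for the tokenizer modes mentioned in the issue. */
    enum Mode { NORMAL, SEARCH }

    /**
     * Returns the tokens this model would emit for a single compound.
     *
     * @param compound             the long token as NORMAL mode would emit it
     * @param parts                the decomposition of that compound
     * @param mode                 NORMAL never decomposes; SEARCH decomposes
     * @param discardCompoundToken when true in SEARCH mode, drop the compound
     *                             and keep only its parts (behavior after this commit)
     */
    static List<String> emit(String compound, List<String> parts,
                             Mode mode, boolean discardCompoundToken) {
        List<String> out = new ArrayList<>();
        if (mode == Mode.NORMAL) {
            // No decomposition: only the compound is emitted.
            out.add(compound);
            return out;
        }
        if (!discardCompoundToken) {
            // The compound is kept alongside its parts.
            out.add(compound);
        }
        // The decomposition is always kept in SEARCH mode.
        out.addAll(parts);
        return out;
    }
}
```

In this model, SEARCH mode with discardCompoundToken set to true yields only the parts, whereas before the commit the same setting produced the NORMAL-mode output (the compound alone), which is the redundancy the issue below points out.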
> Clarify discardCompoundToken behavior in the JapaneseTokenizer
> --------------------------------------------------------------
>
> Key: LUCENE-9581
> URL: https://issues.apache.org/jira/browse/LUCENE-9581
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-9581.patch, LUCENE-9581.patch, LUCENE-9581.patch
>
>
> At first sight, the discardCompoundToken option added in LUCENE-9123 seems
> redundant with the NORMAL mode of the Japanese tokenizer. When set to true,
> the current behavior is to disable the decomposition of compounds, which is
> exactly what the NORMAL mode does.
> So I wonder whether the right semantics for the option would be to keep only
> the decomposition of the compound, or whether the option is really needed at
> all. If the goal is to make the output compatible with a graph token filter,
> the current workaround of setting the mode to NORMAL should be enough.
> That is consistent with the mode that should be used to preserve positions in
> the index, since we don't handle position length on the indexing side.
> Am I missing something regarding the new option? Is there a compelling case
> where it differs from the NORMAL mode?