[jira] [Created] (LUCENE-9581) Clarify discardCompoundToken behavior in the JapaneseTokenizer

Jim Ferenczi (Jira) Tue, 20 Oct 2020 03:32:20 -0700

Jim Ferenczi created LUCENE-9581:
------------------------------------

             Summary: Clarify discardCompoundToken behavior in the JapaneseTokenizer
                 Key: LUCENE-9581
                 URL: https://issues.apache.org/jira/browse/LUCENE-9581
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Jim Ferenczi



At first sight, the discardCompoundToken option added in LUCENE-9123 seems 
redundant with the NORMAL mode of the Japanese tokenizer. When set to true, the 
current behavior is to disable the decomposition of compounds, which is exactly 
what the NORMAL mode does.

So I wonder whether the right semantics for the option would be to keep only 
the decomposition of the compound, or whether the option is needed at all. If 
the goal is to make the output compatible with a graph token filter, the 
current workaround of setting the mode to NORMAL should be enough.

NORMAL is also the mode that should be used to preserve positions in the 
index, since we don't handle position length on the indexing side.

Am I missing something about the new option? Is there a compelling case 
where it differs from the NORMAL mode?
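
To make the comparison concrete, here is a toy Java model of the three 
outputs in question: the compound alone, the compound together with its 
parts, and the parts alone. It is only a sketch. It does not call Lucene or 
segment real text, and the class name, the enum, the romanized terms, and 
the position values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: it does not use Lucene and does not segment real text.
final class CompoundTokenSketch {

    // The three candidate outputs discussed above (names are hypothetical).
    enum Output { COMPOUND_ONLY, COMPOUND_AND_PARTS, PARTS_ONLY }

    // Renders each emitted token as "term[posInc=..,posLen=..]".
    static List<String> emit(String compound, List<String> parts, Output output) {
        List<String> tokens = new ArrayList<>();
        if (output != Output.PARTS_ONLY) {
            // Assumption: next to its parts, the compound spans all of them.
            int posLen = (output == Output.COMPOUND_AND_PARTS) ? parts.size() : 1;
            tokens.add(format(compound, 1, posLen));
        }
        if (output != Output.COMPOUND_ONLY) {
            for (int i = 0; i < parts.size(); i++) {
                // Assumption: the first part shares the compound's position
                // whenever the compound is emitted as well.
                boolean stacked = (i == 0 && output == Output.COMPOUND_AND_PARTS);
                tokens.add(format(parts.get(i), stacked ? 0 : 1, 1));
            }
        }
        return tokens;
    }

    private static String format(String term, int posInc, int posLen) {
        return String.format("%s[posInc=%d,posLen=%d]", term, posInc, posLen);
    }

    public static void main(String[] args) {
        String compound = "kansaikokusaikuukou"; // romanized stand-in term
        List<String> parts = Arrays.asList("kansai", "kokusai", "kuukou");
        for (Output output : Output.values()) {
            System.out.println(output + " -> " + emit(compound, parts, output));
        }
    }
}
```

In this model only COMPOUND_AND_PARTS produces a position length greater 
than one, which is what an indexing side that ignores position length cannot 
represent. COMPOUND_ONLY stands for the behavior described above for both 
NORMAL mode and discardCompoundToken set to true, and PARTS_ONLY stands for 
the alternative semantics of keeping only the decomposition.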



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
