[ https://issues.apache.org/jira/browse/LUCENE-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219808#comment-17219808 ]

Kazuaki Hiraga commented on LUCENE-9581:
----------------------------------------

Thank you for your input, [~jimczi].  My patch was only meant to show a very 
simple short-term fix for the issue (and I was also trying to recall the earlier 
discussion about the somewhat confusing options, which is a separate story from 
this issue).

The reason I modified the minimal length for the penalty is that it is a 
similar idea to how MeCab lets you configure unknown-word processing for known 
words (in MeCab, not only Kanji characters but other character classes can be 
targeted, and this is configurable through a configuration file). I also think 
`>=` is better in some cases, but that could become a separate discussion. 
Anyway, this is a different story, and I think your approach is appropriate for 
resolving the issue, so I agree with it.

 
{quote}I am also unsure that we should make discardCompoundToken true by 
default in Lucene 9
{quote}
As we discussed in LUCENE-9123, we want to change the default behavior of the 
current search mode so that the tokenization results are the same as with 
`discardCompoundToken=true`. 

If I understand correctly, the outcome of that discussion is: 1) in Lucene 9, 
search mode will no longer return the compound tokens along with the decomposed 
tokens (the tokenizer won't return the compound tokens unless 
`discardCompoundToken=false` is explicitly specified), and 2) in Lucene 10(?), 
the normal mode and search mode will be merged to return only the decomposed 
tokens, and the mode and its related parameters will be removed. Any opinions / 
suggestions?

 

> Clarify discardCompoundToken behavior in the JapaneseTokenizer
> --------------------------------------------------------------
>
>                 Key: LUCENE-9581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9581
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-9581.patch, LUCENE-9581.patch
>
>
> At first sight, the discardCompoundToken option added in LUCENE-9123 seems 
> redundant with the NORMAL mode of the Japanese tokenizer. When set to true, 
> the current behavior is to disable the decomposition for compounds, that's 
> exactly what the NORMAL mode does.
> So I wonder if the right semantic of the option would be to keep only the 
> decomposition of the compound or if it's really needed. If the goal is to 
> make the output compatible with a graph token filter, the current workaround 
> to set the mode to NORMAL should be enough.
> That's consistent with the mode that should be used to preserve positions in 
> the index since we don't handle position length on the indexing side. 
> Am I missing something regarding the new option? Is there a compelling case 
> where it differs from the NORMAL mode?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
