[jira] [Commented] (LUCENE-9581) Clarify discardCompoundToken behavior in the JapaneseTokenizer

Jim Ferenczi (Jira) Tue, 20 Oct 2020 08:19:14 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217671#comment-17217671
 ]


Jim Ferenczi commented on LUCENE-9581:
--------------------------------------

OK, I missed the fact that we always apply the penalty on long tokens in SEARCH 
mode. That means we rely on the penalty to ensure that we don't output the 
longer paths. I think this is misleading, though. For instance, the input 
北海道日本ハムファイターズ is tokenized completely differently depending on the value of 
discardCompoundToken.

When set to true, the output is:
北海道, 日本ハム, ファイターズ

When set to false, the output is:
北海道, 日本, ハムファイターズ

I would expect the same segmentation in this case, whatever the value of 
discardCompoundToken is. I understand that for search purposes the shorter 
token is better, but only if we detected that the alternative path is better. 
That is what the tokenizer does when discardCompoundToken is false, so I 
would expect that we only discard the compound if the alternative path is 
better.
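
To illustrate how a length penalty alone can move the best path, here is a 
minimal, self-contained sketch. It is not Lucene's code: the class name 
PenaltySketch, the word costs, the length threshold and the penalty value are 
all hypothetical, and romanized stand-ins replace the Japanese tokens so the 
lengths are easy to count. It runs a plain minimum-cost (Viterbi-style) 
segmentation twice, once without and once with a penalty on long tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PenaltySketch {
    // Hypothetical word costs; these are NOT values from Lucene's dictionary.
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        DICT.put("hokkaido", 5);
        DICT.put("nippon", 4);
        DICT.put("nipponham", 8);
        DICT.put("ham", 6);
        DICT.put("fighters", 6);
        DICT.put("hamfighters", 7);
    }

    // Hypothetical threshold: tokens longer than this count as "long".
    static final int LONG_TOKEN_LENGTH = 9;

    // Returns the minimum-cost segmentation of text. Each character beyond
    // LONG_TOKEN_LENGTH adds penaltyPerExtraChar to the cost of a token.
    static List<String> segment(String text, int penaltyPerExtraChar) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i]: cheapest cost to reach offset i
        int[] prev = new int[n + 1]; // prev[i]: start offset of the last token
        Arrays.fill(best, Integer.MAX_VALUE);
        Arrays.fill(prev, -1);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue; // offset not reachable
                }
                Integer wordCost = DICT.get(text.substring(start, end));
                if (wordCost == null) {
                    continue; // not a known word
                }
                int extra = Math.max(0, (end - start) - LONG_TOKEN_LENGTH);
                int cost = best[start] + wordCost + extra * penaltyPerExtraChar;
                if (cost < best[end]) {
                    best[end] = cost;
                    prev[end] = start;
                }
            }
        }
        if (best[n] == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("no segmentation found");
        }
        // Walk the back pointers from the end, then restore reading order.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(text.substring(prev[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "hokkaidonipponhamfighters";
        System.out.println(segment(text, 0));  // [hokkaido, nippon, hamfighters]
        System.out.println(segment(text, 10)); // [hokkaido, nipponham, fighters]
    }
}
```

With these made-up costs the unpenalized best path keeps the long token, 
while the penalized path splits at a different boundary, so the two runs 
disagree on where the middle token ends. That is the kind of divergence 
described above; whether Lucene's actual costs produce it for the same 
reason is an assumption of this sketch.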



> Clarify discardCompoundToken behavior in the JapaneseTokenizer
> --------------------------------------------------------------
>
>                 Key: LUCENE-9581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9581
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> At first sight, the discardCompoundToken option added in LUCENE-9123 seems 
> redundant with the NORMAL mode of the Japanese tokenizer. When set to true, 
> the current behavior is to disable the decomposition of compounds, which is 
> exactly what the NORMAL mode does.
> So I wonder whether the right semantics for the option would be to keep only 
> the decomposition of the compound, or whether the option is really needed. If 
> the goal is to make the output compatible with a graph token filter, the 
> current workaround of setting the mode to NORMAL should be enough.
> That's consistent with the mode that should be used to preserve positions in 
> the index, since we don't handle position length on the indexing side. 
> Am I missing something regarding the new option? Is there a compelling case 
> where it differs from the NORMAL mode?



