[jira] [Commented] (LUCENE-9581) Clarify discardCompoundToken behavior in the JapaneseTokenizer

Jun Ohtani (Jira) Tue, 20 Oct 2020 06:53:14 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217617#comment-17217617
 ]


Jun Ohtani commented on LUCENE-9581:
------------------------------------

Hi Jim,

NORMAL mode doesn't produce the same tokens as SEARCH mode with the 
discardCompoundToken option enabled.

For example, for "株式会社":

Normal mode : "株式会社"

Search mode : "株式", "株式会社", "会社"

Search mode + discard_compound_token: "株式", "会社"

 

For search purposes, the shorter tokens are better than the longer compound 
token, because a search for "会社" can then match "株式会社".
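
A minimal sketch of that point, using only the token lists from the example above. It does not call Lucene: the class name, the configuration labels, and the matches helper are hypothetical, and the lookup is a toy exact-term match rather than a real index.

```java
import java.util.List;
import java.util.Map;

public class CompoundTokenSketch {
    // Tokens emitted for "株式会社" under each configuration,
    // copied from the example in this comment (not computed by Lucene).
    static final Map<String, List<String>> TOKENS = Map.of(
        "normal", List.of("株式会社"),
        "search", List.of("株式", "株式会社", "会社"),
        "search+discard", List.of("株式", "会社"));

    // Toy exact-term match: true if the query term is one of the tokens
    // that the given configuration would have put in the index.
    static boolean matches(String config, String queryTerm) {
        return TOKENS.get(config).contains(queryTerm);
    }

    public static void main(String[] args) {
        for (String config : List.of("normal", "search", "search+discard")) {
            System.out.println(config + " matches 会社: " + matches(config, "会社"));
        }
    }
}
```

With NORMAL mode the only indexed term is the whole compound, so the term "会社" finds nothing. Both SEARCH variants index "会社" as a term of its own.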

> Clarify discardCompoundToken behavior in the JapaneseTokenizer
> --------------------------------------------------------------
>
>                 Key: LUCENE-9581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9581
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> At first sight, the discardCompoundToken option added in LUCENE-9123 seems 
> redundant with the NORMAL mode of the Japanese tokenizer. When set to true, 
> the current behavior is to disable the decomposition of compounds, which is 
> exactly what the NORMAL mode does.
> So I wonder if the right semantics of the option would be to keep only the 
> decomposition of the compound, or if the option is really needed. If the goal 
> is to make the output compatible with a graph token filter, the current 
> workaround of setting the mode to NORMAL should be enough.
> That's consistent with the mode that should be used to preserve positions in 
> the index, since we don't handle position length on the indexing side. 
> Am I missing something regarding the new option? Is there a compelling case 
> where it differs from the NORMAL mode?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

