[ 
https://issues.apache.org/jira/browse/LUCENE-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley resolved LUCENE-9458.
----------------------------------
    Fix Version/s: 8.7
       Resolution: Fixed

> WordDelimiterGraphFilter (and non-graph) should tie-break order using end 
> offset
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-9458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9458
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 8.7
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> WordDelimiterGraphFilter and WordDelimiterFilter do not consult the end 
> offset in their sub-token _ordering_.  In the event of a tie, I propose 
> that the longer token come first.  This usually happens already, but not 
> always, so it reads as an inconsistency when you see it.  This issue can 
> be thought of as a bug fix to LUCENE-9006 or as an improvement; I have no 
> strong feelings on the issue classification.  Before reading further, 
> definitely read that issue.
>
> I see this as a problem when using CATENATE_ALL with exactly one of 
> GENERATE_WORD_PARTS or GENERATE_NUMBER_PARTS, when the input ends with the 
> part type that is not being generated.  Consider the input "other-9" and 
> assume we want to catenate all and generate word parts, but nothing else 
> (no number parts).  This should be tokenized in the order "other9", 
> "other", but today the tokens are emitted in the reverse order.
>  
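
To make the proposed tie-break concrete, here is a minimal, self-contained sketch of the ordering rule. It is not the actual Lucene code: the SubToken class, the PROPOSED_ORDER comparator, and the order() helper are hypothetical names invented for illustration, and the sketch assumes that start offset is the primary sort key, with ties broken by the larger end offset (the longer token) first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SubTokenOrderSketch {

    /** Hypothetical stand-in for a sub-token; this is not a Lucene class. */
    static final class SubToken {
        final String text;
        final int startOffset;
        final int endOffset;

        SubToken(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Assumed ordering: start offset ascending, then end offset descending,
     * so that on a start-offset tie the longer token comes first.
     */
    static final Comparator<SubToken> PROPOSED_ORDER =
        Comparator.comparingInt((SubToken t) -> t.startOffset)
            .thenComparing(
                Comparator.comparingInt((SubToken t) -> t.endOffset).reversed());

    /** Returns the token texts sorted by PROPOSED_ORDER; the input is not modified. */
    static List<String> order(List<SubToken> tokens) {
        List<SubToken> sorted = new ArrayList<>(tokens);
        sorted.sort(PROPOSED_ORDER);
        List<String> texts = new ArrayList<>();
        for (SubToken t : sorted) {
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Input "other-9": the word part "other" spans offsets 0..5 and the
        // catenated "other9" spans 0..7, so both share start offset 0.
        List<SubToken> emitted = Arrays.asList(
            new SubToken("other", 0, 5),
            new SubToken("other9", 0, 7));
        System.out.println(order(emitted)); // prints [other9, other]
    }
}
```

With the "other-9" example from the description, both sub-tokens start at offset 0, so only the end offset decides the order and "other9" is emitted before "other".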



