[ https://issues.apache.org/jira/browse/LUCENE-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley resolved LUCENE-9458.
----------------------------------
    Fix Version/s: 8.7
       Resolution: Fixed

> WordDelimiterGraphFilter (and non-graph) should tie-break order using end offset
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-9458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9458
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 8.7
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> WordDelimiterGraphFilter and WordDelimiterFilter do not consult the end
> offset in their sub-token _ordering_. When two sub-tokens tie, I propose
> that the longer token come first. This usually happens already, but not
> always, so it feels like an inconsistency when you see it. This issue can
> be thought of as a bug fix to LUCENE-9006 or as an improvement; I have no
> strong feelings on the issue classification. Before reading further,
> definitely read that issue.
>
> I see this problem when using CATENATE_ALL with either GENERATE_WORD_PARTS
> xor GENERATE_NUMBER_PARTS, when the input ends with the kind of part that
> is not being generated. Consider the input "other-9", and assume we want
> to catenate all and generate word parts, but nothing else (not numbers).
> This should be tokenized in this order: "other9", "other". Today the two
> tokens are emitted in the reverse order.
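
The proposed tie-break can be illustrated with a small standalone sketch. This is not the Lucene implementation: the SubToken class, the PROPOSED_ORDER comparator and the order() helper are hypothetical names made up for illustration. It assumes that sub-tokens are ordered by start offset first, and that the catenated token "other9" covers the whole input "other-9" (offsets 0 to 7) while "other" covers offsets 0 to 5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SubTokenOrderSketch {

    /** Hypothetical stand-in for a sub-token; not a Lucene class. */
    static final class SubToken {
        final String text;
        final int startOffset;
        final int endOffset;

        SubToken(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Assumed ordering: earlier start offset first; on a tie, the larger
     * end offset (the longer token) first.
     */
    static final Comparator<SubToken> PROPOSED_ORDER =
        Comparator.comparingInt((SubToken t) -> t.startOffset)
            .thenComparing(
                Comparator.comparingInt((SubToken t) -> t.endOffset).reversed());

    /** Returns the token texts sorted by PROPOSED_ORDER, leaving the input untouched. */
    static List<String> order(List<SubToken> tokens) {
        List<SubToken> sorted = new ArrayList<>(tokens);
        sorted.sort(PROPOSED_ORDER);
        List<String> texts = new ArrayList<>();
        for (SubToken t : sorted) {
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sub-tokens for "other-9" in the order the issue says they are emitted today.
        List<SubToken> emittedToday = Arrays.asList(
            new SubToken("other", 0, 5),    // word part
            new SubToken("other9", 0, 7));  // catenated token, assumed to span the whole input
        System.out.println(order(emittedToday)); // prints [other9, other]
    }
}
```

Both tokens start at offset 0, so only the end offset separates them; sorting it in descending order puts "other9" ahead of "other", which is the order the issue asks for.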