[ https://issues.apache.org/jira/browse/LUCENE-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205943#comment-17205943 ]
ASF subversion and git services commented on LUCENE-9458:
---------------------------------------------------------

Commit 587f7302b9de62f840d85ca325a660a9fbe1e3b0 in lucene-solr's branch refs/heads/branch_8x from David Smiley
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=587f730 ]

LUCENE-9458: WDGF should tie-break by endOffset (#1740)

Can happen with catenateAll and not generating word xor number part when the input ends with the non-generated sub-token.

Fuzzing revealed that only start & end offsets are needed to order sub-tokens.

(cherry picked from commit 0303063e12049d9b470757c5ef2fb9afa6c5cd18)

> WordDelimiterGraphFilter (and non-graph) should tie-break order using end offset
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-9458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9458
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Minor
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> WordDelimiterGraphFilter and WordDelimiterFilter do not consult the end
> offset in their sub-token _ordering_. In the event of a tie-break, I propose
> the longer token come first. This usually happens already, but not always,
> and so this also feels like an inconsistency when you see it. This issue can
> be thought of as a bug fix to LUCENE-9006 or an improvement; I have no strong
> feelings on the issue classification. Before reading further, definitely
> read that issue.
>
> I see this is a problem when using CATENATE_ALL with either
> GENERATE_WORD_PARTS xor GENERATE_NUMBER_PARTS when the input ends with that
> part not being generated. Consider the input: "other-9" and let's assume we
> want to catenate all, generate word parts, but nothing else (not numbers).
> This should be tokenized in this order: "other9", "other" but today is
> emitted in reverse order.
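
To make the proposed ordering concrete, here is a minimal sketch of a comparator that sorts sub-tokens by start offset and breaks ties by putting the larger end offset (the longer token) first. It is not the code from the commit: the SubTokenOrderSketch and SubToken classes and the sortedTexts helper are hypothetical names invented for illustration, and the offsets for "other-9" are worked out by hand rather than produced by the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SubTokenOrderSketch {

  /** Hypothetical stand-in for a sub-token; this is not a Lucene class. */
  public static final class SubToken {
    public final String text;
    public final int startOffset;
    public final int endOffset;

    public SubToken(String text, int startOffset, int endOffset) {
      this.text = text;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /** Start offset ascending; on a tie, end offset descending (longer token first). */
  public static final Comparator<SubToken> ORDER =
      Comparator.comparingInt((SubToken t) -> t.startOffset)
          .thenComparing(Comparator.comparingInt((SubToken t) -> t.endOffset).reversed());

  /** Returns the token texts in the proposed emission order, leaving the input untouched. */
  public static List<String> sortedTexts(List<SubToken> tokens) {
    List<SubToken> copy = new ArrayList<>(tokens);
    copy.sort(ORDER);
    List<String> texts = new ArrayList<>();
    for (SubToken t : copy) {
      texts.add(t.text);
    }
    return texts;
  }

  public static void main(String[] args) {
    // Input "other-9" with catenate-all and word parts only (no number parts):
    // the word part "other" covers offsets [0,5) and the catenation "other9" covers [0,7).
    List<SubToken> tokens = Arrays.asList(
        new SubToken("other", 0, 5),
        new SubToken("other9", 0, 7));
    System.out.println(sortedTexts(tokens)); // prints [other9, other]
  }
}
```

Both sub-tokens start at offset 0, so the start offset alone cannot order them; comparing the end offset in descending order is what places "other9" ahead of "other".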