[ https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031632#comment-17031632 ]
ASF subversion and git services commented on LUCENE-9099:
---------------------------------------------------------

Commit 7c1ba1aebeea540b67ae304deee60162baee2e12 in lucene-solr's branch refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7c1ba1a ]

LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals (#1097)

If you have repeating intervals in an ordered or unordered interval source,
you currently get somewhat confusing behaviour:

* `ORDERED(a, a, b)` will return an extra interval over just `a b` if it first
  matches `a a b`, meaning that you can get incorrect results if used in a
  `CONTAINING` filter - `CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match
  on the document `a x a b y`
* `UNORDERED(a, a)` will match on documents that contain just a single `a`.

This commit adds a RepeatingIntervalsSource that correctly handles repeats
within ordered and unordered sources. It also changes the way that gaps are
calculated within ordered and unordered sources, by using a new width() method
on IntervalIterator. The default implementation just returns
end() - start() + 1, but RepeatingIntervalsSource instead returns the sum of
the widths of its child iterators. This preserves maxgaps filtering on ordered
and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches
now always returns an explicit IntervalsMatchesIterator rather than a plain
MatchesIterator, which adds gaps() and width() methods so that submatches can
be combined in the same way that subiterators are.

Extra checks have been added to checkIntervals() to ensure that the same
intervals are returned by both iterator and matches, and a fix to
DisjunctionIntervalIterator#matches() is also included -
DisjunctionIntervalIterator minimizes its intervals, while
MatchesUtils.disjunction does not, so there was a discrepancy between the two
methods.
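To make the width() change concrete, here is a minimal, self-contained sketch of why summing child widths keeps gap counting meaningful for repeats. It does not use Lucene: `Span`, `Range`, `Repeat` and `gaps` are hypothetical names, and the gap formula (positions spanned minus positions claimed by the parts) is an assumption made for illustration, not a copy of the committed code.

```java
import java.util.List;

final class WidthSketch {

    /** Simplified stand-in for an interval; this is NOT Lucene's IntervalIterator. */
    interface Span {
        int start();
        int end();

        /** Default width as described in the commit message: end() - start() + 1. */
        default int width() {
            return end() - start() + 1;
        }
    }

    /** A plain interval over [start, end] that keeps the default width. */
    static final class Range implements Span {
        private final int start;
        private final int end;
        Range(int start, int end) { this.start = start; this.end = end; }
        @Override public int start() { return start; }
        @Override public int end() { return end; }
    }

    /** Hypothetical model of a repeat: its width is the sum of its child widths. */
    static final class Repeat implements Span {
        private final List<Span> children;
        Repeat(List<Span> children) { this.children = children; }
        @Override public int start() { return children.get(0).start(); }
        @Override public int end() { return children.get(children.size() - 1).end(); }
        @Override public int width() {
            return children.stream().mapToInt(Span::width).sum();
        }
    }

    /** Assumed gap formula: positions spanned overall minus positions claimed by the parts. */
    static int gaps(List<Span> parts) {
        int start = parts.get(0).start();
        int end = parts.get(parts.size() - 1).end();
        int claimed = parts.stream().mapToInt(Span::width).sum();
        return (end - start + 1) - claimed;
    }

    public static void main(String[] args) {
        // Document "a x a b": a=0, x=1, a=2, b=3.
        Span repeat = new Repeat(List.of(new Range(0, 0), new Range(2, 2)));
        Span naive = new Range(0, 2); // same extent, default width
        Span b = new Range(3, 3);
        System.out.println(gaps(List.of(repeat, b))); // prints 1 (the x is counted)
        System.out.println(gaps(List.of(naive, b)));  // prints 0 (the x is hidden)
    }
}
```

Under these assumptions, the repeat over `a ... a` reports a width of 2 rather than 3, so the `x` between the two `a` terms still counts as a gap when a maxgaps-style filter is applied to the enclosing source.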
> Correctly handle repeats in ordered and unordered intervals
> -----------------------------------------------------------
>
>                 Key: LUCENE-9099
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9099
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If you have repeating intervals in an ordered or unordered interval source,
> you currently get somewhat confusing behaviour:
> * ORDERED(a, a, b) will return an extra interval over just `a b` if it first
> matches `a a b`, meaning that you can get incorrect results if used in a
> CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on
> the document `a x a b y`
> * UNORDERED(a, a) will match on documents that contain just a single `a`.
> It is possible to deal with the unordered case when building sources by
> rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a,
> b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks
> MAXGAPS filtering.
> We should try and fix this within intervals themselves.
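The build-time rewrite mentioned in the issue description (grouping duplicate clauses of an UNORDERED source into nested ORDERED clauses) can be sketched as follows. This is an illustrative model that works on clause names as strings; `UnorderedRewriteSketch` and `rewrite` are hypothetical names, and nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UnorderedRewriteSketch {

    /**
     * Groups repeated clauses into ORDERED(...) blocks, keeping the order in
     * which each distinct clause first appears. Clauses are plain strings here.
     */
    static String rewrite(List<String> clauses) {
        // LinkedHashMap preserves first-occurrence order of the distinct clauses.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String clause : clauses) {
            counts.merge(clause, 1, Integer::sum);
        }
        List<String> rewritten = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {
                rewritten.add(e.getKey());
            } else {
                String repeated = String.join(", ", Collections.nCopies(e.getValue(), e.getKey()));
                rewritten.add("ORDERED(" + repeated + ")");
            }
        }
        return "UNORDERED(" + String.join(", ", rewritten) + ")";
    }

    public static void main(String[] args) {
        // prints UNORDERED(ORDERED(a, a), ORDERED(b, b), c)
        System.out.println(rewrite(List.of("a", "b", "c", "a", "b")));
    }
}
```

As the issue notes, this rewrite on its own breaks MAXGAPS filtering, which is why the fix was made inside the interval sources rather than at build time.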