[ https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031632#comment-17031632 ]
ASF subversion and git services commented on LUCENE-9099:
---------------------------------------------------------
Commit 7c1ba1aebeea540b67ae304deee60162baee2e12 in lucene-solr's branch
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7c1ba1a ]
LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals (#1097)
If you have repeating intervals in an ordered or unordered interval source, you
currently get somewhat confusing behaviour:
* `ORDERED(a, a, b)` will return an extra interval over just `a b` if it first
matches `a a b`, meaning that you can get incorrect results if used in a
`CONTAINING` filter - `CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match
on the document `a x a b y`
* `UNORDERED(a, a)` will match on documents that contain just a single `a`.
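
To make the first case concrete, here is a minimal sketch that models intervals
as plain position pairs over the document `a x a b y` (positions 0 to 4). It
does not use the Lucene API; `ContainingSketch`, `Interval` and `containing()`
are hypothetical names, and the interval lists are written out by hand rather
than produced by real iterators.

```java
import java.util.Arrays;
import java.util.List;

final class ContainingSketch {
    /** Closed interval of token positions, [start, end]. Hypothetical helper type. */
    static final class Interval {
        final int start;
        final int end;

        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean contains(Interval other) {
            return start <= other.start && other.end <= end;
        }
    }

    /** Simplified CONTAINING(big, small): true if any big interval contains any small one. */
    static boolean containing(List<Interval> big, List<Interval> small) {
        for (Interval b : big) {
            for (Interval s : small) {
                if (b.contains(s)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document: a(0) x(1) a(2) b(3) y(4)
        List<Interval> xy = Arrays.asList(new Interval(1, 4));      // ORDERED(x, y)
        List<Interval> withExtra = Arrays.asList(
            new Interval(0, 3),                                     // a a b
            new Interval(2, 3));                                    // spurious interval over just a b
        List<Interval> repeatAware = Arrays.asList(new Interval(0, 3));

        System.out.println(containing(xy, withExtra));   // true: [1,4] contains [2,3]
        System.out.println(containing(xy, repeatAware)); // false: [1,4] does not contain [0,3]
    }
}
```

The spurious `[2,3]` interval is what lets the `CONTAINING` filter match; with
only `[0,3]` reported, the filter correctly rejects the document.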
This commit adds a RepeatingIntervalsSource that correctly handles repeats
within ordered and unordered sources. It also changes the way that gaps are
calculated within ordered and unordered sources, by using a new width() method
on IntervalIterator. The default implementation just returns
end() - start() + 1, but RepeatingIntervalsSource instead returns the sum of
the widths of its child iterators. This preserves maxgaps filtering on ordered
and unordered sources that contain repeats.
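
The effect of width() on gap counting can be sketched as follows. This is an
illustration only, not the Lucene implementation: `SketchInterval`, `Span`,
`RepeatedSpan` and `WidthSketch.gaps()` are hypothetical names, and the gap
formula (overall span minus the summed child widths) is an assumption about
how the widths are combined.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an interval exposing a width. */
interface SketchInterval {
    int start();

    int end();

    /** Default width: every position from start to end counts as covered. */
    default int width() {
        return end() - start() + 1;
    }
}

final class Span implements SketchInterval {
    private final int start;
    private final int end;

    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }

    public int start() { return start; }

    public int end() { return end; }
}

/** A repeated match: only the positions covered by its children count as width. */
final class RepeatedSpan implements SketchInterval {
    private final List<SketchInterval> children; // assumed sorted by position

    RepeatedSpan(List<SketchInterval> children) {
        this.children = children;
    }

    public int start() { return children.get(0).start(); }

    public int end() { return children.get(children.size() - 1).end(); }

    @Override
    public int width() {
        int sum = 0;
        for (SketchInterval child : children) {
            sum += child.width();
        }
        return sum;
    }
}

final class WidthSketch {
    /** Assumed gap formula: positions in the overall span not covered by any child. */
    static int gaps(List<SketchInterval> children) {
        int start = children.get(0).start();
        int end = children.get(children.size() - 1).end();
        int covered = 0;
        for (SketchInterval child : children) {
            covered += child.width();
        }
        return (end - start + 1) - covered;
    }

    public static void main(String[] args) {
        // Document: a(0) x(1) a(2) b(3), query ORDERED(a, a, b)
        SketchInterval a0 = new Span(0, 0);
        SketchInterval a2 = new Span(2, 2);
        SketchInterval b = new Span(3, 3);
        SketchInterval repeat = new RepeatedSpan(Arrays.asList(a0, a2));
        SketchInterval flat = new Span(0, 2); // same positions, default width

        System.out.println(gaps(Arrays.asList(repeat, b))); // 1: the x at position 1
        System.out.println(gaps(Arrays.asList(flat, b)));   // 0: the x is hidden
    }
}
```

With the default width, the repeated `a a` match would swallow the `x` between
the two occurrences and a maxgaps filter would never see it; summing the child
widths keeps that gap visible.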
In order to correctly handle matches in this scenario, IntervalsSource#matches
now always returns an explicit IntervalsMatchesIterator rather than a plain
MatchesIterator, which adds gaps() and width() methods so that submatches can
be combined in the same way that subiterators are. Extra checks have been added
to checkIntervals() to ensure that the same intervals are returned by both
iterator and matches, and a fix to DisjunctionIntervalIterator#matches() is
also included - DisjunctionIntervalIterator minimizes its intervals, while
MatchesUtils.disjunction does not, so there was a discrepancy between the two
methods.
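
As a rough illustration of what minimizing means here, the sketch below drops
every interval that strictly contains another one, so only the minimal
intervals remain. It assumes that reading of "minimizes" and is not the Lucene
code; `MinimizeSketch` and `minimize()` are hypothetical names, and intervals
are modelled as `{start, end}` arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MinimizeSketch {
    /**
     * Keeps only intervals that do not contain a different interval.
     * Each interval is a two-element array {start, end}, both inclusive.
     * Identical intervals do not eliminate each other.
     */
    static List<int[]> minimize(List<int[]> intervals) {
        List<int[]> minimal = new ArrayList<>();
        for (int[] candidate : intervals) {
            boolean containsSmaller = false;
            for (int[] other : intervals) {
                boolean sameSpan = candidate[0] == other[0] && candidate[1] == other[1];
                boolean contains = candidate[0] <= other[0] && other[1] <= candidate[1];
                if (contains && !sameSpan) {
                    containsSmaller = true;
                    break;
                }
            }
            if (!containsSmaller) {
                minimal.add(candidate);
            }
        }
        return minimal;
    }

    public static void main(String[] args) {
        List<int[]> all = Arrays.asList(new int[] {0, 3}, new int[] {2, 3}, new int[] {5, 5});
        for (int[] interval : minimize(all)) {
            System.out.println(interval[0] + "-" + interval[1]); // prints 2-3 then 5-5
        }
    }
}
```

An unminimized disjunction would also report `0-3`, which is the kind of
mismatch between iterator and matches that the added checks are meant to catch.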
> Correctly handle repeats in ordered and unordered intervals
> -----------------------------------------------------------
>
> Key: LUCENE-9099
> URL: https://issues.apache.org/jira/browse/LUCENE-9099
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If you have repeating intervals in an ordered or unordered interval source,
> you currently get somewhat confusing behaviour:
> * ORDERED(a, a, b) will return an extra interval over just `a b` if it first
> matches `a a b`, meaning that you can get incorrect results if used in a
> CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on
> the document `a x a b y`
> * UNORDERED(a, a) will match on documents that contain just a single `a`.
> It is possible to deal with the unordered case when building sources by
> rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a,
> b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks
> MAXGAPS filtering.
> We should try and fix this within intervals themselves.