[jira] [Commented] (LUCENE-9099) Correctly handle repeats in ordered and unordered intervals

ASF subversion and git services (Jira) Thu, 06 Feb 2020 06:46:13 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031632#comment-17031632 ]


ASF subversion and git services commented on LUCENE-9099:
---------------------------------------------------------

Commit 7c1ba1aebeea540b67ae304deee60162baee2e12 in lucene-solr's branch 
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7c1ba1a ]

LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals (#1097)

If you have repeating intervals in an ordered or unordered interval source, you
currently get somewhat confusing behaviour:

* `ORDERED(a, a, b)` will return an extra interval over just `a b` if it first
  matches `a a b`, meaning that you can get incorrect results if it is used in a
  `CONTAINING` filter: `CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match
  on the document `a x a b y`.
* `UNORDERED(a, a)` will match on documents that contain just a single `a`.
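
The second bullet can be illustrated with a small standalone sketch. It does not
use the Lucene API: `RepeatSketch`, `naiveUnordered` and `repeatAwareUnordered`
are hypothetical names, documents are plain token arrays, and only single-term
clauses are modelled. It assumes the problem arises because two identical
clauses can both be satisfied by the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene.
final class RepeatSketch {

    // Collect the positions at which a term occurs in the token array.
    static List<Integer> positions(String[] doc, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < doc.length; i++) {
            if (doc[i].equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    // Naive UNORDERED: every clause only needs *some* position, so two
    // identical clauses can share a single occurrence.
    static boolean naiveUnordered(String[] doc, String... terms) {
        for (String t : terms) {
            if (positions(doc, t).isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Repeat-aware UNORDERED: a term listed k times needs k distinct positions.
    static boolean repeatAwareUnordered(String[] doc, String... terms) {
        Map<String, Integer> needed = new HashMap<>();
        for (String t : terms) {
            needed.merge(t, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : needed.entrySet()) {
            if (positions(doc, e.getKey()).size() < e.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

With the one-token document `a`, `naiveUnordered(doc, "a", "a")` accepts while
`repeatAwareUnordered(doc, "a", "a")` rejects; both accept `a b a`.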

This commit adds a RepeatingIntervalsSource that correctly handles repeats within
ordered and unordered sources. It also changes the way that gaps are calculated
within ordered and unordered sources, by using a new width() method on
IntervalIterator. The default implementation just returns end() - start() + 1,
but RepeatingIntervalsSource instead returns the sum of the widths of its child
iterators. This preserves maxgaps filtering on ordered and unordered sources
that contain repeats.
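
To see why the width matters for maxgaps filtering, here is a minimal standalone
sketch. It is not the Lucene implementation: `WidthSketch`, `Interval`, `plain`,
`repeat` and `gaps` are hypothetical names, and the gap formula (overall span
minus the summed child widths) is an assumption made for illustration.

```java
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class WidthSketch {

    // A matched interval with inclusive positions and a reported width.
    static final class Interval {
        final int start;
        final int end;
        final int width;

        Interval(int start, int end, int width) {
            this.start = start;
            this.end = end;
            this.width = width;
        }
    }

    // Default width, as described above: end - start + 1.
    static Interval plain(int start, int end) {
        return new Interval(start, end, end - start + 1);
    }

    // A repeat spans all of its children but reports the sum of their widths.
    static Interval repeat(List<Interval> children) {
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        int width = 0;
        for (Interval c : children) {
            start = Math.min(start, c.start);
            end = Math.max(end, c.end);
            width += c.width;
        }
        return new Interval(start, end, width);
    }

    // Assumed gap formula: overall span minus the summed child widths.
    static int gaps(List<Interval> children) {
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        int width = 0;
        for (Interval c : children) {
            start = Math.min(start, c.start);
            end = Math.max(end, c.end);
            width += c.width;
        }
        return (end - start + 1) - width;
    }
}
```

For the document `a x a b`, a repeat over the two `a` matches at positions 0 and
2 reports width 2, so combining it with `b` at position 3 yields one gap (the
`x`). If the repeat reported the default width of 3 instead, the same
combination would yield zero gaps and the `x` would be hidden from a maxgaps
filter.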

In order to correctly handle matches in this scenario, IntervalsSource#matches
now always returns an explicit IntervalsMatchesIterator rather than a plain
MatchesIterator. IntervalsMatchesIterator adds gaps() and width() methods so
that submatches can be combined in the same way that subiterators are. Extra
checks have been added to checkIntervals() to ensure that the same intervals are
returned by both iterator and matches. A fix to
DisjunctionIntervalIterator#matches() is also included:
DisjunctionIntervalIterator minimizes its intervals, while
MatchesUtils.disjunction does not, so there was a discrepancy between the two
methods.
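
As a rough illustration of what minimizing means here, the sketch below drops
every interval that contains another, different interval. It is a simple
quadratic filter over `{start, end}` pairs under hypothetical names
(`MinimizeSketch`, `minimize`), not the way DisjunctionIntervalIterator is
actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class MinimizeSketch {

    // Each interval is an int[]{start, end} with inclusive positions.
    // An interval is kept only if it does not contain a different interval.
    static List<int[]> minimize(List<int[]> intervals) {
        List<int[]> out = new ArrayList<>();
        for (int[] candidate : intervals) {
            boolean containsOther = false;
            for (int[] other : intervals) {
                boolean same = other[0] == candidate[0] && other[1] == candidate[1];
                boolean contains = candidate[0] <= other[0] && other[1] <= candidate[1];
                if (!same && contains) {
                    containsOther = true;
                    break;
                }
            }
            if (!containsOther) {
                out.add(candidate);
            }
        }
        return out;
    }
}
```

Given the intervals [0,3], [1,2] and [4,5], the filter keeps [1,2] and [4,5]. An
unminimized listing would also report [0,3], which is the kind of discrepancy
between two code paths that the added checkIntervals() checks are meant to
catch.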


> Correctly handle repeats in ordered and unordered intervals
> -----------------------------------------------------------
>
>                 Key: LUCENE-9099
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9099
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If you have repeating intervals in an ordered or unordered interval source, 
> you currently get somewhat confusing behaviour:
> * ORDERED(a, a, b) will return an extra interval over just `a b` if it first 
> matches `a a b`, meaning that you can get incorrect results if used in a 
> CONTAINING filter - CONTAINING(ORDERED(x, y), ORDERED(a, a, b)) will match on 
> the document `a x a b y`
> * UNORDERED(a, a) will match on documents that contain just a single `a`.
>
> It is possible to deal with the unordered case when building sources by 
> rewriting duplicates to nested ORDERED clauses, so that UNORDERED(a, b, c, a, 
> b) becomes UNORDERED(ORDERED(a, a), ORDERED(b, b), c), but this then breaks 
> MAXGAPS filtering.
>
> We should try and fix this within intervals themselves.
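
The rewrite mentioned in the issue description can be sketched on plain strings.
This is a hypothetical helper (`RewriteSketch.rewriteUnordered`), not Lucene
code: it only groups duplicate terms in first-occurrence order and prints the
resulting expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only; not part of Lucene.
final class RewriteSketch {

    // Group duplicate terms into nested ORDERED clauses, keeping the order
    // in which each term first appears.
    static String rewriteUnordered(String... terms) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : terms) {
            counts.merge(t, 1, Integer::sum);
        }
        StringJoiner outer = new StringJoiner(", ", "UNORDERED(", ")");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {
                // Terms that occur once are passed through unchanged.
                outer.add(e.getKey());
            } else {
                StringJoiner inner = new StringJoiner(", ", "ORDERED(", ")");
                for (int i = 0; i < e.getValue(); i++) {
                    inner.add(e.getKey());
                }
                outer.add(inner.toString());
            }
        }
        return outer.toString();
    }
}
```

Calling `rewriteUnordered("a", "b", "c", "a", "b")` produces
`UNORDERED(ORDERED(a, a), ORDERED(b, b), c)`, the form given in the description.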



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
