[GitHub] [lucene-solr] jpountz commented on a change in pull request #1351: LUCENE-9280: Collectors to skip noncompetitive documents

GitBox Fri, 24 Apr 2020 01:51:43 -0700


jpountz commented on a change in pull request #1351:
URL: https://github.com/apache/lucene-solr/pull/1351#discussion_r414392251




##########
File path: 
lucene/core/src/java/org/apache/lucene/search/FilterLeafCollector.java
##########
@@ -53,4 +53,8 @@ public String toString() {
     return name + "(" + in + ")";
   }
 
+  @Override
+  public DocIdSetIterator competitiveIterator() {
+    return in.competitiveIterator();
+  }

Review comment:
       We've had endless discussions about whether or not to delegate in 
FilterXXX classes and I think that the consensus is that we should only 
delegate abstract methods. Since this one has a default implementation, let's 
not delegate and look for extensions of FilterCollector that should delegate 
it? (e.g. asserting collectors)

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/FilteringFieldComparator.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+/**
+ * Decorates a wrapped FieldComparator to add a functionality to skip over 
non-competitive docs.
+ * FilteringFieldComparator provides two additional functions for a 
FieldComparator:
+ * 1) {@code competitiveIterator()} that returns an iterator over
+ *      competitive docs that are stronger than already collected docs.
+ * 2) {@code setCanUpdateIterator()} that notifies the comparator when it is 
ok to start updating its internal iterator.
+ *  This method is called from a collector to inform the comparator to start 
updating its iterator.
+ */
+public abstract class FilteringFieldComparator<T> extends FieldComparator<T> {
+    final FieldComparator<T> in;
+
+    public FilteringFieldComparator(FieldComparator<T> in) {
+        this.in = in;
+    }
+
+    protected abstract DocIdSetIterator competitiveIterator();
+
+    protected abstract void setCanUpdateIterator() throws IOException;

Review comment:
       can you add javadocs?

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/FilteringFieldComparator.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+/**
+ * Decorates a wrapped FieldComparator to add a functionality to skip over 
non-competitive docs.
+ * FilteringFieldComparator provides two additional functions for a 
FieldComparator:
+ * 1) {@code competitiveIterator()} that returns an iterator over
+ *      competitive docs that are stronger than already collected docs.
+ * 2) {@code setCanUpdateIterator()} that notifies the comparator when it is 
ok to start updating its internal iterator.
+ *  This method is called from a collector to inform the comparator to start 
updating its iterator.
+ */
+public abstract class FilteringFieldComparator<T> extends FieldComparator<T> {
+    final FieldComparator<T> in;
+
+    public FilteringFieldComparator(FieldComparator<T> in) {
+        this.in = in;
+    }
+
+    protected abstract DocIdSetIterator competitiveIterator();

Review comment:
       Let's only have this method on LeafFieldCompatarors, e.g. by doing 
something like this? FieldComparators are top-level objects so it doesn't make 
sense to have leaf-level objects defined on them like DocIdSetIterators.
   
   ```suggestion
     @Override
     public abstract FilteringLeafFieldComparator 
getLeafComparator(LeafReaderContext context) throws IOException; // covariant 
return type
   
     public static interface FilteringLeafFieldComparator extends 
LeafFieldComparator {
       DocIdSetIterator competitiveIterator();
     }
   ```
   
   Then there's a question of whether setCanUpdateIterator should go there too 
as well or not.

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/FilteringNumericComparator.java
##########
@@ -0,0 +1,329 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import org.apache.lucene.document.DoublePoint;
+import org.apache.lucene.document.FloatPoint;
+import org.apache.lucene.document.IntPoint;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.util.DocIdSetBuilder;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+/**
+ * A wrapper over {@code NumericComparator} that adds a functionality to 
filter non-competitive docs.
+ */
+public abstract class FilteringNumericComparator<T extends Number> extends 
FilteringFieldComparator<T> implements LeafFieldComparator {
+    protected final boolean reverse;
+    private boolean hasTopValue = false;
+    private PointValues pointValues;
+    private final int bytesCount;
+    private final byte[] minValueAsBytes;
+    private final byte[] maxValueAsBytes;
+    private boolean minValueExist = false;
+    private boolean maxValueExist = false;
+    private int maxDoc;
+    private int maxDocVisited;
+    private int updateCounter = 0;
+    private final String field;
+    protected boolean canUpdateIterator = false; // set to true when queue 
becomes full and hitsThreshold is reached
+    protected DocIdSetIterator competitiveIterator = null;
+    private long iteratorCost = 0;
+
+    public FilteringNumericComparator(NumericComparator<T> in, boolean 
reverse, int bytesCount) {
+        super(in);
+        this.field = in.field;
+        this.bytesCount = bytesCount;
+        this.reverse = reverse;
+        minValueAsBytes = new byte[bytesCount];
+        maxValueAsBytes = new byte[bytesCount];
+        if (reverse) {
+            minValueExist = true;
+        } else {
+            maxValueExist = true;
+        }
+    }
+
+    /**
+     * Returns an iterator over competitive documents
+     */
+    @Override
+    public DocIdSetIterator competitiveIterator() {
+        if (competitiveIterator == null) return null;
+        return new DocIdSetIterator() {
+            private int doc;
+            @Override
+            public int nextDoc() throws IOException {
+                return doc = competitiveIterator.nextDoc();
+            }
+
+            @Override
+            public int docID() {
+                return doc;
+            }
+
+            @Override
+            public long cost() {
+                return competitiveIterator.cost();
+            }
+
+            @Override
+            public int advance(int target) throws IOException {
+                return doc = competitiveIterator.advance(target);
+            }
+        };
+    }
+
+    @Override
+    public void setCanUpdateIterator() throws IOException {
+        this.canUpdateIterator = true;
+        // for the 1st time queue becomes full and hitsThreshold is reached
+        // we can start updating competitive iterator
+        updateCompetitiveIterator();
+    }
+
+    @Override
+    public void setTopValue(T value) {
+        hasTopValue = true;
+        if (reverse) {
+            maxValueExist = true;
+        } else {
+            minValueExist = true;
+        }
+        in.setTopValue(value);
+    }
+
+    @Override
+    public void setBottom(int slot) throws IOException {
+        ((NumericComparator) in).setBottom(slot);
+        updateCompetitiveIterator(); // update an iterator if we set a new 
bottom
+    }
+
+    @Override
+    public int compareBottom(int doc) throws IOException {
+        return ((NumericComparator) in).compareBottom(doc);
+    }
+
+    @Override
+    public int compareTop(int doc) throws IOException {
+        return ((NumericComparator) in).compareTop(doc);
+    }
+
+    @Override
+    public void copy(int slot, int doc) throws IOException {
+        ((NumericComparator) in).copy(slot, doc);
+    }
+
+    @Override
+    public void setScorer(Scorable scorer) throws IOException {
+        ((NumericComparator) in).setScorer(scorer);
+        if (scorer instanceof Scorer) {
+            iteratorCost = ((Scorer) scorer).iterator().cost(); // starting 
iterator cost is the scorer's cost
+            updateCompetitiveIterator(); // update an iterator when we have a 
new segment
+        }

Review comment:
       Let's open a separate issue about adding `Scorable.cost()` to make such 
optimizations easier?

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/FilteringFieldComparator.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+/**
+ * Decorates a wrapped FieldComparator to add a functionality to skip over 
non-competitive docs.
+ * FilteringFieldComparator provides two additional functions for a 
FieldComparator:
+ * 1) {@code competitiveIterator()} that returns an iterator over
+ *      competitive docs that are stronger than already collected docs.
+ * 2) {@code setCanUpdateIterator()} that notifies the comparator when it is 
ok to start updating its internal iterator.
+ *  This method is called from a collector to inform the comparator to start 
updating its iterator.
+ */
+public abstract class FilteringFieldComparator<T> extends FieldComparator<T> {
+    final FieldComparator<T> in;

Review comment:
       indentation looks wrong, Lucene uses 2-spaces indentation

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/FilteringNumericComparator.java
##########
@@ -0,0 +1,329 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import org.apache.lucene.document.DoublePoint;
+import org.apache.lucene.document.FloatPoint;
+import org.apache.lucene.document.IntPoint;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.util.DocIdSetBuilder;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+/**
+ * A wrapper over {@code NumericComparator} that adds a functionality to 
filter non-competitive docs.
+ */
+public abstract class FilteringNumericComparator<T extends Number> extends 
FilteringFieldComparator<T> implements LeafFieldComparator {
+    protected final boolean reverse;
+    private boolean hasTopValue = false;
+    private PointValues pointValues;
+    private final int bytesCount;
+    private final byte[] minValueAsBytes;
+    private final byte[] maxValueAsBytes;
+    private boolean minValueExist = false;
+    private boolean maxValueExist = false;
+    private int maxDoc;
+    private int maxDocVisited;
+    private int updateCounter = 0;
+    private final String field;
+    protected boolean canUpdateIterator = false; // set to true when queue 
becomes full and hitsThreshold is reached
+    protected DocIdSetIterator competitiveIterator = null;
+    private long iteratorCost = 0;
+
+    public FilteringNumericComparator(NumericComparator<T> in, boolean 
reverse, int bytesCount) {
+        super(in);
+        this.field = in.field;
+        this.bytesCount = bytesCount;
+        this.reverse = reverse;
+        minValueAsBytes = new byte[bytesCount];
+        maxValueAsBytes = new byte[bytesCount];
+        if (reverse) {
+            minValueExist = true;
+        } else {
+            maxValueExist = true;
+        }
+    }
+
+    /**
+     * Returns an iterator over competitive documents
+     */
+    @Override
+    public DocIdSetIterator competitiveIterator() {
+        if (competitiveIterator == null) return null;
+        return new DocIdSetIterator() {
+            private int doc;
+            @Override
+            public int nextDoc() throws IOException {
+                return doc = competitiveIterator.nextDoc();
+            }
+
+            @Override
+            public int docID() {
+                return doc;
+            }
+
+            @Override
+            public long cost() {
+                return competitiveIterator.cost();
+            }
+
+            @Override
+            public int advance(int target) throws IOException {
+                return doc = competitiveIterator.advance(target);
+            }
+        };
+    }
+
+    @Override
+    public void setCanUpdateIterator() throws IOException {
+        this.canUpdateIterator = true;
+        // for the 1st time queue becomes full and hitsThreshold is reached
+        // we can start updating competitive iterator
+        updateCompetitiveIterator();
+    }
+
+    @Override
+    public void setTopValue(T value) {
+        hasTopValue = true;
+        if (reverse) {
+            maxValueExist = true;
+        } else {
+            minValueExist = true;
+        }
+        in.setTopValue(value);
+    }
+
+    @Override
+    public void setBottom(int slot) throws IOException {
+        ((NumericComparator) in).setBottom(slot);
+        updateCompetitiveIterator(); // update an iterator if we set a new 
bottom
+    }
+
+    @Override
+    public int compareBottom(int doc) throws IOException {
+        return ((NumericComparator) in).compareBottom(doc);
+    }
+
+    @Override
+    public int compareTop(int doc) throws IOException {
+        return ((NumericComparator) in).compareTop(doc);
+    }
+
+    @Override
+    public void copy(int slot, int doc) throws IOException {
+        ((NumericComparator) in).copy(slot, doc);
+    }
+
+    @Override
+    public void setScorer(Scorable scorer) throws IOException {
+        ((NumericComparator) in).setScorer(scorer);
+        if (scorer instanceof Scorer) {
+            iteratorCost = ((Scorer) scorer).iterator().cost(); // starting 
iterator cost is the scorer's cost
+            updateCompetitiveIterator(); // update an iterator when we have a 
new segment
+        }
+    }
+
+    @Override
+    public final LeafFieldComparator getLeafComparator(LeafReaderContext 
context) throws IOException {
+        ((NumericComparator) in).doSetNextReader(context);
+        maxDoc = context.reader().maxDoc();
+        maxDocVisited = 0;
+        pointValues = context.reader().getPointValues(field);
+        // TODO: optimize a case when pointValues are missing only on this 
segment
+        competitiveIterator = pointValues == null ? null : 
DocIdSetIterator.all(maxDoc);
+        iteratorCost = 0;
+        return this;

Review comment:
       Related to my above comment, I think we could make things cleaner by 
implementing the comparator and the leaf comparator in different classes. Then 
each class would be smaller and many fields could become final.

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/FilteringNumericComparator.java
##########
@@ -0,0 +1,329 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import org.apache.lucene.document.DoublePoint;
+import org.apache.lucene.document.FloatPoint;
+import org.apache.lucene.document.IntPoint;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PointValues;
+import org.apache.lucene.util.DocIdSetBuilder;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+/**
+ * A wrapper over {@code NumericComparator} that adds a functionality to 
filter non-competitive docs.
+ */
+public abstract class FilteringNumericComparator<T extends Number> extends 
FilteringFieldComparator<T> implements LeafFieldComparator {
+    protected final boolean reverse;
+    private boolean hasTopValue = false;
+    private PointValues pointValues;
+    private final int bytesCount;
+    private final byte[] minValueAsBytes;
+    private final byte[] maxValueAsBytes;
+    private boolean minValueExist = false;
+    private boolean maxValueExist = false;
+    private int maxDoc;
+    private int maxDocVisited;
+    private int updateCounter = 0;
+    private final String field;
+    protected boolean canUpdateIterator = false; // set to true when queue 
becomes full and hitsThreshold is reached
+    protected DocIdSetIterator competitiveIterator = null;
+    private long iteratorCost = 0;
+
+    public FilteringNumericComparator(NumericComparator<T> in, boolean 
reverse, int bytesCount) {
+        super(in);
+        this.field = in.field;
+        this.bytesCount = bytesCount;
+        this.reverse = reverse;
+        minValueAsBytes = new byte[bytesCount];
+        maxValueAsBytes = new byte[bytesCount];
+        if (reverse) {
+            minValueExist = true;
+        } else {
+            maxValueExist = true;
+        }
+    }
+
+    /**
+     * Returns an iterator over competitive documents
+     */
+    @Override
+    public DocIdSetIterator competitiveIterator() {
+        if (competitiveIterator == null) return null;
+        return new DocIdSetIterator() {
+            private int doc;
+            @Override
+            public int nextDoc() throws IOException {
+                return doc = competitiveIterator.nextDoc();
+            }
+
+            @Override
+            public int docID() {
+                return doc;
+            }
+
+            @Override
+            public long cost() {
+                return competitiveIterator.cost();
+            }
+
+            @Override
+            public int advance(int target) throws IOException {
+                return doc = competitiveIterator.advance(target);
+            }
+        };
+    }
+
+    @Override
+    public void setCanUpdateIterator() throws IOException {
+        this.canUpdateIterator = true;
+        // for the 1st time queue becomes full and hitsThreshold is reached
+        // we can start updating competitive iterator
+        updateCompetitiveIterator();
+    }
+
+    @Override
+    public void setTopValue(T value) {
+        hasTopValue = true;
+        if (reverse) {
+            maxValueExist = true;
+        } else {
+            minValueExist = true;
+        }
+        in.setTopValue(value);
+    }
+
+    @Override
+    public void setBottom(int slot) throws IOException {
+        ((NumericComparator) in).setBottom(slot);
+        updateCompetitiveIterator(); // update an iterator if we set a new 
bottom
+    }
+
+    @Override
+    public int compareBottom(int doc) throws IOException {
+        return ((NumericComparator) in).compareBottom(doc);
+    }
+
+    @Override
+    public int compareTop(int doc) throws IOException {
+        return ((NumericComparator) in).compareTop(doc);
+    }
+
+    @Override
+    public void copy(int slot, int doc) throws IOException {
+        ((NumericComparator) in).copy(slot, doc);
+    }
+
+    @Override
+    public void setScorer(Scorable scorer) throws IOException {
+        ((NumericComparator) in).setScorer(scorer);
+        if (scorer instanceof Scorer) {
+            iteratorCost = ((Scorer) scorer).iterator().cost(); // starting 
iterator cost is the scorer's cost
+            updateCompetitiveIterator(); // update an iterator when we have a 
new segment
+        }
+    }
+
+    @Override
+    public final LeafFieldComparator getLeafComparator(LeafReaderContext 
context) throws IOException {
+        ((NumericComparator) in).doSetNextReader(context);
+        maxDoc = context.reader().maxDoc();
+        maxDocVisited = 0;
+        pointValues = context.reader().getPointValues(field);
+        // TODO: optimize a case when pointValues are missing only on this 
segment
+        competitiveIterator = pointValues == null ? null : 
DocIdSetIterator.all(maxDoc);
+        iteratorCost = 0;

Review comment:
       I'd rather have maxDoc as a default, the reason being that this 
optimization otherwise wouldn't apply with MatchAllDocsQuery, which would be a 
pity.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1351: LUCENE-9280: Collectors to skip noncompetitive documents

Reply via email to