[GitHub] [lucene] jtibshirani commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

GitBox Fri, 24 Jun 2022 14:39:07 -0700


jtibshirani commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r906418634



##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -121,36 +120,50 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
+  private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IOException {
+    Bits liveDocs = ctx.reader().getLiveDocs();
+    int maxDoc = ctx.reader().maxDoc();
 
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+    if (filterWeight == null) {
+      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
     } else {
-      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
-      if (filterIterator == null || filterIterator.cost() == 0) {
+      Scorer scorer = filterWeight.scorer(ctx);
+      if (scorer == null) {
         return NO_RESULTS;
-      }
+      } else {
+        BitSetIterator filterIterator =
+            cacheIntoBitSetIterator(scorer.iterator(), liveDocs, maxDoc);
 
-      if (filterIterator.cost() <= k) {
-        // If there are <= k possible matches, short-circuit and perform exact search, since HNSW
-        // must always visit at least k documents
-        return exactSearch(ctx, filterIterator);
+        if (filterIterator.cost() <= k) {
+          return exactSearch(ctx, filterIterator);

Review Comment:
   Could we restore all the comments in this section? I think they're helpful 
in understanding the algorithm.



##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -121,36 +120,50 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
+  private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IOException {
+    Bits liveDocs = ctx.reader().getLiveDocs();
+    int maxDoc = ctx.reader().maxDoc();
 
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+    if (filterWeight == null) {
+      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
     } else {
-      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
-      if (filterIterator == null || filterIterator.cost() == 0) {
+      Scorer scorer = filterWeight.scorer(ctx);

Review Comment:
   Small suggestion: I often like to remove the "else" when the "if" branch
has already returned a value. This avoids a lot of deeply nested else/if
statements. The suggestion applies to a few places in this method.
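
To illustrate the shape this suggestion leads to, here is a minimal sketch of the same branching written with early returns. It is not the PR's code: `GuardClauseSketch`, the `Outcome` enum, and the `approximateCompleted` flag are hypothetical stand-ins, and `java.util.BitSet` takes the place of the Lucene types so the example runs on its own. A null filter plays the role of a null `filterWeight`, and an empty filter plays the role of a null `Scorer`.

```java
import java.util.BitSet;

public class GuardClauseSketch {
  /** Names the branch taken; a hypothetical stand-in for the TopDocs result. */
  enum Outcome { APPROXIMATE, NO_RESULTS, EXACT }

  /**
   * Hypothetical stand-in for searchLeaf, flattened with early returns.
   * approximateCompleted simulates whether the approximate search finished
   * within its visit limit (assumption: the real code checks totalHits.relation).
   */
  static Outcome searchLeaf(BitSet filter, int k, boolean approximateCompleted) {
    if (filter == null) {
      // No filter at all: run the approximate search directly.
      return Outcome.APPROXIMATE;
    }
    if (filter.isEmpty()) {
      // The filter matches nothing in this segment.
      return Outcome.NO_RESULTS;
    }
    if (filter.cardinality() <= k) {
      // At most k candidates: exact search is cheaper than a graph search.
      return Outcome.EXACT;
    }
    if (approximateCompleted) {
      return Outcome.APPROXIMATE;
    }
    // The approximate search hit its visit limit, so fall back to exact search.
    return Outcome.EXACT;
  }
}
```

Each condition sits at the same indentation level, so there is also a natural place for the explanatory comments next to each branch.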



##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -121,36 +120,50 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
+  private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IOException {
+    Bits liveDocs = ctx.reader().getLiveDocs();
+    int maxDoc = ctx.reader().maxDoc();
 
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+    if (filterWeight == null) {
+      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
     } else {
-      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
-      if (filterIterator == null || filterIterator.cost() == 0) {
+      Scorer scorer = filterWeight.scorer(ctx);
+      if (scorer == null) {
         return NO_RESULTS;
-      }
+      } else {
+        BitSetIterator filterIterator =
+            cacheIntoBitSetIterator(scorer.iterator(), liveDocs, maxDoc);
 
-      if (filterIterator.cost() <= k) {
-        // If there are <= k possible matches, short-circuit and perform exact search, since HNSW
-        // must always visit at least k documents
-        return exactSearch(ctx, filterIterator);
+        if (filterIterator.cost() <= k) {
+          return exactSearch(ctx, filterIterator);
+        }
+        TopDocs results =
+            approximateSearch(ctx, filterIterator.getBitSet(), (int) filterIterator.cost());
+        if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
+          return results;
+        } else {
+          return exactSearch(ctx, filterIterator);
+        }
       }
+    }
+  }
 
-      // Perform the approximate kNN search
-      Bits acceptDocs =
-          filterIterator.getBitSet(); // The filter iterator already incorporates live docs
-      int visitedLimit = (int) filterIterator.cost();
-      TopDocs results = approximateSearch(ctx, acceptDocs, visitedLimit);
-      if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
-        return results;
-      } else {
-        // We stopped the kNN search because it visited too many nodes, so fall back to exact search
-        return exactSearch(ctx, filterIterator);
-      }
+  private BitSetIterator cacheIntoBitSetIterator(

Review Comment:
   Small comment: maybe it would be clearer to return a `BitSet` here? Then we
could just take the cardinality in the calling method. It was actually a little
hacky before that we always used `BitSetIterator#cost` to represent the
cardinality (my fault!), since cost is a separate concept from cardinality.
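
A rough sketch of what that could look like, under stated assumptions: `FilterBitSetSketch`, `collectIntoBitSet`, and `shouldUseExactSearch` are hypothetical names, and `java.util.BitSet`, `PrimitiveIterator.OfInt`, and `IntPredicate` stand in for Lucene's `BitSet`, `DocIdSetIterator`, and `Bits` so the example is self-contained. The helper only builds the bit set; the caller asks for the exact cardinality instead of relying on an iterator cost.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.function.IntPredicate;

public class FilterBitSetSketch {
  /**
   * Collects matching doc IDs into a bit set, skipping deleted docs.
   * A null liveDocs means the segment has no deletions (assumption: this
   * mirrors how a null live-docs value is treated in the diff above).
   */
  static BitSet collectIntoBitSet(PrimitiveIterator.OfInt docs, IntPredicate liveDocs, int maxDoc) {
    BitSet bits = new BitSet(maxDoc);
    while (docs.hasNext()) {
      int doc = docs.nextInt();
      if (liveDocs == null || liveDocs.test(doc)) {
        bits.set(doc);
      }
    }
    return bits;
  }

  /** The caller takes the exact cardinality rather than an estimated cost. */
  static boolean shouldUseExactSearch(BitSet acceptDocs, int k) {
    return acceptDocs.cardinality() <= k;
  }
}
```

For example, `collectIntoBitSet(IntStream.of(1, 3, 4, 7).iterator(), doc -> doc != 4, 10)` yields a bit set whose cardinality the caller can compare against `k` directly.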


