[GitHub] [lucene] jtibshirani commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

GitBox Mon, 13 Jun 2022 11:25:35 -0700


jtibshirani commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r896012687



##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, 
Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
 
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }
+          cost = (int) iterator.cost();

Review Comment:
   This changes the meaning of `cost` (which is directly used as 
`visitedLimit`). Before we were using the exact number of matches, whereas now 
we ask the iterator for a cost estimation. These cost estimates are sometimes 
very imprecise, and I worry it could make the query performance unpredictable 
and harder to understand.
   
   It wonder if we could convert everything to a `BitSet` and then use the 
actual cardinality. Hopefully we could do this while still keeping the nice 
performance improvement?



##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -121,35 +140,15 @@ public Query rewrite(IndexReader reader) throws 
IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector 
filterCollector)
-      throws IOException {
-
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+  private TopDocs searchLeaf(LeafReaderContext ctx, Bits acceptDocs, int cost) 
throws IOException {
+    TopDocs results = approximateSearch(ctx, acceptDocs, cost);

Review Comment:
   The new logic here drops this check -- could we make sure to keep it?
   ```
         if (filterIterator.cost() <= k) {
           // If there are <= k possible matches, short-circuit and perform 
exact search, since HNSW
           // must always visit at least k documents
           return exactSearch(ctx, filterIterator);
         }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

Reply via email to