[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

GitBox Thu, 23 Jun 2022 07:05:59 -0700


kaivalnp commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r905065952



##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, 
Query filter) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
 
-    BitSetCollector filterCollector = null;
+    Weight filterWeight = null;
     if (filter != null) {
-      filterCollector = new BitSetCollector(reader.leaves().size());
       IndexSearcher indexSearcher = new IndexSearcher(reader);
       BooleanQuery booleanQuery =
           new BooleanQuery.Builder()
               .add(filter, BooleanClause.Occur.FILTER)
               .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
               .build();
-      indexSearcher.search(booleanQuery, filterCollector);
+      Query rewritten = indexSearcher.rewrite(booleanQuery);
+      filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
     }
 
     for (LeafReaderContext ctx : reader.leaves()) {
-      TopDocs results = searchLeaf(ctx, filterCollector);
+      Bits acceptDocs;
+      int cost;
+      if (filterWeight != null) {
+        Scorer scorer = filterWeight.scorer(ctx);
+        if (scorer != null) {
+          DocIdSetIterator iterator = scorer.iterator();
+          if (iterator instanceof BitSetIterator) {
+            acceptDocs = ((BitSetIterator) iterator).getBitSet();
+          } else {
+            acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+          }
+          cost = (int) iterator.cost();

Review Comment:
   The approach you mentioned makes sense, and I have added it into the latest 
commit.
   
   But I feel we are missing out on some performance because of inaccurate 
estimation for `approximateSearch`. We could try @jpountz's suggestion of 
pro-rating the cost of the `BitSet` (since we only want some estimation), and 
performing `approximateSearch` on that. Something like:
   
   ```java
   Bits acceptDocs, liveDocs = ctx.reader().getLiveDocs();
   int maxDoc = ctx.reader().maxDoc(), cost;
   DocIdSetIterator iterator = scorer.iterator();
   if (iterator instanceof BitSetIterator bitSetIterator) {
     BitSet bitSet = bitSetIterator.getBitSet();
     acceptDocs = new Bits() {
       @Override
       public boolean get(int index) {
         return bitSet.get(index) && (liveDocs == null || liveDocs.get(index));
       }
   
       @Override
       public int length() {
         return maxDoc;
       }
     };
     cost = bitSet.cardinality() * ctx.reader().numDocs() / maxDoc;
     TopDocs results = approximateSearch(ctx, acceptDocs, cost);
     if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
       return results;
     } else {
       FilteredDocIdSetIterator filterIterator = new 
FilteredDocIdSetIterator(iterator) {
         @Override
         protected boolean match(int doc) {
           return liveDocs == null || liveDocs.get(doc);
         }
       };
       return exactSearch(ctx, filterIterator);
     }
   }
   ```
   
   This way we won't traverse the `BitSet` even for deletions. If the limit is 
reached, we anyways have to iterate over the `iterator` to build a `BitSet`, so 
instead we can pass the filtered `iterator` to `exactSearch`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] kaivalnp commented on a diff in pull request #951: LUCENE-10606: Optimize Prefilter Hit Collection

Reply via email to