gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1060869958


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
        // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();
+              }
+            };
+        DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);

Review Comment:
   minor: Could we define `otherTerms` closer to where it first gets used? (e.g., L:207)
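
   A rough sketch of what I mean (only identifiers already in this hunk; just moving the declaration down next to the loop that first calls `otherTerms.add(...)`):

   ```java
   // declared immediately before the term-collection loop that first uses it
   DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);
   do {
     // ... existing loop body that feeds otherTerms.add(dropped) ...
   } while (termsEnum.next() != null);
   ```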



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -211,32 +218,39 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
                 new ConstantScoreQuery(
                    new TermQuery(new Term(query.field, termsEnum.term()), termStates));
            Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-            return new WeightOrDocIdSet(weight);
+            return new WeightOrDocIdSetIterator(weight);
           }
-          builder.add(docs);
+          PostingsEnum dropped = highFrequencyTerms.insertWithOverflow(postings);
+          otherTerms.add(dropped);
+          postings = dropped;
         } while (termsEnum.next() != null);
 
-        return new WeightOrDocIdSet(builder.build());
+        List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
+        for (PostingsEnum pe : highFrequencyTerms) {
+          disis.add(pe);
+        }
+        disis.add(otherTerms.build().iterator());
+        DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
+        for (DocIdSetIterator disi : disis) {
+          subs.add(new DisiWrapper(disi));
+        }

Review Comment:
   Maybe I'm overlooking something silly, but can't we just do one pass like this?
   
   ```suggestion
        DisiPriorityQueue subs = new DisiPriorityQueue(highFrequencyTerms.size() + 1);
        for (DocIdSetIterator disi : highFrequencyTerms) {
          subs.add(new DisiWrapper(disi));
        }
        subs.add(new DisiWrapper(otherTerms.build().iterator()));
   ```



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
        // Too many terms: go back to the terms we already collected and start building the bit set

Review Comment:
   Can we update the comments to more accurately reflect the new logic? We don't really start building the bit set here.
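
   Perhaps something along these lines (just a sketch based on the new PQ-plus-builder split visible in this hunk):

   ```java
   // Too many terms: keep the postings for the highest-cost terms in a priority
   // queue and accumulate the remaining terms into a DocIdSetBuilder.
   ```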



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -211,32 +218,39 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
                 new ConstantScoreQuery(
                    new TermQuery(new Term(query.field, termsEnum.term()), termStates));
            Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-            return new WeightOrDocIdSet(weight);
+            return new WeightOrDocIdSetIterator(weight);
           }
-          builder.add(docs);
+          PostingsEnum dropped = highFrequencyTerms.insertWithOverflow(postings);
+          otherTerms.add(dropped);
+          postings = dropped;
         } while (termsEnum.next() != null);
 
-        return new WeightOrDocIdSet(builder.build());
+        List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
+        for (PostingsEnum pe : highFrequencyTerms) {
+          disis.add(pe);
+        }
+        disis.add(otherTerms.build().iterator());
+        DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
+        for (DocIdSetIterator disi : disis) {
+          subs.add(new DisiWrapper(disi));
+        }

Review Comment:
   Also, it would be nice if we could get direct access to the underlying array backing `highFrequencyTerms`, then we could leverage `DisiPriorityQueue#addAll` to heapify everything at once.
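
   Roughly like this (a sketch only; it assumes we expose the protected `PriorityQueue#getHeapArray()` through a small subclass, and it relies on slot 0 of Lucene's heap array being unused):

   ```java
   // Hypothetical accessor: a PriorityQueue subclass making getHeapArray() visible.
   Object[] heap = highFrequencyTerms.getHeapArray();
   DisiWrapper[] wrappers = new DisiWrapper[highFrequencyTerms.size() + 1];
   for (int i = 0; i < highFrequencyTerms.size(); i++) {
     // Heap entries start at index 1; order doesn't matter since addAll re-heapifies.
     wrappers[i] = new DisiWrapper((PostingsEnum) heap[i + 1]);
   }
   wrappers[wrappers.length - 1] = new DisiWrapper(otherTerms.build().iterator());
   DisiPriorityQueue subs = new DisiPriorityQueue(wrappers.length);
   subs.addAll(wrappers, 0, wrappers.length); // one O(n) heapify instead of n inserts
   ```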


