[GitHub] [lucene] rmuir commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

GitBox Sun, 01 Jan 2023 15:58:17 -0800


rmuir commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1059807197



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();
+              }
+            };
+        DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);
         if (collectedTerms.isEmpty() == false) {
           TermsEnum termsEnum2 = terms.iterator();
           for (TermAndState t : collectedTerms) {
             termsEnum2.seekExact(t.term, t.state);
-            docs = termsEnum2.postings(docs, PostingsEnum.NONE);
-            builder.add(docs);
+            PostingsEnum postings = termsEnum2.postings(null, PostingsEnum.NONE);
+            highFrequencyTerms.add(postings);

Review Comment:
   Rather than blindly adding every term to the PQ, should we have a constant
minimum `cost` threshold (e.g. 256, 1024, whatever) for a term to even be
considered? Anything below it would go directly to `otherTerms`. The skipping
isn't going to be useful for the long tail of low-cost terms (the majority, if
we are thinking Zipf). Ideally we wouldn't waste our time on a term unless it
has skip data. And we want to be careful about the performance of these queries
when there are jazillions of matching low-frequency terms.
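
   A minimal sketch of the threshold idea, not the PR's code: `TermCost`,
`Partition`, `partition` and the `minCost` parameter are hypothetical names, a
plain `long` stands in for `PostingsEnum.cost()`, and `java.util.PriorityQueue`
stands in for Lucene's own `PriorityQueue`. Terms at or above the threshold
become candidates for skipping; everything else goes straight to the list that
plays the role of `otherTerms`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class CostThresholdSketch {

  // Hypothetical stand-in for a PostingsEnum: only the cost estimate matters here.
  static final class TermCost {
    final String term;
    final long cost;

    TermCost(String term, long cost) {
      this.term = term;
      this.cost = cost;
    }
  }

  // Result of the split: a min-heap of high-cost terms plus the long tail.
  static final class Partition {
    final PriorityQueue<TermCost> highFrequencyTerms =
        new PriorityQueue<>(Comparator.comparingLong((TermCost t) -> t.cost));
    final List<TermCost> otherTerms = new ArrayList<>();
  }

  // Only terms whose cost reaches minCost are offered to the queue;
  // the rest bypass it entirely, so the queue stays small.
  static Partition partition(List<TermCost> collectedTerms, long minCost) {
    Partition p = new Partition();
    for (TermCost t : collectedTerms) {
      if (t.cost >= minCost) {
        p.highFrequencyTerms.add(t);
      } else {
        p.otherTerms.add(t);
      }
    }
    return p;
  }

  public static void main(String[] args) {
    List<TermCost> terms =
        Arrays.asList(
            new TermCost("rare", 3),
            new TermCost("medium", 1_024),
            new TermCost("common", 50_000));
    // 1024 is an assumed value; the comment above only floats "256, 1024, whatever".
    Partition p = partition(terms, 1_024);
    System.out.println(
        p.highFrequencyTerms.size()
            + " term(s) kept for skipping, "
            + p.otherTerms.size()
            + " sent straight to the bit set");
  }
}
```

   With a check like this the queue only ever holds terms that are likely to
carry skip data, so its size is bounded by the number of high-cost terms rather
than by the total number of matching terms.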



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org