Re: [PR] Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. [lucene]

via GitHub Fri, 23 May 2025 06:11:03 -0700


gf2121 commented on code in PR #14701:
URL: https://github.com/apache/lucene/pull/14701#discussion_r2104492611



##########
lucene/core/src/java/org/apache/lucene/search/ScorerUtil.java:
##########
@@ -117,4 +119,69 @@ public int length() {
       return in.length();
     }
   }
+
+  static void filterCompetitiveHits(
+      DocAndScoreAccBuffer buffer,
+      double maxRemainingScore,
+      float minCompetitiveScore,
+      int numScorers) {
+    int newSize = 0;
+    for (int i = 0; i < buffer.size; ++i) {
+      float maxPossibleScore =
+          (float) MathUtil.sumUpperBound(buffer.scores[i] + maxRemainingScore, 
numScorers);
+      if (maxPossibleScore >= minCompetitiveScore) {
+        buffer.docs[newSize] = buffer.docs[i];
+        buffer.scores[newSize] = buffer.scores[i];
+        newSize++;
+      }
+    }
+    buffer.size = newSize;
+  }
+
+  /**
+   * Apply the provided {@link Scorable} as a required clause on the given 
{@link
+   * DocAndScoreAccBuffer}. This filters out documents from the buffer that do 
not match, and adds
+   * the scores of this {@link Scorable} to the scores.
+   *
+   * <p><b>NOTE</b>: The provided buffer must contain doc IDs in sorted order, 
with no duplicates.
+   */
+  static void applyRequiredClause(
+      DocAndScoreAccBuffer buffer, DocIdSetIterator iterator, Scorable 
scorable)
+      throws IOException {
+    int intersectionSize = 0;
+    int curDoc = iterator.docID();
+    for (int i = 0; i < buffer.size; ++i) {
+      int targetDoc = buffer.docs[i];
+      if (curDoc < targetDoc) {
+        curDoc = iterator.advance(targetDoc);
+      }
+      if (curDoc == targetDoc) {

Review Comment:
   Out of curiosity, could `VectorUtil#findNextGTE` help here?



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -55,9 +55,6 @@ final class BlockMaxConjunctionBulkScorer extends BulkScorer {
     this.iterators =
         
Arrays.stream(this.scorers).map(Scorer::iterator).toArray(DocIdSetIterator[]::new);
     lead1 = ScorerUtil.likelyImpactsEnum(iterators[0]);
-    lead2 = ScorerUtil.likelyImpactsEnum(iterators[1]);
-    scorer1 = this.scorables[0];
-    scorer2 = this.scorables[1];
     this.sumOfOtherClauses = new double[this.scorers.length];
     for (int i = 0; i < sumOfOtherClauses.length; i++) {

Review Comment:
   Not introduced in this PR, but maybe change this to a `Arrays.fill` by the 
way :)



##########
lucene/core/src/java/org/apache/lucene/search/MaxScoreBulkScorer.java:
##########
@@ -394,31 +365,22 @@ void updateMaxWindowScores(int windowMin, int windowMax) 
throws IOException {
   }
 
   private void scoreNonEssentialClauses(
-      LeafCollector collector, int doc, double essentialScore, int 
numNonEssentialClauses)
+      LeafCollector collector, DocAndScoreAccBuffer buffer, int 
numNonEssentialClauses)
       throws IOException {
+    numCandidates += buffer.size;
 
-    ++numCandidates;
-
-    double score = essentialScore;
     for (int i = numNonEssentialClauses - 1; i >= 0; --i) {
-      float maxPossibleScore =
-          (float) MathUtil.sumUpperBound(score + maxScoreSums[i], 
allScorers.length);
-      if (maxPossibleScore < minCompetitiveScore) {
-        // Hit is not competitive.
-        return;
-      }
-
       DisiWrapper scorer = allScorers[i];
-      if (scorer.doc < doc) {
-        scorer.doc = scorer.iterator.advance(doc);
-      }
-      if (scorer.doc == doc) {
-        score += scorer.scorable.score();
-      }
+      ScorerUtil.filterCompetitiveHits(

Review Comment:
   Can we skip this if `minCompetitiveScore == 0f`, like other place?



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -38,11 +37,12 @@ final class BlockMaxConjunctionBulkScorer extends 
BulkScorer {
   private final Scorer[] scorers;
   private final Scorable[] scorables;
   private final DocIdSetIterator[] iterators;
-  private final DocIdSetIterator lead1, lead2;
-  private final Scorable scorer1, scorer2;
+  private final DocIdSetIterator lead1;

Review Comment:
   Maybe just name it `lead` since we do not have `lead2` now.



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -118,98 +115,36 @@ private void scoreWindow(
       return;
     }
 
-    Scorable scorer1 = this.scorer1;
-    if (scorers[0].getMaxScore(max - 1) == 0f) {
-      // Null out scorer1 if it may only produce 0 scores over this window. In 
practice, this is
-      // mostly useful because FILTER clauses are pushed as constant-scoring 
MUST clauses with a
-      // 0 score to this scorer. Setting it to null instead of using a 
different impl helps
-      // reduce polymorphism of calls to Scorable#score and skip the check of 
whether the leading
-      // clause produced a high-enough score for the doc to be competitive.
-      scorer1 = null;
-    }
-
-    final double sumOfOtherMaxScoresAt1 = sumOfOtherClauses[1];
-
-    advanceHead:
-    for (int doc = lead1.docID(); doc < max; ) {
-      if (acceptDocs != null && acceptDocs.get(doc) == false) {
-        doc = lead1.nextDoc();
-        continue;
-      }
+    for (scorers[0].nextDocsAndScores(max, acceptDocs, docAndScoreBuffer);
+        docAndScoreBuffer.size > 0;
+        scorers[0].nextDocsAndScores(max, acceptDocs, docAndScoreBuffer)) {
 
-      // Compute the score as we find more matching clauses, in order to skip 
advancing other
-      // clauses if the total score has no chance of being competitive. This 
works well because
-      // computing a score is usually cheaper than decoding a full block of 
postings and
-      // frequencies.
-      final boolean hasMinCompetitiveScore = scorable.minCompetitiveScore > 0;
-      double currentScore;
-      if (scorer1 != null && hasMinCompetitiveScore) {
-        currentScore = scorer1.score();
-
-        // This is the same logic as in the below for loop, specialized for 
the 2nd least costly
-        // clause. This seems to help the JVM.
-
-        // First check if we have a chance of having a match based on max 
scores
-        if ((float) MathUtil.sumUpperBound(currentScore + 
sumOfOtherMaxScoresAt1, scorers.length)
-            < scorable.minCompetitiveScore) {
-          doc = lead1.nextDoc();
-          continue advanceHead;
-        }
-      } else {
-        currentScore = 0;
-      }
-
-      // NOTE: lead2 may be on `doc` already if we `continue`d on the previous 
loop iteration.
-      if (lead2.docID() < doc) {
-        int next = lead2.advance(doc);
-        if (next != doc) {
-          doc = lead1.advance(next);
-          continue advanceHead;
-        }
-      }
-      assert lead2.docID() == doc;
-      if (hasMinCompetitiveScore) {
-        currentScore += scorer2.score();
-      }
+      docAndScoreAccBuffer.copyFrom(docAndScoreBuffer);
 
-      for (int i = 2; i < iterators.length; ++i) {
-        // First check if we have a chance of having a match based on max 
scores
-        if (hasMinCompetitiveScore
-            && (float) MathUtil.sumUpperBound(currentScore + 
sumOfOtherClauses[i], scorers.length)
-                < scorable.minCompetitiveScore) {
-          doc = lead1.nextDoc();
-          continue advanceHead;
+      for (int i = 1; i < scorers.length; ++i) {
+        if (scorable.minCompetitiveScore > 0) {
+          ScorerUtil.filterCompetitiveHits(
+              docAndScoreAccBuffer,
+              sumOfOtherClauses[i],
+              scorable.minCompetitiveScore,
+              scorers.length);
         }
 
-        // NOTE: these iterators may be on `doc` already if we called 
`continue advanceHead` on the
-        // previous loop iteration.
-        if (iterators[i].docID() < doc) {
-          int next = iterators[i].advance(doc);
-          if (next != doc) {
-            doc = lead1.advance(next);
-            continue advanceHead;
-          }
-        }
-        assert iterators[i].docID() == doc;
-        if (hasMinCompetitiveScore) {
-          currentScore += scorables[i].score();
-        }
+        ScorerUtil.applyRequiredClause(docAndScoreAccBuffer, iterators[i], 
scorables[i]);
       }
 
-      if (hasMinCompetitiveScore == false) {
-        for (Scorable scorer : scorables) {
-          currentScore += scorer.score();
-        }
-      }
-      scorable.score = (float) currentScore;
-      collector.collect(doc);
-      // The collect() call may have updated the minimum competitive score.
-      if (maxWindowScore < scorable.minCompetitiveScore) {
-        // no more hits are competitive
-        return;
+      for (int i = 0; i < docAndScoreAccBuffer.size; ++i) {
+        scorable.score = (float) docAndScoreAccBuffer.scores[i];
+        collector.collect(docAndScoreAccBuffer.docs[i]);
       }
+    }
 
-      doc = lead1.nextDoc();
+    int maxOtherDoc = -1;
+    for (int i = 0; i < iterators.length; ++i) {

Review Comment:
   Not important but since we are naming `maxOtherDoc` let's start `i` from 1?



##########
lucene/core/src/java/org/apache/lucene/search/BlockMaxConjunctionBulkScorer.java:
##########
@@ -118,98 +115,36 @@ private void scoreWindow(
       return;
     }
 
-    Scorable scorer1 = this.scorer1;
-    if (scorers[0].getMaxScore(max - 1) == 0f) {

Review Comment:
   I was wondering why the `FilteredAnd` tasks, which use 
`BlockMaxConjunctionBulkScorer`, are getting more slow down than other tasks. 
Could the removal of this special optimization be the reason? Do we intend to 
remove this for simplification?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. [lucene]

Reply via email to