Re: [PR] Re-use information from graph traversal during exact search [lucene]

via GitHub Mon, 20 Nov 2023 10:35:27 -0800


vigyasharma commented on code in PR #12820:
URL: https://github.com/apache/lucene/pull/12820#discussion_r1399533915



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -155,14 +159,20 @@ protected boolean match(int doc) {
     }
   }
 
-  protected abstract TopDocs approximateSearch(
-      LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws 
IOException;
+  protected KnnCollector getCollector(LeafReaderContext context, int 
visitLimit)

Review Comment:
   Why do we need `context` here? We don't seem to be using it.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -109,32 +109,36 @@ private TopDocs getLeafResults(LeafReaderContext ctx, 
Weight filterWeight) throw
     Bits liveDocs = ctx.reader().getLiveDocs();
     int maxDoc = ctx.reader().maxDoc();
 
+    Bits acceptDocs;
+    int cost;
     if (filterWeight == null) {
-      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
+      acceptDocs = liveDocs;
+      cost = Integer.MAX_VALUE;
+    } else {
+      Scorer scorer = filterWeight.scorer(ctx);
+      if (scorer == null) {
+        return NO_RESULTS;
+      }
+      acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
+      cost = ((BitSet) acceptDocs).cardinality();
     }
 
-    Scorer scorer = filterWeight.scorer(ctx);
-    if (scorer == null) {
+    KnnCollector collector = getCollector(ctx, cost);
+    if (collector == null) {
       return NO_RESULTS;
     }
 
-    BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
-    int cost = acceptDocs.cardinality();
-
-    if (cost <= k) {
-      // If there are <= k possible matches, short-circuit and perform exact 
search, since HNSW
-      // must always visit at least k documents
-      return exactSearch(ctx, new BitSetIterator(acceptDocs, cost));
+    if (cost > k) {
+      // Perform the approximate kNN search
+      approximateSearch(ctx, acceptDocs, collector);
     }
 
-    // Perform the approximate kNN search
-    TopDocs results = approximateSearch(ctx, acceptDocs, cost);
-    if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
-      return results;
-    } else {
-      // We stopped the kNN search because it visited too many nodes, so fall 
back to exact search
-      return exactSearch(ctx, new BitSetIterator(acceptDocs, cost));
+    if ((cost <= k || collector.earlyTerminated()) && acceptDocs instanceof 
BitSet bitSet) {
+      // If there are <= k possible matches, or we stopped the kNN search 
because it visited too
+      // many nodes, fall back to exact search
+      return exactSearch(ctx, new BitSetIterator(bitSet, cost), collector);

Review Comment:
   Nice! This should save us from recomputing similarity on docs that we know 
will get rejected later.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -155,14 +159,20 @@ protected boolean match(int doc) {
     }
   }
 
-  protected abstract TopDocs approximateSearch(
-      LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws 
IOException;
+  protected KnnCollector getCollector(LeafReaderContext context, int 
visitLimit)

Review Comment:
   Ah I see we have trouble with `DiversifyingNearestChildrenKnnCollector` 
which needs a parent filter bitset to initialize. So approximateSearch() tends 
to create its own collectors. But our problem here is we want to share the 
collector between two different types of searches. I get the workaround you 
have here, but it feels a little trappy to me. Sorry I don't have a good 
solution either, I need to think about this more deeply... It might be a bigger 
change.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -171,33 +181,23 @@ protected TopDocs exactSearch(LeafReaderContext context, 
DocIdSetIterator accept
     }
 
     VectorScorer vectorScorer = createVectorScorer(context, fi);
-    HitQueue queue = new HitQueue(k, true);
-    ScoreDoc topDoc = queue.top();
     int doc;
     while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      if (collector.getVisited(doc)) {
+        continue;
+      }
+
       boolean advanced = vectorScorer.advanceExact(doc);
       assert advanced;
 
       float score = vectorScorer.score();
-      if (score > topDoc.score) {
-        topDoc.score = score;
-        topDoc.doc = doc;
-        topDoc = queue.updateTop();
+      if (score > collector.minCompetitiveSimilarity()) {
+        collector.collect(doc, score);
       }
     }
 
-    // Remove any remaining sentinel values
-    while (queue.size() > 0 && queue.top().score < 0) {
-      queue.pop();
-    }
-
-    ScoreDoc[] topScoreDocs = new ScoreDoc[queue.size()];
-    for (int i = topScoreDocs.length - 1; i >= 0; i--) {
-      topScoreDocs[i] = queue.pop();
-    }
-
     TotalHits totalHits = new TotalHits(acceptIterator.cost(), 
TotalHits.Relation.EQUAL_TO);
-    return new TopDocs(totalHits, topScoreDocs);
+    return new TopDocs(totalHits, collector.topDocs().scoreDocs);

Review Comment:
   Ah, so we cannot return `collector.topDocs()` like we do [here in 
LeafReader](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L259),
 because the collector has actually visited more nodes than the filter. Good 
catch!



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -171,33 +181,23 @@ protected TopDocs exactSearch(LeafReaderContext context, 
DocIdSetIterator accept
     }
 
     VectorScorer vectorScorer = createVectorScorer(context, fi);
-    HitQueue queue = new HitQueue(k, true);
-    ScoreDoc topDoc = queue.top();
     int doc;
     while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      if (collector.getVisited(doc)) {
+        continue;
+      }
+
       boolean advanced = vectorScorer.advanceExact(doc);
       assert advanced;
 
       float score = vectorScorer.score();
-      if (score > topDoc.score) {
-        topDoc.score = score;
-        topDoc.doc = doc;
-        topDoc = queue.updateTop();
+      if (score > collector.minCompetitiveSimilarity()) {

Review Comment:
   I think using `minCompetitiveSimilarity()` here makes sense. Worth noting 
that once we have https://github.com/apache/lucene/pull/12794, `exactSearch()` 
will start competing with global competitive scores, something that we don't 
currently do. 
   I don't think it should be a problem though, in fact, it seems like the 
right behavior.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -109,32 +109,36 @@ private TopDocs getLeafResults(LeafReaderContext ctx, 
Weight filterWeight) throw
     Bits liveDocs = ctx.reader().getLiveDocs();
     int maxDoc = ctx.reader().maxDoc();
 
+    Bits acceptDocs;
+    int cost;
     if (filterWeight == null) {
-      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
+      acceptDocs = liveDocs;
+      cost = Integer.MAX_VALUE;

Review Comment:
   I see you're trying to use `acceptDocs` across the different search calls 
here, but I wonder if this particular instance is worth changing. It requires 
us to use `Bits acceptDocs` instead of a `Bitset`, due to which you have to 
upcast it later in this method.
   
   Since we'll always do approximateSearch when filterWeight is null (there can 
be no exact search), can we leave the `return approximateSearch(ctx, liveDocs, 
Integer.MAX_VALUE);` as is? 



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -109,32 +109,36 @@ private TopDocs getLeafResults(LeafReaderContext ctx, 
Weight filterWeight) throw
     Bits liveDocs = ctx.reader().getLiveDocs();
     int maxDoc = ctx.reader().maxDoc();
 
+    Bits acceptDocs;
+    int cost;
     if (filterWeight == null) {
-      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
+      acceptDocs = liveDocs;
+      cost = Integer.MAX_VALUE;
+    } else {
+      Scorer scorer = filterWeight.scorer(ctx);
+      if (scorer == null) {
+        return NO_RESULTS;
+      }
+      acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
+      cost = ((BitSet) acceptDocs).cardinality();
     }
 
-    Scorer scorer = filterWeight.scorer(ctx);
-    if (scorer == null) {
+    KnnCollector collector = getCollector(ctx, cost);
+    if (collector == null) {

Review Comment:
   Can we get a `null` collector here?



##########
lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java:
##########
@@ -76,11 +73,11 @@ public KnnFloatVectorQuery(String field, float[] target, 
int k, Query filter) {
   }
 
   @Override
-  protected TopDocs approximateSearch(LeafReaderContext context, Bits 
acceptDocs, int visitedLimit)
-      throws IOException {
-    TopDocs results =
-        context.reader().searchNearestVectors(field, target, k, acceptDocs, 
visitedLimit);
-    return results != null ? results : NO_RESULTS;
+  protected void approximateSearch(
+      LeafReaderContext context, Bits acceptDocs, KnnCollector collector) 
throws IOException {
+    if (collector != null) {

Review Comment:
   When would we get a `null` collector here?



##########
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##########
@@ -158,59 +95,4 @@ public int hashCode() {
     result = 31 * result + Arrays.hashCode(query);
     return result;
   }
-
-  private static class ParentBlockJoinByteVectorScorer {

Review Comment:
   Do we not need this code anymore?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Re-use information from graph traversal during exact search [lucene]

Reply via email to