vigyasharma commented on code in PR #12820: URL: https://github.com/apache/lucene/pull/12820#discussion_r1399533915
########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -155,14 +159,20 @@ protected boolean match(int doc) { } } - protected abstract TopDocs approximateSearch( - LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException; + protected KnnCollector getCollector(LeafReaderContext context, int visitLimit) Review Comment: Why do we need `context` here? We don't seem to be using it. ########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -109,32 +109,36 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight filterWeight) throw Bits liveDocs = ctx.reader().getLiveDocs(); int maxDoc = ctx.reader().maxDoc(); + Bits acceptDocs; + int cost; if (filterWeight == null) { - return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE); + acceptDocs = liveDocs; + cost = Integer.MAX_VALUE; + } else { + Scorer scorer = filterWeight.scorer(ctx); + if (scorer == null) { + return NO_RESULTS; + } + acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc); + cost = ((BitSet) acceptDocs).cardinality(); } - Scorer scorer = filterWeight.scorer(ctx); - if (scorer == null) { + KnnCollector collector = getCollector(ctx, cost); + if (collector == null) { return NO_RESULTS; } - BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc); - int cost = acceptDocs.cardinality(); - - if (cost <= k) { - // If there are <= k possible matches, short-circuit and perform exact search, since HNSW - // must always visit at least k documents - return exactSearch(ctx, new BitSetIterator(acceptDocs, cost)); + if (cost > k) { + // Perform the approximate kNN search + approximateSearch(ctx, acceptDocs, collector); } - // Perform the approximate kNN search - TopDocs results = approximateSearch(ctx, acceptDocs, cost); - if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) { - return results; - } else { - // We stopped the kNN search because it visited too many nodes, so fall back to exact search - return exactSearch(ctx, new BitSetIterator(acceptDocs, cost)); + if ((cost <= k || collector.earlyTerminated()) && acceptDocs instanceof BitSet bitSet) { + // If there are <= k possible matches, or we stopped the kNN search because it visited too + // many nodes, fall back to exact search + return exactSearch(ctx, new BitSetIterator(bitSet, cost), collector); Review Comment: Nice! This should save us from recomputing similarity on docs that we know will get rejected later. ########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -155,14 +159,20 @@ protected boolean match(int doc) { } } - protected abstract TopDocs approximateSearch( - LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException; + protected KnnCollector getCollector(LeafReaderContext context, int visitLimit) Review Comment: Ah I see we have trouble with `DiversifyingNearestChildrenKnnCollector` which needs a parent filter bitset to initialize. So approximateSearch() tends to create its own collectors. But our problem here is we want to share the collector between two different types of searches. I get the workaround you have here, but it feels a little trappy to me. Sorry I don't have a good solution either, I need to think about this more deeply... It might be a bigger change. ########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -171,33 +181,23 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept } VectorScorer vectorScorer = createVectorScorer(context, fi); - HitQueue queue = new HitQueue(k, true); - ScoreDoc topDoc = queue.top(); int doc; while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { + if (collector.getVisited(doc)) { + continue; + } + boolean advanced = vectorScorer.advanceExact(doc); assert advanced; float score = vectorScorer.score(); - if (score > topDoc.score) { - topDoc.score = score; - topDoc.doc = doc; - topDoc = queue.updateTop(); + if (score > collector.minCompetitiveSimilarity()) { + collector.collect(doc, score); } } - // Remove any remaining sentinel values - while (queue.size() > 0 && queue.top().score < 0) { - queue.pop(); - } - - ScoreDoc[] topScoreDocs = new ScoreDoc[queue.size()]; - for (int i = topScoreDocs.length - 1; i >= 0; i--) { - topScoreDocs[i] = queue.pop(); - } - TotalHits totalHits = new TotalHits(acceptIterator.cost(), TotalHits.Relation.EQUAL_TO); - return new TopDocs(totalHits, topScoreDocs); + return new TopDocs(totalHits, collector.topDocs().scoreDocs); Review Comment: Ah, so we cannot return `collector.topDocs()` like we do [here in LeafReader](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L259), because the collector has actually visited more nodes than the filter. Good catch! ########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -171,33 +181,23 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept } VectorScorer vectorScorer = createVectorScorer(context, fi); - HitQueue queue = new HitQueue(k, true); - ScoreDoc topDoc = queue.top(); int doc; while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { + if (collector.getVisited(doc)) { + continue; + } + boolean advanced = vectorScorer.advanceExact(doc); assert advanced; float score = vectorScorer.score(); - if (score > topDoc.score) { - topDoc.score = score; - topDoc.doc = doc; - topDoc = queue.updateTop(); + if (score > collector.minCompetitiveSimilarity()) { Review Comment: I think using `minCompetitiveSimilarity()` here makes sense. Worth noting that once we have https://github.com/apache/lucene/pull/12794, `exactSearch()` will start competing with global competitive scores, something that we don't currently do. I don't think it should be a problem though, in fact, it seems like the right behavior. ########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -109,32 +109,36 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight filterWeight) throw Bits liveDocs = ctx.reader().getLiveDocs(); int maxDoc = ctx.reader().maxDoc(); + Bits acceptDocs; + int cost; if (filterWeight == null) { - return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE); + acceptDocs = liveDocs; + cost = Integer.MAX_VALUE; Review Comment: I see you're trying to use `acceptDocs` across the different search calls here, but I wonder if this particular instance is worth changing. It requires us to use `Bits acceptDocs` instead of a `Bitset`, due to which you have to upcast it later in this method. Since we'll always do approximateSearch when filterWeight is null (there can be no exact search), can we leave the `return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);` as is? ########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -109,32 +109,36 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight filterWeight) throw Bits liveDocs = ctx.reader().getLiveDocs(); int maxDoc = ctx.reader().maxDoc(); + Bits acceptDocs; + int cost; if (filterWeight == null) { - return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE); + acceptDocs = liveDocs; + cost = Integer.MAX_VALUE; + } else { + Scorer scorer = filterWeight.scorer(ctx); + if (scorer == null) { + return NO_RESULTS; + } + acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc); + cost = ((BitSet) acceptDocs).cardinality(); } - Scorer scorer = filterWeight.scorer(ctx); - if (scorer == null) { + KnnCollector collector = getCollector(ctx, cost); + if (collector == null) { Review Comment: Can we get a `null` collector here? ########## lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java: ########## @@ -76,11 +73,11 @@ public KnnFloatVectorQuery(String field, float[] target, int k, Query filter) { } @Override - protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit) - throws IOException { - TopDocs results = - context.reader().searchNearestVectors(field, target, k, acceptDocs, visitedLimit); - return results != null ? results : NO_RESULTS; + protected void approximateSearch( + LeafReaderContext context, Bits acceptDocs, KnnCollector collector) throws IOException { + if (collector != null) { Review Comment: When would we get a `null` collector here? ########## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java: ########## @@ -158,59 +95,4 @@ public int hashCode() { result = 31 * result + Arrays.hashCode(query); return result; } - - private static class ParentBlockJoinByteVectorScorer { Review Comment: Do we not need this code anymore? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org