jtibshirani commented on code in PR #951: URL: https://github.com/apache/lucene/pull/951#discussion_r906418634
##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -121,36 +120,50 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
+  private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IOException {
+    Bits liveDocs = ctx.reader().getLiveDocs();
+    int maxDoc = ctx.reader().maxDoc();
 
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+    if (filterWeight == null) {
+      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
     } else {
-      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
-      if (filterIterator == null || filterIterator.cost() == 0) {
+      Scorer scorer = filterWeight.scorer(ctx);
+      if (scorer == null) {
         return NO_RESULTS;
-      }
+      } else {
+        BitSetIterator filterIterator =
+            cacheIntoBitSetIterator(scorer.iterator(), liveDocs, maxDoc);
 
-      if (filterIterator.cost() <= k) {
-        // If there are <= k possible matches, short-circuit and perform exact search, since HNSW
-        // must always visit at least k documents
-        return exactSearch(ctx, filterIterator);
+        if (filterIterator.cost() <= k) {
+          return exactSearch(ctx, filterIterator);

Review Comment:
   Could we restore all the comments in this section? I think they're helpful in understanding the algorithm.
##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -121,36 +120,50 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
+  private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IOException {
+    Bits liveDocs = ctx.reader().getLiveDocs();
+    int maxDoc = ctx.reader().maxDoc();
 
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+    if (filterWeight == null) {
+      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
     } else {
-      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
-      if (filterIterator == null || filterIterator.cost() == 0) {
+      Scorer scorer = filterWeight.scorer(ctx);

Review Comment:
   Small suggestion, I often like to remove the "else" when the "if" statement has already returned a value. This avoids having a lot of highly nested else/if statements. This suggestion applies to a few places in this method.
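
To make the suggestion concrete, here is a minimal sketch of the same control flow written with early returns instead of nested `else` blocks, with the original comments kept in place. It is not the PR's code: `GuardClauseSketch`, `Path`, the `Supplier<BitSet>` filter, and the `nodesNeeded` parameter are hypothetical stand-ins built on the JDK's `java.util.BitSet`, chosen so the example runs without Lucene on the classpath.

```java
import java.util.BitSet;
import java.util.function.Supplier;

/** Illustrative stand-in for the control flow of searchLeaf; none of these types are Lucene's. */
final class GuardClauseSketch {

  /** Which search path the sketch takes (hypothetical, for illustration only). */
  enum Path { NO_RESULTS, APPROXIMATE, EXACT }

  /**
   * @param filter null models "filterWeight == null"; a supplier that returns null models
   *     "filterWeight.scorer(ctx) == null" (assumed stand-ins, not the real API)
   * @param k number of nearest neighbors requested
   * @param nodesNeeded how many nodes the simulated approximate search would have to visit
   */
  static Path searchLeaf(Supplier<BitSet> filter, int k, int nodesNeeded) {
    if (filter == null) {
      // No filter: approximate search over all live docs, with no visit limit
      return Path.APPROXIMATE;
    }

    BitSet acceptDocs = filter.get();
    if (acceptDocs == null) {
      // The filter cannot match anything in this segment
      return Path.NO_RESULTS;
    }

    int visitedLimit = acceptDocs.cardinality();
    if (visitedLimit <= k) {
      // If there are <= k possible matches, short-circuit and perform exact search, since HNSW
      // must always visit at least k documents
      return Path.EXACT;
    }

    // Perform the approximate kNN search (simulated: it completes only within the visit limit)
    boolean completed = nodesNeeded <= visitedLimit;
    if (completed) {
      return Path.APPROXIMATE;
    }

    // We stopped the kNN search because it visited too many nodes, so fall back to exact search
    return Path.EXACT;
  }

  public static void main(String[] args) {
    BitSet tenDocs = new BitSet();
    tenDocs.set(0, 10); // ten accepted docs

    System.out.println(searchLeaf(null, 3, 0));
    System.out.println(searchLeaf(() -> null, 3, 0));
    System.out.println(searchLeaf(() -> tenDocs, 3, 5));
    System.out.println(searchLeaf(() -> tenDocs, 3, 50));
  }
}
```

Each branch that returns ends the method, so the remaining code stays at one indentation level and every comment sits directly above the decision it explains.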
##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -121,36 +120,50 @@ public Query rewrite(IndexReader reader) throws IOException {
     return createRewrittenQuery(reader, topK);
   }
 
-  private TopDocs searchLeaf(LeafReaderContext ctx, BitSetCollector filterCollector)
-      throws IOException {
+  private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IOException {
+    Bits liveDocs = ctx.reader().getLiveDocs();
+    int maxDoc = ctx.reader().maxDoc();
 
-    if (filterCollector == null) {
-      Bits acceptDocs = ctx.reader().getLiveDocs();
-      return approximateSearch(ctx, acceptDocs, Integer.MAX_VALUE);
+    if (filterWeight == null) {
+      return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
     } else {
-      BitSetIterator filterIterator = filterCollector.getIterator(ctx.ord);
-      if (filterIterator == null || filterIterator.cost() == 0) {
+      Scorer scorer = filterWeight.scorer(ctx);
+      if (scorer == null) {
         return NO_RESULTS;
-      }
+      } else {
+        BitSetIterator filterIterator =
+            cacheIntoBitSetIterator(scorer.iterator(), liveDocs, maxDoc);
 
-      if (filterIterator.cost() <= k) {
-        // If there are <= k possible matches, short-circuit and perform exact search, since HNSW
-        // must always visit at least k documents
-        return exactSearch(ctx, filterIterator);
+        if (filterIterator.cost() <= k) {
+          return exactSearch(ctx, filterIterator);
+        }
+        TopDocs results =
+            approximateSearch(ctx, filterIterator.getBitSet(), (int) filterIterator.cost());
+        if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
+          return results;
+        } else {
+          return exactSearch(ctx, filterIterator);
+        }
       }
+    }
+  }
 
-      // Perform the approximate kNN search
-      Bits acceptDocs =
-          filterIterator.getBitSet(); // The filter iterator already incorporates live docs
-      int visitedLimit = (int) filterIterator.cost();
-      TopDocs results = approximateSearch(ctx, acceptDocs, visitedLimit);
-      if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
-        return results;
-      } else {
-        // We stopped the kNN search because it visited too many nodes, so fall back to exact search
-        return exactSearch(ctx, filterIterator);
-      }
+  private BitSetIterator cacheIntoBitSetIterator(

Review Comment:
   Small comment, maybe it's clearer to return a `BitSet` here? Then we could just take the cardinality in the calling method. It was actually a little hacky before that we were using `BitSetIterator#cost` to always represent the cardinality (my fault!), when the cost is a separate concept from cardinality.
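
A rough sketch of the cost-versus-cardinality distinction, under stated assumptions: the helper returns a bit set, and the caller computes the cardinality itself rather than reading it back from an iterator's cost. The example uses the JDK's `java.util.BitSet` and a plain `int` iterator as stand-ins for Lucene's `BitSet` and `DocIdSetIterator`; `FilterBitSetSketch`, `createBitSet`, and the sample doc IDs are hypothetical names and data, not part of the PR.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

/** Stand-in sketch: JDK types replace Lucene's BitSet and DocIdSetIterator. */
final class FilterBitSetSketch {

  /**
   * Collects the filter's matching docs into a bit set, keeping only live docs.
   * A null liveDocs means the segment has no deletions (mirroring the getLiveDocs() convention).
   */
  static BitSet createBitSet(PrimitiveIterator.OfInt filterDocs, BitSet liveDocs, int maxDoc) {
    BitSet accepted = new BitSet(maxDoc);
    while (filterDocs.hasNext()) {
      int doc = filterDocs.nextInt();
      if (liveDocs == null || liveDocs.get(doc)) {
        accepted.set(doc);
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    int maxDoc = 8;
    BitSet liveDocs = new BitSet(maxDoc);
    liveDocs.set(0, maxDoc);
    liveDocs.clear(2); // docs 2 and 5 are deleted
    liveDocs.clear(5);

    int[] filterMatches = {1, 2, 4, 5, 7};
    // An iterator's cost is only an estimate; here it counts matches before deletions apply
    long estimatedCost = filterMatches.length;

    BitSet accepted = createBitSet(IntStream.of(filterMatches).iterator(), liveDocs, maxDoc);
    // The caller takes the exact number of accepted docs directly from the bit set
    int cardinality = accepted.cardinality();

    System.out.println("estimated cost = " + estimatedCost + ", cardinality = " + cardinality);
  }
}
```

With this shape, the `<= k` check and the visit limit in the caller could both use the exact cardinality, and the cost would remain free to act purely as an estimate.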