kaivalnp commented on code in PR #951: URL: https://github.com/apache/lucene/pull/951#discussion_r905065952
########## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ########## @@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k, Query filter) { public Query rewrite(IndexReader reader) throws IOException { TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()]; - BitSetCollector filterCollector = null; + Weight filterWeight = null; if (filter != null) { - filterCollector = new BitSetCollector(reader.leaves().size()); IndexSearcher indexSearcher = new IndexSearcher(reader); BooleanQuery booleanQuery = new BooleanQuery.Builder() .add(filter, BooleanClause.Occur.FILTER) .add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER) .build(); - indexSearcher.search(booleanQuery, filterCollector); + Query rewritten = indexSearcher.rewrite(booleanQuery); + filterWeight = indexSearcher.createWeight(rewritten, ScoreMode.COMPLETE_NO_SCORES, 1f); } for (LeafReaderContext ctx : reader.leaves()) { - TopDocs results = searchLeaf(ctx, filterCollector); + Bits acceptDocs; + int cost; + if (filterWeight != null) { + Scorer scorer = filterWeight.scorer(ctx); + if (scorer != null) { + DocIdSetIterator iterator = scorer.iterator(); + if (iterator instanceof BitSetIterator) { + acceptDocs = ((BitSetIterator) iterator).getBitSet(); + } else { + acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc()); + } + cost = (int) iterator.cost(); Review Comment: The approach you mentioned makes sense, and I have added it into the latest commit. But I feel we are missing out on some performance because of inaccurate estimation for `approximateSearch`. We could try @jpountz's suggestion of pro-rating the cost of the `BitSet` (since we only want some estimation), and performing `approximateSearch` on that. Something like: ```java Bits acceptDocs, liveDocs = ctx.reader().getLiveDocs(); int maxDoc = ctx.reader().maxDoc(), cost; DocIdSetIterator iterator = scorer.iterator(); if (iterator instanceof BitSetIterator bitSetIterator) { BitSet bitSet = bitSetIterator.getBitSet(); acceptDocs = new Bits() { @Override public boolean get(int index) { return bitSet.get(index) && (liveDocs == null || liveDocs.get(index)); } @Override public int length() { return maxDoc; } }; cost = bitSet.cardinality() * ctx.reader().numDocs() / maxDoc; TopDocs results = approximateSearch(ctx, acceptDocs, cost); if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) { return results; } else { FilteredDocIdSetIterator filterIterator = new FilteredDocIdSetIterator(iterator) { @Override protected boolean match(int doc) { return liveDocs == null || liveDocs.get(doc); } }; return exactSearch(ctx, filterIterator); } } ``` This way we won't traverse the `BitSet` even for deletions. If the limit is reached, we anyways have to iterate over the `iterator` to build a `BitSet`, so instead we can pass the filtered `iterator` to `exactSearch` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org