kaivalnp commented on code in PR #951:
URL: https://github.com/apache/lucene/pull/951#discussion_r905065952
##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -92,20 +91,40 @@ public KnnVectorQuery(String field, float[] target, int k,
Query filter) {
public Query rewrite(IndexReader reader) throws IOException {
TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
- BitSetCollector filterCollector = null;
+ Weight filterWeight = null;
if (filter != null) {
- filterCollector = new BitSetCollector(reader.leaves().size());
IndexSearcher indexSearcher = new IndexSearcher(reader);
BooleanQuery booleanQuery =
new BooleanQuery.Builder()
.add(filter, BooleanClause.Occur.FILTER)
.add(new FieldExistsQuery(field), BooleanClause.Occur.FILTER)
.build();
- indexSearcher.search(booleanQuery, filterCollector);
+ Query rewritten = indexSearcher.rewrite(booleanQuery);
+ filterWeight = indexSearcher.createWeight(rewritten,
ScoreMode.COMPLETE_NO_SCORES, 1f);
}
for (LeafReaderContext ctx : reader.leaves()) {
- TopDocs results = searchLeaf(ctx, filterCollector);
+ Bits acceptDocs;
+ int cost;
+ if (filterWeight != null) {
+ Scorer scorer = filterWeight.scorer(ctx);
+ if (scorer != null) {
+ DocIdSetIterator iterator = scorer.iterator();
+ if (iterator instanceof BitSetIterator) {
+ acceptDocs = ((BitSetIterator) iterator).getBitSet();
+ } else {
+ acceptDocs = BitSet.of(iterator, ctx.reader().maxDoc());
+ }
+ cost = (int) iterator.cost();
Review Comment:
The approach you mentioned makes sense, and I have added it into the latest
commit.
But I feel we are missing out on some performance because of inaccurate
estimation for `approximateSearch`. We could try @jpountz's suggestion of
pro-rating the cost of the `BitSet` (since we only want some estimation), and
performing `approximateSearch` on that. Something like:
```java
Bits acceptDocs, liveDocs = ctx.reader().getLiveDocs();
int maxDoc = ctx.reader().maxDoc(), cost;
DocIdSetIterator iterator = scorer.iterator();
if (iterator instanceof BitSetIterator bitSetIterator) {
BitSet bitSet = bitSetIterator.getBitSet();
acceptDocs = new Bits() {
@Override
public boolean get(int index) {
return bitSet.get(index) && (liveDocs == null || liveDocs.get(index));
}
@Override
public int length() {
return maxDoc;
}
};
cost = bitSet.cardinality() * ctx.reader().numDocs() / maxDoc;
TopDocs results = approximateSearch(ctx, acceptDocs, cost);
if (results.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
return results;
} else {
FilteredDocIdSetIterator filterIterator = new
FilteredDocIdSetIterator(iterator) {
@Override
protected boolean match(int doc) {
return liveDocs == null || liveDocs.get(doc);
}
};
return exactSearch(ctx, filterIterator);
}
}
```
This way we won't traverse the `BitSet` even for deletions. If the limit is
reached, we anyways have to iterate over the `iterator` to build a `BitSet`, so
instead we can pass the filtered `iterator` to `exactSearch`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]