kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766741617
> If I read correctly, this query ends up calling LeafReader#searchNearestNeighbors with k=Integer.MAX_VALUE No, we're calling the [new API](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L329) (from [here](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/RnnFloatVectorQuery.java#L57)) with a custom [`RnnCollector`](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/RnnCollector.java#L28) that performs score-based HNSW searches (as opposed to the [old API](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L246-L247) that performs `topK`-based searches with `k=Integer.MAX_VALUE`) The `Integer.MAX_VALUE` passed [here](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java#L38) is just used in two places: [`#exactSearch`](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L174) (to instantiate a priority queue of size `k`) and [`#mergeLeafResults`](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L216) (to request for the best-scoring `k` hits across all segment results). We're overriding both functions in our implementation of `AbstractRnnVectorQuery` (because we do not want to limit to `topK` results) I think you're worried that we'll end up performing brute-force KNN on all documents in the segment, and *then* retain vectors above the threshold? What we instead aim to do is: starting from the entry node in the last level of HNSW graphs, we keep visiting candidates as long as they are above the `traversalThreshold`, all the while adding nodes above the `resultThreshold` as accepted results This is not necessarily slower than normal HNSW searches, provided the `traversalThreshold` is chosen suitably -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org