Re: [PR] Add support for radius-based vector searches [lucene]


kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766741617


   > If I read correctly, this query ends up calling 
LeafReader#searchNearestNeighbors with k=Integer.MAX_VALUE
   
   No, we're calling the [new 
API](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L329)
 (from 
[here](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/RnnFloatVectorQuery.java#L57))
 with a custom 
[`RnnCollector`](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/RnnCollector.java#L28)
 that performs score-based HNSW searches (as opposed to the [old 
API](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L246-L247)
 that performs `topK`-based searches with `k=Integer.MAX_VALUE`)
   
   The `Integer.MAX_VALUE` passed 
[here](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java#L38)
 is just used in two places: 
[`#exactSearch`](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L174)
 (to instantiate a priority queue of size `k`) and 
[`#mergeLeafResults`](https://github.com/apache/lucene/blob/a384967b5ea631de940e327f4483888c80d09611/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L216)
 (to request for the best-scoring `k` hits across all segment results). We're 
overriding both functions in our implementation of `AbstractRnnVectorQuery` 
(because we do not want to limit to `topK` results)
   
   I think you're worried that we'll end up performing brute-force KNN on all 
documents in the segment, and *then* retain vectors above the threshold? What 
we instead aim to do is: starting from the entry node in the last level of HNSW 
graphs, we keep visiting candidates as long as they are above the 
`traversalThreshold`, all the while adding nodes above the `resultThreshold` as 
accepted results
   
   This is not necessarily slower than normal HNSW searches, provided the 
`traversalThreshold` is chosen suitably


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add support for radius-based vector searches [lucene]

Reply via email to