kaivalnp opened a new pull request, #12679:
URL: https://github.com/apache/lucene/pull/12679

   ### Description
   
   Background in #12579
   
   Add support for getting "all vectors within a radius" as opposed to getting 
the "topK closest vectors" in the current system
   
   ### Considerations
   
   I've tried to keep this change minimal and non-invasive by not modifying any 
APIs and re-using existing HNSW graphs -- changing the graph traversal and 
result collection criteria to:
   1. Visit all nodes (reachable from the entry node in the last level) that 
are within an outer "traversal" radius
   2. Collect all nodes that are within an inner "result" radius
   
   ### Advantages
   
   1. Queries that have a high number of "relevant" results will get all of 
those (not limited by `topK`)
   2. Conversely, arbitrary queries where many results are not "relevant" will 
not waste time in getting all `topK` (when some of them will be removed later)
   3. Results of HNSW searches need not be sorted - and we can store them in a 
plain list as opposed to min-max heaps (saving on `heapify` calls). Merging 
results from segments is also cheaper, where we just concatenate results as 
opposed to calculating the index-level `topK`
   
   On a higher level, finding `topK` results needed HNSW searches to happen in 
`#rewrite` because of an interdependence of results between segments - where we 
want to find the index-level `topK` from multiple segment-level results. This 
is kind of against Lucene's concept of segments being independently searchable 
sub-indexes?
   
   Moreover, we needed explicit concurrency (#12160) to perform these in 
parallel, and these shortcomings would be naturally overcome with the new 
objective of finding "all vectors within a radius" - inherently independent of 
results from another segment (so we can move searches to a more fitting place?)
   
   ### Caveats
   
   I could not find much precedent in using HNSW graphs this way (or even the 
radius-based search for that matter - please add links to existing work if 
someone is aware) and consequently marked all classes as `@lucene.experimental`
   
   For now I have re-used lots of functionality from `AbstractKnnVectorQuery` 
to keep this minimal, but if the use-case is accepted more widely we can look 
into writing more suitable queries (as mentioned above briefly)
   
   ### Next steps
   
   Run benchmarks with this new query to see how it compares to the `topK` 
based search


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to