[I] Short-circuit HNSW search for similarity-based vector queries? [lucene]

via GitHub Tue, 24 Mar 2026 13:03:58 -0700


kaivalnp opened a new issue, #15869:
URL: https://github.com/apache/lucene/issues/15869


   ### Description
   
   A KNN query [short-circuits the HNSW search](https://github.com/apache/lucene/blob/83e3f9ac24ac282ae353d0e0566f64640fe919a3/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L345)
 if the "expected" number of nodes visited is >= the number of filtered nodes.
   
   A similarity-based vector query (i.e. `[Byte|Float]VectorSimilarityQuery`) 
attempts to find _all_ vectors with a score above a threshold (for Euclidean 
similarity, this can be imagined as all vectors within a radius of the query 
vector).
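
To make the "radius" picture concrete, here is a minimal sketch. It assumes Lucene's Euclidean score is `1 / (1 + squaredDistance)`, which is how I understand `VectorSimilarityFunction.EUCLIDEAN` to behave. The class and method names are hypothetical and not part of Lucene.

```java
/** Hypothetical helper (not a Lucene class) relating a score threshold to a Euclidean radius. */
final class EuclideanRadiusSketch {

  private EuclideanRadiusSketch() {}

  /** Score assumed to match Lucene's EUCLIDEAN function: 1 / (1 + squared distance). */
  static float score(float[] a, float[] b) {
    float squaredDistance = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      squaredDistance += diff * diff;
    }
    return 1f / (1f + squaredDistance);
  }

  /**
   * Inverts the score formula: every vector scoring at least resultSimilarity
   * lies within this distance of the query vector.
   * Expects 0 < resultSimilarity <= 1.
   */
  static double radiusForScore(double resultSimilarity) {
    return Math.sqrt(1.0 / resultSimilarity - 1.0);
  }
}
```

Under that assumption, a threshold of `0.5` corresponds to a radius of `1`, and lower thresholds widen the ball around the query vector.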
   
   Assuming document vectors are evenly spread out across the n-dimensional 
space, should their similarity scores to a query vector form a normal distribution?
   
   If so, can we estimate the proportion of nodes visited using area under the 
curve (from `resultSimilarity` -> `∞`) of a normal distribution? (and apply the 
same short circuit logic)
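
A rough sketch of what that estimate could look like, assuming the scores really are normally distributed and that their mean and standard deviation are known or sampled somehow. All names here are hypothetical, none of this is existing Lucene API, and the error function uses the Abramowitz-Stegun 7.1.26 approximation rather than a library call. It also treats the expected number of matching vectors as a stand-in for the number of nodes visited, which is an assumption of this sketch.

```java
/** Hypothetical sketch (not Lucene code) of a normal-tail estimate for the short circuit. */
final class SimilarityShortCircuitSketch {

  private SimilarityShortCircuitSketch() {}

  /** Abramowitz-Stegun 7.1.26 approximation of erf(x); absolute error is about 1.5e-7. */
  static double erf(double x) {
    double sign = x < 0 ? -1.0 : 1.0;
    double ax = Math.abs(x);
    double t = 1.0 / (1.0 + 0.3275911 * ax);
    double poly =
        t * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741
                    + t * (-1.453152027
                        + t * 1.061405429))));
    return sign * (1.0 - poly * Math.exp(-ax * ax));
  }

  /** Area under Normal(mean, stdDev) from threshold to infinity. */
  static double upperTailFraction(double threshold, double mean, double stdDev) {
    double z = (threshold - mean) / (stdDev * Math.sqrt(2.0));
    return 0.5 * (1.0 - erf(z));
  }

  /**
   * Mirrors the KNN-style check: skip graph search when the expected number of
   * visited nodes is at least the number of nodes that pass the filter.
   */
  static boolean shouldShortCircuit(
      double resultSimilarity, double mean, double stdDev, int totalVectors, int filteredVectors) {
    double expectedVisited = upperTailFraction(resultSimilarity, mean, stdDev) * totalVectors;
    return expectedVisited >= filteredVectors;
  }
}
```

For instance, with scores assumed to follow Normal(0.5, 0.1), a threshold three standard deviations above the mean leaves a tail of roughly 0.13% of the vectors, so a filter matching 400 of 1000 vectors would not trigger the short circuit, while a threshold at the mean (about half the vectors) would.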



