kaivalnp opened a new issue, #15869: URL: https://github.com/apache/lucene/issues/15869
### Description

A KNN query [short-circuits the HNSW search](https://github.com/apache/lucene/blob/83e3f9ac24ac282ae353d0e0566f64640fe919a3/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L345) if the "expected" number of nodes visited is >= number of filtered nodes.

A similarity-based vector query (i.e. `[Byte|Float]VectorSimilarityQuery`) attempts to find _all_ vectors with a score above a threshold (for Euclidean similarity, this can be imagined as all vectors within a radius of the query vector).

Assuming document vectors are evenly spread out across the n-dimensional space, should vector similarity scores form a normal distribution? If so, can we estimate the proportion of nodes visited using the area under the curve (from `resultSimilarity` -> `∞`) of a normal distribution, and apply the same short-circuit logic?
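
To make the idea concrete, here is a rough sketch of the estimate, not actual Lucene code. It assumes the similarity scores for a query are approximately normal with a known `mean` and `stdDev` (how these would be obtained is left open), and it uses the expected number of vectors scoring at or above `resultSimilarity` as a stand-in for the number of nodes visited. The class and method names (`SimilarityShortCircuitSketch`, `upperTail`, `expectedMatches`, `shouldSkipGraphSearch`) are hypothetical. Since the JDK has no built-in `erf`, the sketch uses the Abramowitz-Stegun 7.1.26 polynomial approximation.

```java
// Sketch only: all names here are hypothetical and not part of Lucene.
final class SimilarityShortCircuitSketch {

  // Abramowitz-Stegun 7.1.26 approximation of erf(x); absolute error is about 1.5e-7.
  static double erf(double x) {
    double sign = x < 0 ? -1.0 : 1.0;
    double ax = Math.abs(x);
    double t = 1.0 / (1.0 + 0.3275911 * ax);
    // Horner evaluation of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5
    double poly =
        t * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741
                    + t * (-1.453152027
                        + t * 1.061405429))));
    return sign * (1.0 - poly * Math.exp(-ax * ax));
  }

  // P(Z >= z) for a standard normal Z: the area under the curve from z to infinity.
  static double upperTail(double z) {
    return 0.5 * (1.0 - erf(z / Math.sqrt(2.0)));
  }

  // Expected number of vectors scoring >= resultSimilarity,
  // ASSUMING scores are distributed as N(mean, stdDev^2).
  static long expectedMatches(
      double resultSimilarity, double mean, double stdDev, long numVectors) {
    if (stdDev <= 0) {
      // Degenerate case: every score equals the mean.
      return resultSimilarity <= mean ? numVectors : 0;
    }
    double z = (resultSimilarity - mean) / stdDev;
    return Math.round(numVectors * upperTail(z));
  }

  // Same comparison the KNN short circuit is described as making:
  // skip the graph search when the expected visits reach the filtered count.
  static boolean shouldSkipGraphSearch(long expectedVisited, long filteredCount) {
    return expectedVisited >= filteredCount;
  }

  public static void main(String[] args) {
    // Illustrative numbers only, not measurements.
    long expected = expectedMatches(0.9, 0.5, 0.2, 1_000_000L);
    System.out.println("expected matches: " + expected);
    System.out.println("skip graph search: " + shouldSkipGraphSearch(expected, 10_000L));
  }
}
```

Note that the expected match count is at best a lower bound on the nodes a graph traversal would visit, so a real heuristic would probably need an extra factor on top of it, and the normality assumption itself would need to be validated on real data.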
