kaivalnp opened a new issue, #12579: URL: https://github.com/apache/lucene/issues/12579
### Context Almost all [vector search algorithms](https://ann-benchmarks.com/index.html#algorithms) focus on getting the `topK` results for a given query vector. This however, may not be the best objective to pursue because: - Some queries are more suited for vector search: Indexed vectors may not be uniformly distributed, with some parts of the vector space denser than others. If the query is present in a sparse part of the space, the `topK` results will be low scoring (and not relevant to the query). We may want some kind of score `threshold`, below which a vector is not considered as a "match"? - Conversely, if the query is present in a dense part of the vector space, we may not want to limit vector search to `topK` when we have some more (slightly lower scoring) results, but still above the `threshold`. Considering a query and a vector as a "match" if their similarity score is above the `threshold`, we may want ALL possible matches? I found the same thing being discussed in #11946 as well.. However, I'm not aware of any algorithms designed for this objective (or how common of a use-case this may be) ### Proposal **IF** this turns out to be a valid use-case, HNSW may not be the best algorithm for this new objective (as it was designed for getting the K-Nearest Neighbors). On a high-level, the current algorithm is: - Maintain a queue of `topK` results (initially empty) and an unbounded queue of candidates (initially containing the entry node). Both of these are priority queues in descending order of similarity scores - Starting from the best (highest scoring) candidate, we insert it into the result queue if there are (1) fewer than `topK` results (2) it is better scoring than any existing result (also removing the least scoring result in this case) - If a node is added as a result, consider all its neighbors as further candidates - The search stops when the best scoring candidate is worse than the least scoring result For the new objective, the underlying graphs created by HNSW may still be useful - where each node is connected to its nearest (and diverse) neighbors, so we have some sense of direction / ordering on which node to explore further Can we instead change the graph traversal (and result collection) criteria such that: - Consider a node as a candidate if it exceeds a lower, graph-traversal specific score threshold - Consider a candidate as a result if it exceeds the actual, query-result similarity score threshold The lower graph-traversal threshold acts as a buffer, to retain results where all nodes along its path may not be above the actual `threshold` (and is treated as a search-time hyper-parameter, chosen according to the underlying vector space and queries) The upper levels of search will remain the same (find the single-best entry point for the next layer) and this new criteria is only applicable to the lowest layer (with all nodes) to collect an unbounded number of results above the `threshold` This will also eliminate the need to maintain results and candidates as priority queues, instead keeping them as simple lists (reducing the overhead of `heapify` calls that maintain the min-max heap property) and just sort them once at the end **Looking for inputs on both the use-case and tweaked algorithm..** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org