[GitHub] [lucene] kaivalnp opened a new issue, #12579: [DISCUSS] Should there be a threshold-based vector search API?

via GitHub Wed, 20 Sep 2023 09:26:07 -0700


kaivalnp opened a new issue, #12579:
URL: https://github.com/apache/lucene/issues/12579


   ### Context
   
   Almost all [vector search 
algorithms](https://ann-benchmarks.com/index.html#algorithms) focus on getting 
the `topK` results for a given query vector. This however, may not be the best 
objective to pursue because:
   
   - Some queries are more suited for vector search: Indexed vectors may not be 
uniformly distributed, with some parts of the vector space denser than others. 
If the query is present in a sparse part of the space, the `topK` results will 
be low scoring (and not relevant to the query). We may want some kind of score 
`threshold`, below which a vector is not considered as a "match"?
   - Conversely, if the query is present in a dense part of the vector space, 
we may not want to limit vector search to `topK` when we have some more 
(slightly lower scoring) results, but still above the `threshold`. Considering 
a query and a vector as a "match" if their similarity score is above the 
`threshold`, we may want ALL possible matches?
   
   I found the same thing being discussed in #11946 as well.. However, I'm not 
aware of any algorithms designed for this objective (or how common of a 
use-case this may be)
   
   ### Proposal
   
   **IF** this turns out to be a valid use-case, HNSW may not be the best 
algorithm for this new objective (as it was designed for getting the K-Nearest 
Neighbors). On a high-level, the current algorithm is:
   - Maintain a queue of `topK` results (initially empty) and an unbounded 
queue of candidates (initially containing the entry node). Both of these are 
priority queues in descending order of similarity scores
   - Starting from the best (highest scoring) candidate, we insert it into the 
result queue if there are (1) fewer than `topK` results (2) it is better 
scoring than any existing result (also removing the least scoring result in 
this case)
   - If a node is added as a result, consider all its neighbors as further 
candidates
   - The search stops when the best scoring candidate is worse than the least 
scoring result
   
   For the new objective, the underlying graphs created by HNSW may still be 
useful - where each node is connected to its nearest (and diverse) neighbors, 
so we have some sense of direction / ordering on which node to explore further
   
   Can we instead change the graph traversal (and result collection) criteria 
such that:
   - Consider a node as a candidate if it exceeds a lower, graph-traversal 
specific score threshold
   - Consider a candidate as a result if it exceeds the actual, query-result 
similarity score threshold
   
   The lower graph-traversal threshold acts as a buffer, to retain results 
where all nodes along its path may not be above the actual `threshold` (and is 
treated as a search-time hyper-parameter, chosen according to the underlying 
vector space and queries)
   
   The upper levels of search will remain the same (find the single-best entry 
point for the next layer) and this new criteria is only applicable to the 
lowest layer (with all nodes) to collect an unbounded number of results above 
the `threshold`
   
   This will also eliminate the need to maintain results and candidates as 
priority queues, instead keeping them as simple lists (reducing the overhead of 
`heapify` calls that maintain the min-max heap property) and just sort them 
once at the end
   
   **Looking for inputs on both the use-case and tweaked algorithm..**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] kaivalnp opened a new issue, #12579: [DISCUSS] Should there be a threshold-based vector search API?

Reply via email to