Re: [PR] Add support for similarity-based vector searches [lucene]

via GitHub Fri, 10 Nov 2023 10:13:48 -0800


kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1806196196


   Summary of new changes:
   1. Refactor into a more appropriate query
       - Move away from `AbstractKnnVectorQuery` to take advantage of inherent 
independence of segment-level results
       - KNN queries need to execute the core logic in `#rewrite` because of an 
inter-dependence of segment-level results (that is, given N segment-level hits 
we cannot determine if they will appear in the index-level `topK` without 
knowing results from other segments). This leads to requirements of [custom 
concurrency](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L82-L88)
 for individual HNSW searches, which should ideally be parallel by default
       - We can move graph searches down to a more appropriate place (like 
`#scorer`) to take advantage of this
   
   2. Return a lazy-loading iterator instead of a greedy exact search (thanks 
@jpountz!)
       - Introduce a `visitLimit` on the number of nodes to traverse before 
stopping graph search - deeming it "too expensive". Once this is exhausted, 
return a lazy-loading iterator on all vectors (functionally equivalent to an 
exact search)
       - Unlike KNN queries, which need to traverse all vectors to determine 
which ones are present in the `topK` best-scoring ones, a similarity-based 
vector search can independently determine if a vector is a result or not (based 
on whether its similarity with the query is above a `resultSimilarity`)
       - Making use of this behavior, we can prevent a greedy exact search for 
collecting all matching docs into a list on heap, and determine if a vector is 
a match inside a `FilteredDocIdSetIterator`
       - This has a huge benefit when the query will be one of the clauses of a 
`BooleanQuery` (so other clauses will filter out non-matching docs and this 
query will only compute similarity scores with already filtered vectors). In 
the worst case, this will consider all vectors (same as exact search)
       - We also have useful information from graph search - mainly which hit 
was evaluated, and which hit was collected. This information can be re-used 
from the iterator: if a hit has been traversed, it will either be added to the 
results, or discarded. If it is present in the results, we simply lookup the 
score, otherwise mark it as rejected
       - If a vector has not been traversed in graph search, we compute its 
similarity score (so each query - document pair will only compute similarity 
scores once)
   
   Please let me know if this approach makes sense?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add support for similarity-based vector searches [lucene]

Reply via email to