kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1806196196
Summary of new changes:
1. Refactor into a more appropriate query
- Move away from `AbstractKnnVectorQuery` to take advantage of inherent
independence of segment-level results
- KNN queries need to execute the core logic in `#rewrite` because of an
inter-dependence of segment-level results (that is, given N segment-level hits
we cannot determine if they will appear in the index-level `topK` without
knowing results from other segments). This leads to requirements of [custom
concurrency](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L82-L88)
for individual HNSW searches, which should ideally be parallel by default
- We can move graph searches down to a more appropriate place (like
`#scorer`) to take advantage of this
2. Return a lazy-loading iterator instead of a greedy exact search (thanks
@jpountz!)
- Introduce a `visitLimit` on the number of nodes to traverse before
stopping graph search - deeming it "too expensive". Once this is exhausted,
return a lazy-loading iterator on all vectors (functionally equivalent to an
exact search)
- Unlike KNN queries, which need to traverse all vectors to determine
which ones are present in the `topK` best-scoring ones, a similarity-based
vector search can independently determine if a vector is a result or not (based
on whether its similarity with the query is above a `resultSimilarity`)
- Making use of this behavior, we can prevent a greedy exact search for
collecting all matching docs into a list on heap, and determine if a vector is
a match inside a `FilteredDocIdSetIterator`
- This has a huge benefit when the query will be one of the clauses of a
`BooleanQuery` (so other clauses will filter out non-matching docs and this
query will only compute similarity scores with already filtered vectors). In
the worst case, this will consider all vectors (same as exact search)
- We also have useful information from graph search - mainly which hit
was evaluated, and which hit was collected. This information can be re-used
from the iterator: if a hit has been traversed, it will either be added to the
results, or discarded. If it is present in the results, we simply lookup the
score, otherwise mark it as rejected
- If a vector has not been traversed in graph search, we compute its
similarity score (so each query - document pair will only compute similarity
scores once)
Please let me know if this approach makes sense?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]