kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768862106

   > the Collector is full by flagging "incomplete" (I think this is possible) 
once a threshold is reached
   
   Do you mean that we return incomplete results?
   
   Instead, maybe we can:
   1. Ask the user for a sane limit on the number of nodes to visit
   2. If this limit is reached (possibly when the supplied `traversalThreshold` 
is too low), then we break out of HNSW search
   3. Now instead of performing an [eager 
`#exactSearch`](https://github.com/kaivalnp/lucene/blob/radius-based-vector-search/lucene/core/src/java/org/apache/lucene/search/AbstractRnnVectorQuery.java#L53-L74)
 and collecting everything into a list, we return a `TwoPhaseIterator` where 
the 
[`#matches`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TwoPhaseIterator.java#L112)
 call performs the underlying dot product comparison and returns `true` or 
`false` based on whether the computed score is above the `resultThreshold`
   4. This way, we can perform an "exact search" lazily, and only compute 
vector similarity on required documents (for example: if this query is a child 
of some `BooleanQuery`, then the actual number of documents for which we'll 
need to compute similarity is greatly reduced). The worst case will still be an 
exact search on all documents
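   
A minimal sketch of the lazy check in steps 3 and 4, written without any Lucene dependency so it runs on its own. Everything here is an assumption for illustration: `LazyRadiusMatcher`, `confirm`, and the in-memory `float[][]` are hypothetical stand-ins (in the real query, the `matches` logic would live inside a `TwoPhaseIterator`), and treating the `resultThreshold` bound as inclusive is also a guess.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, Lucene-free stand-in for the matches() phase of a
 * TwoPhaseIterator: similarity is computed only for docs that are asked about.
 */
class LazyRadiusMatcher {
    private final float[][] vectors;      // doc id -> vector (stand-in for stored vector values)
    private final float[] query;
    private final float resultThreshold;
    private int similarityComputations = 0;

    LazyRadiusMatcher(float[][] vectors, float[] query, float resultThreshold) {
        this.vectors = vectors;
        this.query = query;
        this.resultThreshold = resultThreshold;
    }

    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Per-doc check: depends only on this doc's vector and the query. */
    boolean matches(int docId) {
        similarityComputations++;
        // Assumption: the bound is inclusive.
        return dotProduct(vectors[docId], query) >= resultThreshold;
    }

    /** Confirms only the candidates proposed by some cheaper clause (e.g. a filter). */
    List<Integer> confirm(int[] candidateDocIds) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : candidateDocIds) {
            if (matches(docId)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    int similarityComputations() {
        return similarityComputations;
    }
}
```

If another clause narrows four documents down to two candidates, only two dot products are computed; with no narrowing, this degrades to the exact search over all documents mentioned above.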
   
   This "lazy-loading" works very well for our use case because whether a 
vector matches our query is independent of other vectors (unlike in 
K-NN, where given a query and an arbitrary doc vector, we cannot say whether 
the doc vector will be in the `topK` results of the query without also scoring 
the other vectors).
   
   Is this what you had in mind earlier @jpountz?
   
   > I will try and replicate with Lucene Util.
   
   Yes, I took inspiration from 
[`KnnGraphTester`](https://github.com/mikemccand/luceneutil/blob/master/src/main/KnnGraphTester.java)
 to write a local benchmark, but I may have made some silly mistakes. It'll be 
good to get an independent set of benchmark results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

