[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479014#comment-17479014 ]
Michael Sokolov edited comment on LUCENE-10382 at 1/19/22, 10:54 PM:
---------------------------------------------------------------------

How would this look? An easy first step is to add a filter parameter to KnnVectorQuery:

{{public KnnVectorQuery(String field, float[] target, int k, Bits filter)}}

Then it can call {{LeafReader.searchNearestVectors}} with {{liveDocs.intersect(filter)}} instead of {{liveDocs}}.

[~julietibs] shared on list a link to a paper showing how the search degenerates for highly selective filters. The authors' approach was to fall back to "brute force" KNN when selectivity passes a fixed threshold. We could do that too, and it makes sense to me, but I guess the question is: where should this fallback happen in the API? The implementation of full (non-approximate) KNN (with a filter) only needs the VectorValues iterator, which the KnnVectorsReader already provides. It could be implemented as part of KnnVectorQuery. Is there a better place?

Hmm, previously [~jpountz] had suggested this API:

{{searchNearestNeighbors(String field, float[] target, Query filter)}}

taking a {{Query}} rather than a {{Bits}} as a filter. I guess that is more high-level, but in my mind a typical use case would be to use this with a precomputed filter (as a pre-filter) and then post-filter by embedding this KnnVectorQuery in another BooleanQuery. And given that we have to compute a full bitset in advance anyway, exposing a Query interface seems a bit like over-promising. It's clearer to just provide Bits, I think?

Another open question is how to determine the threshold for cutting over to full KNN, and whether that will be user configurable at all. Ideally we can just pick a percent coverage and make it fixed.

Hmm, I just realized another possible concern is that the vectors themselves may not be dense in the documents, and that will impact the coverage of the filter bits.
So to get an accurate coverage number, we'd in theory have to fully intersect the KnnVector bits (which docs have vectors in the graph) with the filter *and* the liveDocs, and compare the resulting cardinality with that of the graph. Although maybe an approximation is enough here: intersect a subset of the bits to estimate the total coverage?

> Allow KnnVectorQuery to operate over a subset of liveDocs
> ---------------------------------------------------------
>
>                 Key: LUCENE-10382
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10382
>             Project: Lucene - Core
>          Issue Type: Improvement
>   Affects Versions: 9.0
>            Reporter: Joel Bernstein
>            Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.
> This ticket will change the interface to make it possible for the top K
> vectors to be selected from a subset of the live docs.
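
For illustration only, here is a rough sketch of the fallback idea from the comment above: intersect liveDocs with the filter, estimate what fraction of the docs that have vectors survive, and run an exact (brute-force) top-k scan when that coverage drops below a threshold. Everything in it is an assumption made for the example. {{FilteredKnnSketch}} and its methods are hypothetical names, {{java.util.BitSet}} stands in for Lucene's {{Bits}}, a plain {{float[][]}} with null entries stands in for a sparse VectorValues iterator, and squared Euclidean distance stands in for whatever similarity the field uses. It is not Lucene code and does not touch the HNSW graph.

```java
import java.util.BitSet;
import java.util.PriorityQueue;

/** Hypothetical sketch; none of these names are Lucene APIs. */
final class FilteredKnnSketch {

  /** Docs that are both live and accepted by the filter. */
  static BitSet acceptedDocs(BitSet liveDocs, BitSet filter) {
    BitSet accepted = (BitSet) liveDocs.clone();
    accepted.and(filter);
    return accepted;
  }

  /**
   * Fraction of docs that have a vector and are also accepted.
   * vectors[doc] == null models a doc without a vector.
   * stride == 1 checks every doc; a larger stride samples every
   * stride-th doc to get a cheaper estimate.
   */
  static double coverage(float[][] vectors, BitSet accepted, int stride) {
    if (stride < 1) {
      throw new IllegalArgumentException("stride must be >= 1");
    }
    int withVector = 0;
    int passing = 0;
    for (int doc = 0; doc < vectors.length; doc += stride) {
      if (vectors[doc] == null) {
        continue; // no vector, so this doc is not part of the graph
      }
      withVector++;
      if (accepted.get(doc)) {
        passing++;
      }
    }
    return withVector == 0 ? 0.0 : (double) passing / withVector;
  }

  /** True when the filter is selective enough to prefer the exact scan. */
  static boolean useExactSearch(float[][] vectors, BitSet accepted, double threshold) {
    return coverage(vectors, accepted, 1) < threshold;
  }

  /** Squared Euclidean distance between two vectors of equal length. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Exact top-k over accepted docs that have a vector, nearest first. */
  static int[] exactSearch(float[][] vectors, float[] target, int k, BitSet accepted) {
    float[] dist = new float[vectors.length];
    // Max-heap by distance: the head is the worst candidate kept so far.
    PriorityQueue<Integer> heap =
        new PriorityQueue<>((x, y) -> Float.compare(dist[y], dist[x]));
    for (int doc = accepted.nextSetBit(0);
        doc >= 0 && doc < vectors.length;
        doc = accepted.nextSetBit(doc + 1)) {
      if (vectors[doc] == null) {
        continue;
      }
      dist[doc] = squaredDistance(vectors[doc], target);
      heap.add(doc);
      if (heap.size() > k) {
        heap.poll(); // drop the current worst to keep at most k docs
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = heap.poll(); // polls farthest first, so fill from the end
    }
    return result;
  }
}
```

A caller would build the accepted set once per segment, check {{useExactSearch}} against a chosen threshold, and only then decide between the exact scan and the approximate graph search. Passing a stride greater than 1 to {{coverage}} corresponds to the "intersect a subset of the bits" approximation, at the cost of a noisier estimate.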