[ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479014#comment-17479014
 ] 

Michael Sokolov edited comment on LUCENE-10382 at 1/19/22, 10:54 PM:
---------------------------------------------------------------------

How would this look? An easy first step is to add a filter parameter to 
KnnVectorQuery

{{  public KnnVectorQuery(String field, float[] target, int k, Bits filter)}}

then it can call {{LeafReader.searchNearestVectors}} with 
{{liveDocs.intersect(filter)}} instead of {{liveDocs}}.
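
To make the intersection concrete: as far as I know, {{Bits}} itself has no {{intersect}} method, so the expression above is shorthand. A minimal sketch of the idea, using {{java.util.BitSet}} as a stand-in for Lucene's {{Bits}} (the class name {{AcceptDocs}} is hypothetical), and treating a null {{liveDocs}} as "no deletions":

```java
import java.util.BitSet;

final class AcceptDocs {
  /**
   * Returns the docs that are both live and accepted by the filter.
   * A null liveDocs is treated as "every doc is live" (an assumption of this sketch).
   */
  static BitSet intersect(BitSet liveDocs, BitSet filter) {
    // Copy the filter so the caller's bitset is not modified.
    BitSet accepted = (BitSet) filter.clone();
    if (liveDocs != null) {
      accepted.and(liveDocs);
    }
    return accepted;
  }
}
```

The result would then be passed wherever {{liveDocs}} is passed today.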

[~julietibs] shared on the list a link to a paper showing how the search 
degenerates for highly selective filters. The authors' approach was to fall 
back to "brute force" KNN when selectivity passes a fixed threshold. We could 
do that too, and it makes sense to me, but I guess the question is: where 
should this fallback happen in the API?

The implementation of full (non-approximate) KNN (with a filter) only needs the 
VectorValues iterator which the KnnVectorsReader already provides. It could be 
implemented as part of KnnVectorQuery. Is there a better place?
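
For illustration, here is roughly what full KNN with a filter could look like. This is only a sketch under assumptions, not the Lucene API: vectors are a plain {{float[][]}} indexed by doc id (null where a doc has no vector) instead of a {{VectorValues}} iterator, the filter is a {{java.util.BitSet}}, similarity is squared Euclidean distance, and the class name {{ExactKnn}} is hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class ExactKnn {
  /** Squared Euclidean distance; smaller means closer. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Scans every accepted doc and returns the ids of the k nearest vectors,
   * closest first. vectors[doc] is null when the doc has no vector.
   */
  static List<Integer> search(float[][] vectors, float[] target, int k, BitSet accepted) {
    List<Integer> result = new ArrayList<>();
    if (k <= 0) {
      return result;
    }
    float[] dist = new float[vectors.length];
    // Max-heap on distance: the head is the worst of the best k seen so far.
    PriorityQueue<Integer> heap =
        new PriorityQueue<>((a, b) -> Float.compare(dist[b], dist[a]));
    for (int doc = accepted.nextSetBit(0);
        doc >= 0 && doc < vectors.length;
        doc = accepted.nextSetBit(doc + 1)) {
      if (vectors[doc] == null) {
        continue; // sparse field: this doc has no vector
      }
      dist[doc] = squaredDistance(vectors[doc], target);
      if (heap.size() < k) {
        heap.add(doc);
      } else if (dist[doc] < dist[heap.peek()]) {
        heap.poll(); // evict the current worst hit
        heap.add(doc);
      }
    }
    // The heap yields worst-first, so drain it and reverse.
    while (!heap.isEmpty()) {
      result.add(heap.poll());
    }
    Collections.reverse(result);
    return result;
  }
}
```

The cost is linear in the number of accepted docs, which is why it only looks attractive once the filter is selective.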

Hmm, previously [~jpountz] had suggested this API:

{{   searchNearestNeighbors(String field, float[] target, Query filter)}}

taking a {{Query}} rather than a {{Bits}} as a filter. I guess that is more 
high-level, but in my mind a typical use case would be to use this with a 
precomputed filter (as a pre-filter) and then post-filter by embedding this 
KnnVectorQuery in another BooleanQuery. And given that we have to compute a 
full bitset in advance anyway, exposing a Query interface seems a bit like 
over-promising. It's clearer to just provide Bits, I think?

Another open question is how to determine the threshold for cutting over to 
full KNN, and whether that will be user-configurable at all. Ideally we can 
just pick a percent coverage and make it fixed.

Hmm, I just realized another possible concern is that the vectors themselves may 
not be dense in the documents, and that will impact the coverage of the filter 
bits. So to get an accurate coverage number, we'd in theory have to fully 
intersect the KnnVector bits (which docs have vectors in the graph) with the 
filter *and* the liveDocs, and compare the cardinality with that of the graph. 
Although maybe an approximation here is enough: intersect a subset of the bits 
to estimate the total coverage?
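
A rough sketch of that approximation, again with {{java.util.BitSet}} standing in for the real structures. Everything here is an assumption for illustration: the class name {{CoverageEstimate}}, the fixed-stride sampling, and the null-means-all-live convention for {{liveDocs}}.

```java
import java.util.BitSet;

final class CoverageEstimate {
  /**
   * Estimates |hasVector AND filter AND liveDocs| / |hasVector| by checking
   * only every stride-th doc id. A stride of 1 gives the exact ratio.
   * A null liveDocs is treated as "every doc is live".
   */
  static double estimate(
      BitSet hasVector, BitSet filter, BitSet liveDocs, int maxDoc, int stride) {
    if (stride <= 0) {
      throw new IllegalArgumentException("stride must be positive");
    }
    int withVector = 0; // sampled docs that have a vector
    int accepted = 0;   // of those, docs that are live and pass the filter
    for (int doc = 0; doc < maxDoc; doc += stride) {
      if (!hasVector.get(doc)) {
        continue;
      }
      withVector++;
      boolean live = liveDocs == null || liveDocs.get(doc);
      if (live && filter.get(doc)) {
        accepted++;
      }
    }
    return withVector == 0 ? 0.0 : (double) accepted / withVector;
  }
}
```

One caveat: a fixed stride can be biased if doc id order correlates with the filter (for example, an index sorted by the filtered field), so random sampling might be the safer choice.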


> Allow KnnVectorQuery to operate over a subset of liveDocs
> ---------------------------------------------------------
>
>                 Key: LUCENE-10382
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10382
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 9.0
>            Reporter: Joel Bernstein
>            Priority: Major
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.
