[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481465#comment-17481465 ]

Julie Tibshirani edited comment on LUCENE-10382 at 1/25/22, 12:20 AM:
----------------------------------------------------------------------

[~sokolov] your multi-step approach makes sense to me. Maybe we could move 
forward with [~jbernste]'s PR as a way to get the query API down. I think we 
should introduce the fallback before releasing the change though. Otherwise 
KnnVectorQuery could easily degrade to brute force performance (or worse!), 
which could really catch users by surprise.

I've been pondering how to automatically choose when to fall back. It indeed 
seems best to compare the cost of HNSW vs. an exact scan, instead of choosing 
an arbitrary cut-off like "filter matches less than 10% of docs". Since the 
cost trade-off depends on data size and other factors, it doesn't seem possible 
to find a constant cutoff that works well across situations. To make it 
practical we might need to expose it as a parameter, which is not very 
user-friendly. So thinking about [~jpountz]'s suggestion for a cost model...
{quote}I wonder if we could develop a cost model of both approaches assuming a 
random distribution of deletions, filter matches and vector values, so that the 
query could compute the cost in both cases for specific values of k, 
Scorer#cost and LeafReader#numDeletedDocs (and maybe some HNSW-specific 
parameters like beamWidth?), and pick the approach that has the lesser cost.
{quote}
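
For illustration, the per-segment decision described above might look something 
like the sketch below. All names here are hypothetical, and the two cost 
estimates are left abstract on purpose – defining them well is exactly the hard 
part:
{code:java}
// Hypothetical sketch of the decision in the quote above: estimate the cost of
// both approaches per segment and pick the cheaper one. The estimates
// themselves are deliberately left abstract.
final class StaticCostBasedChoice {

  interface CostModel {
    /** Expected work for an exact scan over the filter matches
     *  (filterMatches would come from something like Scorer#cost). */
    double exactScanCost(long filterMatches);

    /** Expected work for a filtered HNSW search; presumably a function of k,
     *  the live doc count (maxDoc - numDeletedDocs), and graph parameters
     *  like beamWidth and maxConn. */
    double hnswCost(long filterMatches, int liveDocCount, int k);
  }

  static boolean preferExactScan(CostModel model, long filterMatches,
                                 int liveDocCount, int k) {
    return model.exactScanCost(filterMatches)
        <= model.hnswCost(filterMatches, liveDocCount, k);
  }
}
{code}
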
I played around with this idea, looking at HNSW latencies across a range of 
filter selectivities. The latency roughly scales with {_}log(n) * 1/p{_}, where 
_p_ is the filter selectivity. (This sort of makes sense: HNSW is roughly 
logarithmic, but it needs to search more and more of the graph as nodes are 
filtered away.) But to compare against brute force, we need a pretty good handle 
on the scaling constants, and these are really hard to pin down – they depend on 
the HNSW build parameters, the search parameter k, and even properties of the 
dataset. The HNSW paper only gives big-O complexity (under some theoretical 
assumptions) and doesn't show the impact of beamWidth or maxConn.
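
To spell out why those constants matter so much: if we write the HNSW cost as 
roughly C * log(n) / p (with C an unknown constant absorbing beamWidth, maxConn, 
k and the dataset) and the exact scan as roughly p * n comparisons (one per 
filter match), then the break-even point is governed entirely by C, and it moves 
by orders of magnitude as p changes. A purely illustrative back-of-the-envelope 
version:
{noformat}
\text{cost}_{\text{HNSW}} \approx C \cdot \frac{\log n}{p}
\qquad
\text{cost}_{\text{exact}} \approx p \cdot n

\text{HNSW is cheaper when } C \cdot \frac{\log n}{p} < p \cdot n
\iff C < \frac{p^{2}\,n}{\log n}

n = 10^{6}\ (\text{natural log}):\quad
p = 0.01 \Rightarrow C \lesssim 7, \qquad
p = 0.1 \Rightarrow C \lesssim 700
{noformat}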

Instead of developing a static cost model, I wonder if we could try to make the 
choice dynamically:
 * If the filter is extremely selective (say it matches fewer than k docs), 
perform an exact scan.
 * Otherwise we always begin an HNSW search, even when the filter is pretty 
selective. If the graph search has already visited more documents than the 
filter matches – a fraction _p_ of the total docs, or some fraction of that – we 
stop the search and switch to a brute force scan over the filter matches. This 
bounds the overall cost at 2x the exact scan. (A rough sketch of this two-phase 
strategy follows below.)
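
Roughly, the per-segment control flow I have in mind is sketched below. 
GraphSearcher, ExactScanner and AcceptDocs are made-up stand-ins (not the real 
Lucene APIs), just to show where the visited-node budget and the fallback would 
sit:
{code:java}
// Hypothetical sketch of the dynamic fallback; the interfaces are stand-ins,
// not real Lucene classes.
final class DynamicKnnStrategy {

  interface AcceptDocs {
    long cardinality();   // approximate number of docs the filter matches
  }

  interface GraphSearcher {
    /** Runs the HNSW search but gives up once more than visitLimit nodes have
     *  been visited; returns null in that case. */
    TopHits search(float[] query, int k, AcceptDocs filter, long visitLimit);
  }

  interface ExactScanner {
    /** Scores every doc accepted by the filter and returns the top k. */
    TopHits scanAll(float[] query, int k, AcceptDocs filter);
  }

  static final class TopHits {}   // placeholder for the result type

  static TopHits search(float[] query, int k, AcceptDocs filter,
                        GraphSearcher graph, ExactScanner exact) {
    long filterMatches = filter.cardinality();

    // 1. Extremely selective filter: just score the few matching docs directly.
    if (filterMatches <= k) {
      return exact.scanAll(query, k, filter);
    }

    // 2. Otherwise start with HNSW, but cap the number of visited nodes at the
    //    cost of an exact scan (or some fraction of it).
    TopHits hits = graph.search(query, k, filter, /* visitLimit= */ filterMatches);
    if (hits != null) {
      return hits;
    }

    // 3. The graph search exhausted its budget before finishing: fall back.
    //    Total work stays within ~2x of the exact scan.
    return exact.scanAll(query, k, filter);
  }
}
{code}
The visitLimit could of course be some fraction of the filter cardinality rather 
than the full count, trading a tighter bound for earlier fallbacks.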



> Allow KnnVectorQuery to operate over a subset of liveDocs
> ---------------------------------------------------------
>
>                 Key: LUCENE-10382
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10382
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 9.0
>            Reporter: Joel Bernstein
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.


