[ https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481465#comment-17481465 ]
Julie Tibshirani edited comment on LUCENE-10382 at 1/25/22, 12:21 AM:
----------------------------------------------------------------------

[~sokolov] your multi-step approach makes sense to me. Maybe we could move forward with [~jbernste]'s PR as a way to get the query API down. I think we should introduce the fallback before releasing the change, though. Otherwise KnnVectorQuery could easily degrade to brute-force performance (or worse!), which could really catch users by surprise.

I've been pondering how to automatically choose when to fall back. It indeed seems best to compare the cost of HNSW vs. an exact scan, instead of choosing an arbitrary cut-off like "filter matches less than 10% of docs". Since the cost trade-off depends on data size and other factors, it doesn't seem possible to find a constant cutoff that works well across situations. To make it practical we might need to expose it as a parameter, which is not very user-friendly. So, thinking about [~jpountz]'s suggestion for a cost model...

{quote}I wonder if we could develop a cost model of both approaches assuming a random distribution of deletions, filter matches and vector values, so that the query could compute the cost in both cases for specific values of k, Scorer#cost and LeafReader#numDeletedDocs (and maybe some HNSW-specific parameters like beamWidth?), and pick the approach that has the lesser cost.{quote}

I played around with this, looking at HNSW latencies for various filter selectivities. Latency roughly scales with _log(n) * 1/p_, where _p_ is the filter selectivity. (This sort of makes sense: HNSW is roughly logarithmic, but it needs to search more and more of the graph as nodes are filtered away.) But to compare to brute force, we need a pretty good handle on the scaling constants. These are really hard to pin down – they depend on the HNSW build parameters, the search parameter k, and even properties of the dataset.
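To make the difficulty concrete, here is a minimal sketch of what such a cost comparison might look like under the _log(n) * 1/p_ scaling above. The constants C_HNSW and C_EXACT are hypothetical placeholders – pinning them down is exactly the hard part, since they depend on the build parameters, k, and the dataset:

```java
public class KnnCostModel {
  // Hypothetical scaling constants: in practice they depend on the HNSW
  // build parameters, the search parameter k, and the dataset itself.
  static final double C_HNSW = 50.0;
  static final double C_EXACT = 1.0;

  /** Rough HNSW cost under a filter of selectivity p, per the log(n) * 1/p observation. */
  static double hnswCost(int numDocs, double p) {
    return C_HNSW * Math.log(numDocs) / p;
  }

  /** Rough exact-scan cost: one vector comparison per filter match. */
  static double exactCost(int numDocs, double p) {
    return C_EXACT * p * numDocs;
  }

  /** Pick the cheaper strategy for a given filter. */
  static boolean preferExactScan(int numDocs, int filterMatches) {
    double p = (double) filterMatches / numDocs;
    return exactCost(numDocs, p) < hnswCost(numDocs, p);
  }

  public static void main(String[] args) {
    System.out.println(preferExactScan(1_000_000, 100));     // very selective filter
    System.out.println(preferExactScan(1_000_000, 900_000)); // permissive filter
  }
}
```

Even in this toy form, the crossover point moves wherever the made-up constants put it, which is why a static model seems fragile.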
Looking at the HNSW paper, it gives big-O complexity (under some theoretical assumptions) and doesn't show the impact of beamWidth or maxConn. Instead of developing a static cost model, I wonder if we could try to make the choice dynamically:
* If the filter is extremely selective (say it matches fewer than k docs), perform an exact scan.
* Otherwise, always begin an HNSW search, even when the filter is pretty selective. If the graph search has already visited more than _p_ of the total documents (or some fraction of that), stop the search and switch to a brute-force scan. This bounds the overall cost at 2x the exact scan.


> Allow KnnVectorQuery to operate over a subset of liveDocs
> ---------------------------------------------------------
>
>                 Key: LUCENE-10382
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10382
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 9.0
>            Reporter: Joel Bernstein
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.
> This ticket will change the interface to make it possible for the top K
> vectors to be selected from a subset of the live docs.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
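The dynamic fallback described in the comment above could be sketched roughly as follows. Everything here is illustrative, not Lucene's actual API: `graphStep` is a hypothetical stand-in for one step of the HNSW search, and the visited-node budget is simply the number of filter matches:

```java
import java.util.function.IntSupplier;

public class FilteredKnn {

  /**
   * Dynamic fallback sketch. graphStep returns the running count of nodes
   * visited by the graph search, or -1 once the search has finished.
   * Returns true if we fell back to an exact scan, false if HNSW completed.
   */
  static boolean runWithFallback(int k, int filterMatches, IntSupplier graphStep) {
    // 1. Extremely selective filter: fewer than k matches, so scoring
    //    every match exactly is trivially cheapest.
    if (filterMatches < k) {
      return exactScan();
    }
    // 2. Otherwise start the graph search, but cap the number of visited
    //    nodes at the number of filter matches (or some fraction of it).
    //    Abandoning at this point bounds combined cost at ~2x the exact scan.
    int visitedLimit = filterMatches;
    int visited;
    while ((visited = graphStep.getAsInt()) != -1) {
      if (visited > visitedLimit) {
        return exactScan(); // budget blown: switch to brute force
      }
    }
    return false; // HNSW search completed within budget
  }

  static boolean exactScan() {
    // Stand-in for scoring every filter match exactly.
    return true;
  }

  public static void main(String[] args) {
    // Filter matches fewer than k docs: exact scan immediately.
    System.out.println(runWithFallback(10, 5, () -> -1));

    // Graph search finishes after visiting 200 nodes, under the 1000-node budget.
    int[] count = {0};
    System.out.println(runWithFallback(10, 1000, () -> (++count[0] <= 200) ? count[0] : -1));
  }
}
```

A nice property of this shape is that work done by the abandoned graph search is wasted at most once, so the worst case stays within a constant factor of the exact scan rather than depending on hard-to-estimate HNSW constants.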