[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454894#comment-17454894 ]
Julie Tibshirani commented on LUCENE-10040: ------------------------------------------- I ran some local tests to sanity check the behavior with deletions. First, I ran the glove-100-angular benchmark (described in https://issues.apache.org/jira/browse/LUCENE-9937), with efConst=100, M=32, force-merged to one segment. I randomly deleted 20% of docs and checked the performance. Note: recall is measured against the true nearest neighbors from the non-deleted set (I made sure to recompute these true neighbors). The performance looks good: {code} NumCands Recall DistComps QPS **No deletions** 10 0.459 505 4985.346 50 0.724 1445 2396.820 80 0.776 2053 1758.134 100 0.797 2437 1506.489 500 0.912 9232 418.167 800 0.934 13862 277.354 **20% deletions** 10 0.488 576 4587.963 50 0.741 1695 1966.009 80 0.789 2424 1220.389 100 0.810 2883 337.447 500 0.919 11088 221.782 800 0.941 16643 179.779 {code} I also played around with challenging cases, where the deleted documents happen to be the vectors closest to the query. In the datasets I tried, HNSW seems pretty robust in this situation and maintains good recall. Here's a simple test that exercises it: https://github.com/apache/lucene/pull/527 I plan to close this out. It'd be great to add benchmarks with deletions to luceneutil, but that can be tracked separately. > Handle deletions in nearest vector search > ----------------------------------------- > > Key: LUCENE-10040 > URL: https://issues.apache.org/jira/browse/LUCENE-10040 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Julie Tibshirani > Assignee: Julie Tibshirani > Priority: Major > Time Spent: 5.5h > Remaining Estimate: 0h > > Currently nearest vector search doesn't account for deleted documents. Even > if a document is not in {{LeafReader#getLiveDocs}}, it could still be > returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be > surprising + difficult for users, since other search APIs account for deleted > docs. We've discussed extending the search logic to take a parameter like > {{Bits liveDocs}}. This issue discusses options around adding support. > One approach is to just filter out deleted docs after running the KNN search. > This behavior seems hard to work with as a user: fewer than {{k}} docs might > come back from your KNN search! > Alternatively, {{LeafReader#searchNearestVectors}} could always return the > {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs > while assembling its candidate list. It would traverse further into the > graph, visiting more nodes to ensure it gathers the required candidates. > (Note deleted docs would still be visited/ traversed). The [hnswlib > library|https://github.com/nmslib/hnswlib] contains an implementation like > this, where you can mark documents as deleted and they're skipped during > search. > This approach seems reasonable to me, but there are some challenges: > * Performance can be unpredictable. If deletions are random, it shouldn't > have a huge effect. But in the worst case, a segment could have 50% deleted > docs, and they all happen to be near the query vector. HNSW would need to > traverse through around half the entire graph to collect neighbors. > * As far as I know, there hasn't been academic research or any testing into > how well this performs in terms of recall. I have a vague intuition it could > be harder to achieve high recall as the algorithm traverses areas further > from the "natural" entry points. The HNSW paper doesn't mention deletions/ > filtering, and I haven't seen community benchmarks around it. > Background links: > * Thoughts on deletions from the author of the HNSW paper: > [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] > * Blog from Vespa team which mentions combining KNN and search filters (very > similar to applying deleted docs): > [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. > The "Exact vs Approximate" section shows good performance even when a large > percentage of documents are filtered out. The team mentioned to me they > didn't have the chance to measure recall, only latency. -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org