[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455949#comment-17455949 ]
ASF subversion and git services commented on LUCENE-10040:
-----------------------------------------------------------

Commit 394472d4b8e40504f0521df340df446089a7afff in lucene's branch refs/heads/branch_9x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=394472d ]

LUCENE-10040: Add test for vector search with skewed deletions (#527)

This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.

> Handle deletions in nearest vector search
> -----------------------------------------
>
>                 Key: LUCENE-10040
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10040
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Assignee: Julie Tibshirani
>            Priority: Major
>             Fix For: 9.0
>
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even if a document is not in {{LeafReader#getLiveDocs}}, it could still be returned from {{LeafReader#searchNearestVectors}}. This would be surprising and difficult for users, since other search APIs account for deleted docs. We've discussed extending the search logic to take a parameter like {{Bits liveDocs}}. This issue discusses options for adding that support.
>
> One approach is to simply filter out deleted docs after running the KNN search. This behavior seems hard to work with as a user: fewer than {{k}} docs might come back from your KNN search!
>
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs while assembling its candidate list. It would traverse further into the graph, visiting more nodes to ensure it gathers the required candidates. (Note that deleted docs would still be visited/traversed; see the sketch at the end of this message.) The [hnswlib library|https://github.com/nmslib/hnswlib] contains an implementation like this, where you can mark documents as deleted and they're skipped during search.
>
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, they shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs that all happen to be near the query vector, and HNSW would need to traverse through around half of the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or testing into how well this performs in terms of recall. I have a vague intuition it could be harder to achieve high recall as the algorithm traverses areas further from the "natural" entry points. The HNSW paper doesn't mention deletions/filtering, and I haven't seen community benchmarks around it.
>
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from the Vespa team which mentions combining KNN and search filters (very similar to applying deleted docs): [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. The "Exact vs Approximate" section shows good performance even when a large percentage of documents are filtered out. The team mentioned to me they didn't have the chance to measure recall, only latency.
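For illustration, here is a minimal, self-contained sketch of the "omit deleted docs while assembling the candidate list" idea described above. It is not Lucene's actual HNSW implementation: {{Graph}}, {{Scorer}}, and the local {{Bits}} interface are hypothetical stand-ins (only the {{Bits#get}} contract mirrors {{org.apache.lucene.util.Bits}}), and the traversal is a single-layer best-first search rather than the full multi-layer HNSW algorithm.

{code:java}
import java.util.BitSet;
import java.util.PriorityQueue;

final class DeletionAwareGraphSearch {

  /** Stand-in for a single graph layer (hypothetical, for illustration only). */
  interface Graph {
    int entryNode();
    int[] neighbors(int node);
  }

  /** Stand-in for a query/vector similarity function; higher score = closer. */
  interface Scorer {
    float score(int node);
  }

  /** Mirrors the org.apache.lucene.util.Bits contract: true = live, false = deleted. */
  interface Bits {
    boolean get(int index);
  }

  private static final class Candidate {
    final int node;
    final float score;
    Candidate(int node, float score) { this.node = node; this.score = score; }
  }

  /** Best-first search that traverses deleted nodes but only collects live ones. */
  static int[] search(Graph graph, Scorer scorer, Bits liveDocs, int k) {
    // Max-heap of nodes to expand next, best score first.
    PriorityQueue<Candidate> frontier =
        new PriorityQueue<>((a, b) -> Float.compare(b.score, a.score));
    // Min-heap holding the k best *live* nodes collected so far.
    PriorityQueue<Candidate> results =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    BitSet visited = new BitSet();

    int entry = graph.entryNode();
    frontier.add(new Candidate(entry, scorer.score(entry)));
    visited.set(entry);

    while (!frontier.isEmpty()) {
      Candidate c = frontier.poll();
      // Usual early termination: stop once the best remaining candidate
      // cannot improve on the worst of the k live results collected so far.
      if (results.size() >= k && c.score < results.peek().score) {
        break;
      }
      // Deleted docs are still visited and expanded, but never collected.
      if (liveDocs == null || liveDocs.get(c.node)) {
        results.add(c);
        if (results.size() > k) {
          results.poll();
        }
      }
      for (int neighbor : graph.neighbors(c.node)) {
        if (!visited.get(neighbor)) {
          visited.set(neighbor);
          frontier.add(new Candidate(neighbor, scorer.score(neighbor)));
        }
      }
    }

    // Drain the min-heap into best-first order.
    int[] topK = new int[results.size()];
    for (int i = topK.length - 1; i >= 0; i--) {
      topK[i] = results.poll().node;
    }
    return topK;
  }
}
{code}

The key difference from a plain post-filter is inside the loop: a deleted node still expands its neighbors, so the search keeps exploring until it has gathered {{k}} live docs (or exhausts the frontier), and only then does the usual early-termination check kick in. This is also why skewed deletions near the query vector can force the search to visit far more of the graph, which is exactly the case the committed test exercises.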