[ https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Benedetti updated LUCENE-10040: ------------------------------------------ Labels: vector-based-search (was: ) > Handle deletions in nearest vector search > ----------------------------------------- > > Key: LUCENE-10040 > URL: https://issues.apache.org/jira/browse/LUCENE-10040 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Julie Tibshirani > Assignee: Julie Tibshirani > Priority: Major > Labels: vector-based-search > Fix For: 9.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > Currently nearest vector search doesn't account for deleted documents. Even > if a document is not in {{LeafReader#getLiveDocs}}, it could still be > returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be > surprising + difficult for users, since other search APIs account for deleted > docs. We've discussed extending the search logic to take a parameter like > {{Bits liveDocs}}. This issue discusses options around adding support. > One approach is to just filter out deleted docs after running the KNN search. > This behavior seems hard to work with as a user: fewer than {{k}} docs might > come back from your KNN search! > Alternatively, {{LeafReader#searchNearestVectors}} could always return the > {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs > while assembling its candidate list. It would traverse further into the > graph, visiting more nodes to ensure it gathers the required candidates. > (Note deleted docs would still be visited/ traversed). The [hnswlib > library|https://github.com/nmslib/hnswlib] contains an implementation like > this, where you can mark documents as deleted and they're skipped during > search. > This approach seems reasonable to me, but there are some challenges: > * Performance can be unpredictable. If deletions are random, it shouldn't > have a huge effect. But in the worst case, a segment could have 50% deleted > docs, and they all happen to be near the query vector. HNSW would need to > traverse through around half the entire graph to collect neighbors. > * As far as I know, there hasn't been academic research or any testing into > how well this performs in terms of recall. I have a vague intuition it could > be harder to achieve high recall as the algorithm traverses areas further > from the "natural" entry points. The HNSW paper doesn't mention deletions/ > filtering, and I haven't seen community benchmarks around it. > Background links: > * Thoughts on deletions from the author of the HNSW paper: > [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] > * Blog from Vespa team which mentions combining KNN and search filters (very > similar to applying deleted docs): > [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. > The "Exact vs Approximate" section shows good performance even when a large > percentage of documents are filtered out. The team mentioned to me they > didn't have the chance to measure recall, only latency. -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org