[ 
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454894#comment-17454894
 ] 

Julie Tibshirani commented on LUCENE-10040:
-------------------------------------------

I ran some local tests to sanity check the behavior with deletions. First, I 
ran the glove-100-angular benchmark (described in 
https://issues.apache.org/jira/browse/LUCENE-9937), with efConst=100, M=32, 
force-merged to one segment. I randomly deleted 20% of docs and checked the 
performance. Note: recall is measured against the true nearest neighbors from 
the non-deleted set (I made sure to recompute these true neighbors). The 
performance looks good:

{code}
NumCands  Recall   DistComps  QPS
**No deletions**
10        0.459     505       4985.346
50        0.724    1445       2396.820
80        0.776    2053       1758.134
100       0.797    2437       1506.489
500       0.912    9232        418.167
800       0.934   13862        277.354

**20% deletions**
10        0.488     576      4587.963
50        0.741    1695      1966.009
80        0.789    2424      1220.389
100       0.810    2883       337.447
500       0.919   11088       221.782
800       0.941   16643       179.779
{code}

I also played around with challenging cases, where the deleted documents happen 
to be the vectors closest to the query. In the datasets I tried, HNSW seems 
pretty robust in this situation and maintains good recall. Here's a simple test 
that exercises it: https://github.com/apache/lucene/pull/527

I plan to close this out. It'd be great to add benchmarks with deletions to 
luceneutil, but that can be tracked separately.

> Handle deletions in nearest vector search
> -----------------------------------------
>
>                 Key: LUCENE-10040
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10040
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Assignee: Julie Tibshirani
>            Priority: Major
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be 
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be 
> surprising + difficult for users, since other search APIs account for deleted 
> docs. We've discussed extending the search logic to take a parameter like 
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note deleted docs would still be visited/ traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions/ 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to