Michael Sokolov created LUCENE-9695:
---------------------------------------

             Summary: Don't include deleted documents when merging vectors
                 Key: LUCENE-9695
                 URL: https://issues.apache.org/jira/browse/LUCENE-9695
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael Sokolov


While testing HNSW searches with multi-segment indexes, I saw all kinds of 
strange behavior; recall was radically different for a force-merged 
multi-segment index than for the same index built as a single segment. Most 
testing I've done to date has been with single-segment indexes, shame on me.

One issue is that when merging we iterate over all the vectors from 0 .. 
size-1. But this size was being calculated without taking deletions into 
account, which caused deleted vectors to be included in the graph, leading to 
exceptions and weird inconsistencies.
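
To illustrate the first issue, here is a minimal sketch in plain Java. It is 
not the actual Lucene merge code: the class LiveVectorMerge, the method 
liveOrdinals, and the use of java.util.BitSet as a stand-in for a segment's 
live-docs bits are all assumptions made for illustration, and it assumes 
every document carries a vector. The point is only that the set of vectors to 
merge, and therefore its size, should be derived from the live documents 
rather than from the raw document count.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class LiveVectorMerge {

    /**
     * Returns the ordinals of the vectors whose documents are still live.
     * A null liveDocs is treated as "no deletions in this segment".
     */
    static List<Integer> liveOrdinals(int maxDoc, BitSet liveDocs) {
        List<Integer> live = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            // Skip deleted documents so they never reach the merged graph.
            if (liveDocs == null || liveDocs.get(doc)) {
                live.add(doc);
            }
        }
        return live;
    }
}
```

With four documents of which document 1 is deleted, the merged size under 
this sketch is 3, and the iteration 0 .. size-1 then runs over the live 
vectors only.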

The other issue has to do with aliasing in the diverse neighbor selection 
heuristic for graph construction that was introduced recently. Sometimes the 
vectors to be compared would be drawn from the same VectorValues, but this is 
a no-no since they are then the same vector (the first one is overwritten 
when the second one is fetched). This leads to poor results rather than 
errors per se, but the results also became unpredictable in a way that caused 
the test written to reproduce the first issue to fail. Thus I'll include both 
fixes together.
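
The aliasing problem can be reproduced in a small sketch, again in plain Java 
rather than Lucene code. The ReusingReader class below is hypothetical: it 
only mimics a reader that returns the same scratch array from every call, 
which is the behavior described above. Fetching two vectors from one such 
reader makes both references point at the second vector; using a separate 
reader for each side (or copying the first vector) avoids that.

```java
// Hypothetical demonstration, not part of Lucene.
final class VectorAliasing {

    /** Mimics a reader that reuses one scratch buffer for every fetch. */
    static final class ReusingReader {
        private final float[][] data;
        private final float[] scratch;

        ReusingReader(float[][] data) {
            this.data = data;
            this.scratch = new float[data[0].length];
        }

        float[] vectorValue(int ord) {
            // Overwrites whatever the previous call returned.
            System.arraycopy(data[ord], 0, scratch, 0, scratch.length);
            return scratch;
        }
    }

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Buggy: both vectors come from the same reader, so a and b alias. */
    static float aliasedDot(float[][] data, int i, int j) {
        ReusingReader reader = new ReusingReader(data);
        float[] a = reader.vectorValue(i);
        float[] b = reader.vectorValue(j); // a now holds vector j as well
        return dot(a, b);
    }

    /** Safe: each side of the comparison has its own reader and buffer. */
    static float safeDot(float[][] data, int i, int j) {
        ReusingReader left = new ReusingReader(data);
        ReusingReader right = new ReusingReader(data);
        return dot(left.vectorValue(i), right.vectorValue(j));
    }
}
```

For the orthogonal vectors {1, 0} and {0, 1}, safeDot returns 0 while 
aliasedDot returns 1, because the aliased version compares the second vector 
with itself. No exception is thrown, which matches the "poor results, not 
errors" symptom.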



