Michael Sokolov created LUCENE-9695:
---------------------------------------
Summary: Don't include deleted documents when merging vectors
Key: LUCENE-9695
URL: https://issues.apache.org/jira/browse/LUCENE-9695
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael Sokolov
While testing HNSW searches with multi-segment indexes, I saw all kinds of
strange behavior; recall for a force-merged multi-segment index was radically
different from recall for the same index built as a single segment. Most
testing I've done to date has been with single-segment indexes, shame on me.
One issue is that when merging we iterate over all the vectors from 0 ..
size-1. But this size was being calculated without taking deletions into
account, so deleted vectors were included in the graph, leading to exceptions
and weird inconsistencies.
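To make the first issue concrete, here is a minimal sketch of skipping deleted
entries when enumerating vectors for a merge. It is not the Lucene code: the
class and method names are hypothetical, a java.util.BitSet stands in for the
live-docs bitmap, and it assumes for simplicity that every document has a
vector, so vector ordinal and doc id coincide.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration only; none of these names come from Lucene.
class LiveVectorOrdinals {

  /**
   * Returns the ordinals in [0, size) whose documents are still live.
   * A null liveDocs is treated as "no deletions".
   * Assumes ordinal == doc id (every document has a vector).
   */
  static int[] liveOrdinals(int size, BitSet liveDocs) {
    int[] out = new int[size];
    int count = 0;
    for (int ord = 0; ord < size; ord++) {
      if (liveDocs == null || liveDocs.get(ord)) {
        out[count++] = ord; // keep only live vectors
      }
    }
    // The merged size is the live count, not the original size.
    return Arrays.copyOf(out, count);
  }

  public static void main(String[] args) {
    BitSet live = new BitSet(5);
    live.set(0, 5);  // docs 0..4 start out live
    live.clear(1);   // doc 1 deleted
    live.clear(3);   // doc 3 deleted
    System.out.println(Arrays.toString(liveOrdinals(5, live))); // prints [0, 2, 4]
  }
}
```

Iterating 0 .. size-1 with the undeleted size would visit ordinals 1 and 3 as
well, which is the kind of mistake described above.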
The other issue has to do with aliasing in the diverse neighbor selection
heuristic for graph construction that was introduced recently. Sometimes the
vectors to be compared would be drawn from the same VectorValues, but this is
a no-no since they are then the same vector (the first one will be overwritten
when the second one is fetched). This leads to poor results rather than errors
per se, but the results also became unpredictable in a way that causes the
test written to reproduce the first issue to fail. Thus I'll include both
fixes together.
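The aliasing problem can be reproduced outside Lucene with a small sketch.
ReusingSource below is a hypothetical stand-in for a vector reader that hands
back the same internal buffer on every call; it is not the VectorValues API.
Reading each operand from its own source instance is one way to avoid the
overwrite (copying the first vector would be another); this message does not
say which approach the patch takes.

```java
// Hypothetical illustration only; none of these names come from Lucene.
class VectorAliasingDemo {

  /** Stand-in for a reader that reuses one scratch buffer across calls. */
  static class ReusingSource {
    private final float[][] data;
    private final float[] scratch;

    ReusingSource(float[][] data) {
      this.data = data;
      this.scratch = new float[data[0].length];
    }

    float[] vectorValue(int ord) {
      System.arraycopy(data[ord], 0, scratch, 0, scratch.length);
      return scratch; // the same array is returned on every call
    }
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Buggy: both operands come from one source, so a and b are the same array. */
  static float aliasedDistance(float[][] data, int i, int j) {
    ReusingSource src = new ReusingSource(data);
    float[] a = src.vectorValue(i);
    float[] b = src.vectorValue(j); // overwrites the contents of a
    return squaredDistance(a, b);   // always 0
  }

  /** Safe: each operand is read from its own source, so the buffers differ. */
  static float safeDistance(float[][] data, int i, int j) {
    ReusingSource left = new ReusingSource(data);
    ReusingSource right = new ReusingSource(data);
    return squaredDistance(left.vectorValue(i), right.vectorValue(j));
  }

  public static void main(String[] args) {
    float[][] data = {{0f, 0f}, {3f, 4f}};
    System.out.println(aliasedDistance(data, 0, 1)); // prints 0.0
    System.out.println(safeDistance(data, 0, 1));    // prints 25.0
  }
}
```

The aliased comparison raises no error; it just reports every pair as
identical, which matches the "poor results, but not errors" symptom.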
--
This message was sent by Atlassian Jira
(v8.3.4#803005)