Michael Sokolov created LUCENE-9848:
---------------------------------------

             Summary: Correctly sort HNSW graph neighbors when applying 
diversity criterion 
                 Key: LUCENE-9848
                 URL: https://issues.apache.org/jira/browse/LUCENE-9848
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael Sokolov


When indexing new documents in an HNSW graph, we first find its nearest maxConn 
neighbors (using HNSW search), and then link the new document to this neighbors 
in the graph. These neighbors are filtered using a diversity test. The 
neighbors are added one by one, from most similar to least. Each new neighbor 
is checked against all prior (better) neighbors, and if it is more similar to 
that neighbor than it is to the target document, it is rejected as 
insufficiently diverse.

When we applied this diversity criterion (rather than simply picking the k 
nearest neighbors), we saw substantial improvements in recall / latency ROC 
curves across several data sets, and it is part of the reference 
implementation, too (where we got it). I believe the impact on indexing 
performance was relatively small; this is a good thing to do, even though it is 
n^2 at its heart, the n remains reasonable due to being bounded by the maximum 
graph fanout parameter, {{maxConn}}. 

Something funny happens when we reach the maximum fanout though. While a new 
document is being linked to its new neighbors, the neighbors are reciprocally 
linked to the new document, until their maximum fanout is reached. At that 
point, the diversity criterion is reapplied to select the neighbors to keep. 
Basically every neighbor is re-checked against every earlier (better) neighbor 
to verify the diversity criterion.  This is needed because we haven't really 
maintained the diversity property while adding these reciprocal links – the 
initial neighbors are checked for diversity, which often leads to fewer than 
{{maxConn}} of them being added. Then the new documents get linked in without 
checking, until {{maxConn}} is reached, and then diversity is checked again. 
This is kind of weird, but seems to work.

But the really strange thing is that when we reject non-diverse documents (in 
HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in 
nearness order. I did some rough checks to see if better graphs would result 
from re-sorting (so that when there are non-diverse neighbors, we always prefer 
to drop the worse-scoring one), but it didn't seem to matter all that much. But 
how can that be?

At any rate this code is funky and hard to understand, and it would probably 
benefit from a second look to see if we can either improve indexing performance 
or improve search performance (by producing better graphs during indexing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to