[ 
https://issues.apache.org/jira/browse/LUCENE-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayya Sharipova closed LUCENE-9848.
-----------------------------------

> Correctly sort HNSW graph neighbors when applying diversity criterion 
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-9848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9848
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Assignee: Mayya Sharipova
>            Priority: Major
>             Fix For: 9.2
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When indexing new documents in an HNSW graph, we first find its nearest 
> maxConn neighbors (using HNSW search), and then link the new document to this 
> neighbors in the graph. These neighbors are filtered using a diversity test. 
> The neighbors are added one by one, from most similar to least. Each new 
> neighbor is checked against all prior (better) neighbors, and if it is more 
> similar to that neighbor than it is to the target document, it is rejected as 
> insufficiently diverse.
> When we applied this diversity criterion (rather than simply picking the k 
> nearest neighbors), we saw substantial improvements in recall / latency ROC 
> curves across several data sets, and it is part of the reference 
> implementation, too (where we got it). I believe the impact on indexing 
> performance was relatively small; this is a good thing to do, even though it 
> is n^2 at its heart, the n remains reasonable due to being bounded by the 
> maximum graph fanout parameter, {{maxConn}}. 
> Something funny happens when we reach the maximum fanout though. While a new 
> document is being linked to its new neighbors, the neighbors are reciprocally 
> linked to the new document, until their maximum fanout is reached. At that 
> point, the diversity criterion is reapplied to select the neighbors to keep. 
> Basically every neighbor is re-checked against every earlier (better) 
> neighbor to verify the diversity criterion.  This is needed because we 
> haven't really maintained the diversity property while adding these 
> reciprocal links – the initial neighbors are checked for diversity, which 
> often leads to fewer than {{maxConn}} of them being added. Then the new 
> documents get linked in without checking, until {{maxConn}} is reached, and 
> then diversity is checked again. This is kind of weird, but seems to work.
> But the really strange thing is that when we reject non-diverse documents (in 
> HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in 
> nearness order. I did some rough checks to see if better graphs would result 
> from re-sorting (so that when there are non-diverse neighbors, we always 
> prefer to drop the worse-scoring one), but it didn't seem to matter all that 
> much. But how can that be?
> At any rate this code is funky and hard to understand, and it would probably 
> benefit from a second look to see if we can either improve indexing 
> performance or improve search performance (by producing better graphs during 
> indexing).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to