[ https://issues.apache.org/jira/browse/LUCENE-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531882#comment-17531882 ]
ASF subversion and git services commented on LUCENE-9848: --------------------------------------------------------- Commit 87255c117d853912dcacfb06f2c208681470343d in lucene's branch refs/heads/main from Mayya Sharipova [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=87255c117d8 ] Add change line for LUCENE-9848 > Correctly sort HNSW graph neighbors when applying diversity criterion > ---------------------------------------------------------------------- > > Key: LUCENE-9848 > URL: https://issues.apache.org/jira/browse/LUCENE-9848 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael Sokolov > Assignee: Mayya Sharipova > Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > When indexing new documents in an HNSW graph, we first find its nearest > maxConn neighbors (using HNSW search), and then link the new document to this > neighbors in the graph. These neighbors are filtered using a diversity test. > The neighbors are added one by one, from most similar to least. Each new > neighbor is checked against all prior (better) neighbors, and if it is more > similar to that neighbor than it is to the target document, it is rejected as > insufficiently diverse. > When we applied this diversity criterion (rather than simply picking the k > nearest neighbors), we saw substantial improvements in recall / latency ROC > curves across several data sets, and it is part of the reference > implementation, too (where we got it). I believe the impact on indexing > performance was relatively small; this is a good thing to do, even though it > is n^2 at its heart, the n remains reasonable due to being bounded by the > maximum graph fanout parameter, {{maxConn}}. > Something funny happens when we reach the maximum fanout though. While a new > document is being linked to its new neighbors, the neighbors are reciprocally > linked to the new document, until their maximum fanout is reached. At that > point, the diversity criterion is reapplied to select the neighbors to keep. > Basically every neighbor is re-checked against every earlier (better) > neighbor to verify the diversity criterion. This is needed because we > haven't really maintained the diversity property while adding these > reciprocal links – the initial neighbors are checked for diversity, which > often leads to fewer than {{maxConn}} of them being added. Then the new > documents get linked in without checking, until {{maxConn}} is reached, and > then diversity is checked again. This is kind of weird, but seems to work. > But the really strange thing is that when we reject non-diverse documents (in > HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in > nearness order. I did some rough checks to see if better graphs would result > from re-sorting (so that when there are non-diverse neighbors, we always > prefer to drop the worse-scoring one), but it didn't seem to matter all that > much. But how can that be? > At any rate this code is funky and hard to understand, and it would probably > benefit from a second look to see if we can either improve indexing > performance or improve search performance (by producing better graphs during > indexing). -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org