[
https://issues.apache.org/jira/browse/LUCENE-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531882#comment-17531882
]
ASF subversion and git services commented on LUCENE-9848:
---------------------------------------------------------
Commit 87255c117d853912dcacfb06f2c208681470343d in lucene's branch
refs/heads/main from Mayya Sharipova
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=87255c117d8 ]
Add change line for LUCENE-9848
> Correctly sort HNSW graph neighbors when applying diversity criterion
> ----------------------------------------------------------------------
>
> Key: LUCENE-9848
> URL: https://issues.apache.org/jira/browse/LUCENE-9848
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael Sokolov
> Assignee: Mayya Sharipova
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> When indexing new documents in an HNSW graph, we first find its nearest
> maxConn neighbors (using HNSW search), and then link the new document to this
> neighbors in the graph. These neighbors are filtered using a diversity test.
> The neighbors are added one by one, from most similar to least. Each new
> neighbor is checked against all prior (better) neighbors, and if it is more
> similar to that neighbor than it is to the target document, it is rejected as
> insufficiently diverse.
> When we applied this diversity criterion (rather than simply picking the k
> nearest neighbors), we saw substantial improvements in recall / latency ROC
> curves across several data sets, and it is part of the reference
> implementation, too (where we got it). I believe the impact on indexing
> performance was relatively small; this is a good thing to do, even though it
> is n^2 at its heart, the n remains reasonable due to being bounded by the
> maximum graph fanout parameter, {{maxConn}}.
> Something funny happens when we reach the maximum fanout though. While a new
> document is being linked to its new neighbors, the neighbors are reciprocally
> linked to the new document, until their maximum fanout is reached. At that
> point, the diversity criterion is reapplied to select the neighbors to keep.
> Basically every neighbor is re-checked against every earlier (better)
> neighbor to verify the diversity criterion. This is needed because we
> haven't really maintained the diversity property while adding these
> reciprocal links – the initial neighbors are checked for diversity, which
> often leads to fewer than {{maxConn}} of them being added. Then the new
> documents get linked in without checking, until {{maxConn}} is reached, and
> then diversity is checked again. This is kind of weird, but seems to work.
> But the really strange thing is that when we reject non-diverse documents (in
> HnswGraphBuilder.diversityUpdate), the neighbors are no longer sorted in
> nearness order. I did some rough checks to see if better graphs would result
> from re-sorting (so that when there are non-diverse neighbors, we always
> prefer to drop the worse-scoring one), but it didn't seem to matter all that
> much. But how can that be?
> At any rate this code is funky and hard to understand, and it would probably
> benefit from a second look to see if we can either improve indexing
> performance or improve search performance (by producing better graphs during
> indexing).
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]