msokolov commented on PR #13566: URL: https://github.com/apache/lucene/pull/13566#issuecomment-2275670617
I posted an updated patch that uses an approach that is more aligned with how we search. The idea is we don't need to care about strongly-connected graphs; we just want to make sure there is a path to every document via search. This means every document can be reached from the entry point, which is a single node on the top level. So this patch tries to ensure that every node on each level is reachable from one of the entry points of that level (which are just the nodes on the next higher level). This is easy to compute and seems to correspond to the property we would like to guarantee: no orphaned nodes. The patch also has a stronger approach to linking up disconnected components. Previously we would check the closest N nodes and if none of them could be linked (ie they were all full), we would give up. This patch tracks all the non-full nodes and uses that bitset as a filter on the search, so if there are any, it will find them. It also does not insist on bidirectional linking, which isn't necessary to get the property of reachability. The results do show that this approach does a good (better than before) job of linking about disconnected components. You have to look hard to see a measurable impact on recall. Latency and indexing time don't seem to be affected In this test I measured with low values of M in order to get disconnected graphs. This test used 1M 256-dim vectors from an AMZ internal data set. I forced-merge but did not use concurrent merging (an oversight, it should work). A bunch of other tests showed improvements in connectedness, but not any significant change in other metrics. Connectedness reported here is what KnnGraphTester prints - it's basically % of nodes reachable from a single node (the entryPoint node) on each level. Not exactly the same as the criteria being optimized, but it is closely correlated. || test || connectedness || recall || latency || index time || M || beam || candidate | 1 0.98 0.98 0.97 0.96 1 | 0.860 | 0.49 | 215111 | 8 | 50 | candidate | 1 1 0.98 0.97 0 .96 1 | 0.862 | 0.49 | 205430 | 8 | 50 | baseline | 1 1 0.97 0.93 0.88 0.97 | 0.858 | 0.49 | 203067 | 8 | 50 | baseline | 1 1 0.97 0.93 0.88 0.97 | 0.854 | 0.47 | 205423 | 8 | 50 | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org