Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

via GitHub Thu, 08 Aug 2024 05:12:54 -0700


msokolov commented on PR #13566:
URL: https://github.com/apache/lucene/pull/13566#issuecomment-2275670617


   I posted an updated patch that uses an approach that is more aligned with 
how we search. The idea is we don't need to care about strongly-connected 
graphs; we just want to make sure there is a path to every document via search. 
This means every document can be reached from the entry point, which is a 
single node on the top level. So this patch tries to ensure that every node on 
each level is reachable from one of the entry points of that level (which are 
just the nodes on the next higher level).  This is easy to compute and seems to 
correspond to the property we would like to guarantee: no orphaned nodes.
   
   The patch also has a stronger approach to linking up disconnected 
components. Previously we would check the closest N nodes and if none of them 
could be linked (ie they were all full), we would give up. This patch tracks 
all the non-full nodes and uses that bitset as a filter on the search, so if 
there are any, it will find them. It also does not insist on bidirectional 
linking, which isn't necessary to get the property of reachability.
   
   The results do show that this approach does a good (better than before) job 
of linking about disconnected components. You have to look hard to see a 
measurable impact on recall. Latency and indexing time don't seem to be 
affected 
   
   In this test I measured with low values of M in order to get disconnected 
graphs. This test used  1M 256-dim vectors from an AMZ internal data set. I 
forced-merge but did not use concurrent merging (an oversight, it should work). 
A bunch of other tests showed improvements in connectedness, but not any 
significant change in other metrics.  Connectedness reported here is what 
KnnGraphTester prints - it's basically % of nodes reachable from a single node 
(the entryPoint node) on each level. Not exactly the same as the criteria being 
optimized, but it is closely correlated.
   
   ||     test || connectedness          || recall || latency || index time || 
M || beam ||
   candidate   |  1 0.98 0.98 0.97 0.96 1 |  0.860 |     0.49 |    215111   |  
8 |    50 |
   candidate   |  1 1    0.98 0.97 0 .96  1 |  0.862 |     0.49 |    205430   | 
 8 |    50 |
   baseline    |  1 1 0.97 0.93 0.88 0.97  |  0.858 |     0.49 |    203067   |  
8 |    50 |
   baseline    |  1 1 0.97 0.93 0.88 0.97  |  0.854 |     0.47 |    205423   |  
8 |    50 |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] gh-12627: HnswGraphBuilder connects disconnected HNSW graph components [lucene]

Reply via email to