mayya-sharipova opened a new pull request, #14380:
URL: https://github.com/apache/lucene/pull/14380

   Backport for #14331
   
   Currently when doing merging of HNSW graphs incrementally, we first 
initialize a graph from the biggest segment, and for other segments, we rebuild 
the graphs completely by going through a segment's vector values one by one, 
searching for it in the new graph to find best neighbours to connect it with.
   
   This PR proposes more efficient merging based on the idea if we know where 
we want to insert a node, we have a good idea of where we want to insert its 
neighbours. Similarly to the current approach, we initialize a new graph from 
the biggest segment. For all other segments, we find a smaller set of nodes 
that "covers" their graph, and we insert that set as usual. For other nodes, 
outside of J sets, we do lighter searches with pre-calculated eps.
   
   This allows substantial speedups in merging (up to 2x in force-merge).
   
   The algorithm is based on the following steps:
   
   1. Get all graphs that don't have deletions and sort them by size 
(descending).
   2. Copy the largest graph to the new graph (`gL`).
   3. For each remaining small graph (`gS`):
      - Find the nodes that best cover `gS` (join set `j`). These nodes will be 
inserted into `gL` as usual: by searching `gL` to find the best candidates 
(`w`) to which connect the nodes.
      - For each remaining node in `gS` do "lighter" searches:
        - We provide `eps` to search in `gL`. We form `eps` by the union of the 
node's neighbors in `gS` and the node's neighbors' neighbors in `gL`. We also 
limit `beamWidth` (`efConstruction` ) to `M * 3`.
   
   Algorithm designed by Thomas Veasey
   
   ### Description
   
   <!--
   If this is your first contribution to Lucene, please make sure you have 
reviewed the contribution guide.
   https://github.com/apache/lucene/blob/main/CONTRIBUTING.md
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to