mayya-sharipova opened a new pull request, #14380: URL: https://github.com/apache/lucene/pull/14380
Backport for #14331 Currently when doing merging of HNSW graphs incrementally, we first initialize a graph from the biggest segment, and for other segments, we rebuild the graphs completely by going through a segment's vector values one by one, searching for it in the new graph to find best neighbours to connect it with. This PR proposes more efficient merging based on the idea if we know where we want to insert a node, we have a good idea of where we want to insert its neighbours. Similarly to the current approach, we initialize a new graph from the biggest segment. For all other segments, we find a smaller set of nodes that "covers" their graph, and we insert that set as usual. For other nodes, outside of J sets, we do lighter searches with pre-calculated eps. This allows substantial speedups in merging (up to 2x in force-merge). The algorithm is based on the following steps: 1. Get all graphs that don't have deletions and sort them by size (descending). 2. Copy the largest graph to the new graph (`gL`). 3. For each remaining small graph (`gS`): - Find the nodes that best cover `gS` (join set `j`). These nodes will be inserted into `gL` as usual: by searching `gL` to find the best candidates (`w`) to which connect the nodes. - For each remaining node in `gS` do "lighter" searches: - We provide `eps` to search in `gL`. We form `eps` by the union of the node's neighbors in `gS` and the node's neighbors' neighbors in `gL`. We also limit `beamWidth` (`efConstruction` ) to `M * 3`. Algorithm designed by Thomas Veasey ### Description <!-- If this is your first contribution to Lucene, please make sure you have reviewed the contribution guide. https://github.com/apache/lucene/blob/main/CONTRIBUTING.md --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org