somandal opened a new pull request, #15617: URL: https://github.com/apache/pinot/pull/15617
This PR adds support for server-level batching (i.e. how many segment adds to batch at a server level) in Table rebalance. A new `RebalanceConfig` called `batchSizePerServer` has also been added. By default batching is disabled by setting `batchSizePerServer=Integer.MAX_VALUE`. ### Problem Statement Today Table Rebalance performs rebalance in multiple steps. In each step, for every segment we calculate the `getNextSingleSegmentAssignment` keeping the minAvailableReplicas invariant intact as much as possible, and update the IdealState based on that. If there are a large number of segments and most of them need to be moved as part of rebalance, this can lead to a very large number of changes in the IdealState which translates to a large number of STATE_TRANSITION messages for the servers. This can also cause non-rebalance related STATE_TRANSITION messages to get backed up behind the rebalance related ones and cause delays in processing them. ### Solution To tackle the above, this PR adds batching support at the server-level. Batching works differently for non-strict replica group based and strict replica group based: - **Non-strict replica group:** try to add as many segments as possible based on `getNextSingleSegmentAssignment` without going over `batchSizePerServer` for any server and without splitting up the `getNextSingleSegmentAssignment` into further steps. It tracks the `segmentsAddedSoFar` in a given nextAssignment calculation for each server. It is possible to have fewer than `batchSizePerServer` for some servers if we run out of segments to add without violating the `batchSizePerServer` for any given server. - **Strict replica group:** these need to be moved as a whole partition to keep the consistency invariant. Due to this just choosing segments at random is not possible. Instead the segments are grouped by `partitionId` and all segments of a given `partitionId` are chosen to be moved even if this violates the `batchSizePerServer`. Once this threshold is hit for a given server, no further partitions will be moved that will affect that server. Thus this batching is best efforts only. ### Testing - Added some tests - Tried this out manually and locally in `HybridQuickStart` for an OFFLINE table and forcing a REALTIME table to be assigned with `StrictRealtimeSegmentAssignment` and setup with strict replica group based routing config. (added some delays to processing segment state transitions to watch how the progress is made via progress stats) - Validated that existing rebalance related tests that don't do batching pass as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org