[PR] Table Rebalance: Add support for server-level batching [pinot]

via GitHub Tue, 22 Apr 2025 12:26:59 -0700


somandal opened a new pull request, #15617:
URL: https://github.com/apache/pinot/pull/15617


   This PR adds support for server-level batching (i.e. how many segment adds 
to batch at a server level) in Table rebalance. A new `RebalanceConfig` called 
`batchSizePerServer` has also been added. By default batching is disabled by 
setting `batchSizePerServer=Integer.MAX_VALUE`.
   
   ### Problem Statement
   Today Table Rebalance performs rebalance in multiple steps. In each step, 
for every segment we calculate the `getNextSingleSegmentAssignment` keeping the 
minAvailableReplicas invariant intact as much as possible, and update the 
IdealState based on that. If there are a large number of segments and most of 
them need to be moved as part of rebalance, this can lead to a very large 
number of changes in the IdealState which translates to a large number of 
STATE_TRANSITION messages for the servers. This can also cause non-rebalance 
related STATE_TRANSITION messages to get backed up behind the rebalance related 
ones and cause delays in processing them.
   
   ### Solution
   To tackle the above, this PR adds batching support at the server-level. 
Batching works differently for non-strict replica group based and strict 
replica group based:
   
   - **Non-strict replica group:** try to add as many segments as possible 
based on `getNextSingleSegmentAssignment` without going over 
`batchSizePerServer` for any server and without splitting up the 
`getNextSingleSegmentAssignment` into further steps. It tracks the 
`segmentsAddedSoFar` in a given nextAssignment calculation for each server. It 
is possible to have fewer than `batchSizePerServer` for some servers if we run 
out of segments to add without violating the `batchSizePerServer` for any given 
server.
   - **Strict replica group:** these need to be moved as a whole partition to 
keep the consistency invariant. Due to this just choosing segments at random is 
not possible. Instead the segments are grouped by `partitionId` and all 
segments of a given `partitionId` are chosen to be moved even if this violates 
the `batchSizePerServer`. Once this threshold is hit for a given server, no 
further partitions will be moved that will affect that server. Thus this 
batching is best efforts only.
   
   ### Testing
   - Added some tests
   - Tried this out manually and locally in `HybridQuickStart` for an OFFLINE 
table and forcing a REALTIME table to be assigned with 
`StrictRealtimeSegmentAssignment` and setup with strict replica group based 
routing config. (added some delays to processing segment state transitions to 
watch how the progress is made via progress stats)
   - Validated that existing rebalance related tests that don't do batching 
pass as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[PR] Table Rebalance: Add support for server-level batching [pinot]

Reply via email to