somandal opened a new pull request, #15427: URL: https://github.com/apache/pinot/pull/15427
## Description This PR enhances the Table Rebalance progress stats. Today the progress stats can be confusing due to the various sections (there are 3), and it is hard to track progress because most of the time only the EV-IS convergence step is updated and this is only calculated for a single step and not for the full rebalance. To understand the progress stats, the code that does the stats updates must be well understood which is not a good assumption and experience for end users. Another concern with the existing stats is that the percentage progress calculations are based off the total segments in the target, rather than based on the actual segments that need to be added / deleted. This makes it harder to assess the progress of the rebalance since not all segments may even need to change. This PR adds the following to the existing progress stats: ``` "rebalanceProgressStatsOverall" : { "totalSegmentsToBeAdded" : 30, "totalSegmentsToBeDeleted" : 30, "totalRemainingSegmentsToBeAdded" : 14, "totalRemainingSegmentsToBeDeleted" : 11, "totalCarryOverSegmentsToBeAdded" : 0, "totalCarryOverSegmentsToBeDeleted" : 0, "totalRemainingSegmentsToConverge": 0, "totalUniqueNewUntrackedSegmentsDuringRebalance" : 10, "percentageTotalSegmentsAddsRemaining" : 46.666666666666664, "percentageTotalSegmentDeletesRemaining" : 36.666666666666664, "estimatedTimeToCompleteAddsInSeconds" : 71.768375, "estimatedTimeToCompleteDeletesInSeconds" : 47.48584210526316, "averageSegmentSizeInBytes" : 35918, "totalEstimatedDataToBeMovedInBytes" : 1077540, "startTimeMs" : 1742319763143 }, "rebalanceProgressStatsCurrentStep" : { "totalSegmentsToBeAdded" : 10, "totalSegmentsToBeDeleted" : 10, "totalRemainingSegmentsToBeAdded" : 4, "totalRemainingSegmentsToBeDeleted" : 1, "totalCarryOverSegmentsToBeAdded" : 0, "totalCarryOverSegmentsToBeDeleted" : 0, "totalRemainingSegmentsToConverge": 0, "totalUniqueNewUntrackedSegmentsDuringRebalance" : 0, "percentageTotalSegmentsAddsRemaining" : 40.0, "percentageTotalSegmentDeletesRemaining" : 10.0, "estimatedTimeToCompleteAddsInSeconds" : 16.923333333333332, "estimatedTimeToCompleteDeletesInSeconds" : 2.8205555555555555, "averageSegmentSizeInBytes" : 35918, "totalEstimatedDataToBeMovedInBytes" : 359180, "startTimeMs" : 1742319819779 }, ``` The overall stats (`rebalanceProgressStatsOverall`) above are updated at each step (`rebalanceProgressStatsCurrentStep`) as well to capture the progress during IS-EV convergence. This is to ensure that users only need to monitor `rebalanceProgressStatsOverall` to find out how much work has already been done. The `rebalanceProgressStatsCurrentStep` can be used for debugging but does not need to be monitored as such. `_totalUniqueNewUntrackedSegmentsDuringRebalance`: this field is added to track new segments that show up in the IdealState but which aren't being tracked as part of the rebalance. This is just informational and can be used to identify how many additional segments got added during the rebalance run. The time estimates are based on how much time has passed and how many segments were already processed and how many are remaining. This may not be a very accurate measure and is not based on the actual data moving since different set ups may have different throughput characteristics. This PR also renames some of the existing stats to make them clearer to understand. ## Interesting Scenario While experimenting with QuickStart and REALTIME tables, I ran into a situation where rebalance would complete successfully, but the stats I added would always show some segment deletions as not yet processed. Digging some more, found the following: https://github.com/apache/pinot/blob/604cd165586716ae04db22856371e50147e30459/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/TableRebalancer.java#L1074 The above only checks for extra assigned instances for given segments if `lowDiskMode` is enabled. So in my scenario, there were some extra instances assigned when I did the stats calculations, but the above check was never performed since `lowDiskMode=false`. The EV-IS convergence was marked as true in this case and the rebalance was completed. When I updated the code to comment out `lowDiskMode &&` from the above if, I saw that eventually the EV does get updated to remove the deleted segments, and the EV-IS do match up exactly, but that needs some additional convergence loops. In this case the stats were correctly updated to reflect the deletions. **As a fix for the above:** updated the code to remove `lowDiskMode &&` so that this convergence loop always runs to ensure deletions are converged for all scenarios and not just for `lowDiskMode`. ## Testing: All testing done with `minAvailableReplicas = -1` ### Test 1 (no delay in segment adds) - Ran HybridQuickStart with 6 servers - Tagged 3 servers with `NewDefaultTenant_OFFLINE` and removed `DefaultTenant_OFFLINE` from these servers - Ran rebalance of two types: - Updated replication of OFFLINE table from 1 -> 3, this got triggered as single step rebalance to add all the new replicas - Updated server tenant of OFFLINE table to `NewDefaultTenant` (triggered as a 3 step rebalance to move 1 replica at a time) ### Test 2 (delay in segment adds/deletes) - Added a random delay between 10 - 30 seconds in the `onBecomeOnlineFromOffline()`, `onBecomeOnlineFromConsuming()`, `onBecomeConsumingFromOffline()` state transition callbacks - Added a random delay between 0 - 30 seconds in the ` onBecomeDroppedFrom*()` state transition callbacks - Ran HybridQuickStart with 6 servers - Tagged 3 servers with `NewDefaultTenant_OFFLINE` and removed `DefaultTenant_OFFLINE` from these servers - Ran rebalance of two types (both were triggered as multi-step rebalances): - Updated replication of OFFLINE table from 1 -> 3, monitored the steps (due to random delay in state transition, could see some segments getting added sooner and others take longer) - Updated server tenant of OFFLINE table to `NewDefaultTenant` and similarly monitored the steps ### Additional testing (with delays): - REALTIME table with and without `includeConsuming` (similar steps as for OFFLINE above) - Scenario where some segments get added to the IdealState in the middle of rebalance (e.g. forceCommit for REALTIME table) to tag some servers as `NewDefaultTenant_REALTIME` and remove `DefaultTenant_REALTIME` tag - Test to force some segments into `ERROR` state, and use `bestEfforts=true` to validate that convergence stats are correctly captured. Also verified convergence stats with `forceCommit` since the existing `CONSUMING` segments change to `ONLINE` and the rebalance correctly waits (and stats are correctly updated) until the state convergence also completes. - Set some segments to `OFFLINE` state in ideal state. Verified that while they are in `OFFLINE` state in ideal state, we skip progress tracking for those. Eventually if they get updated to ONLINE state, we track them again and wait for convergence. - Updated the code to add fields for carry-over segments that didn't converge in the last step. This will mostly occur if `bestEffort=true`, and we are unable to complete convergence within the timeout. The other scenario was with `lowDiskMode` and extra deletions from EV, but we changed the code here to always wait for additional deletions in the EV. The percentage calculations include these carry-over segments. cc @raghavyadav01 @Jackie-Jiang @klsince @J-HowHuang Something went wrong with the branch of my original PR: https://github.com/apache/pinot/pull/15266 This PR is the exact same. I'll close the other one -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org