yashmayya commented on issue #15683: URL: https://github.com/apache/pinot/issues/15683#issuecomment-2938983837
> - Create a TableRebalance manager class to oversee the creation and management of rebalance jobs on tables
> - Have all calls to create rebalance jobs go through the above manager class, include periodic tasks like SegmentRelocator
> - Track ongoing rebalances and reject rebalance jobs for tables already ongoing rebalance
> - We could potentially also have a thread pool mechanism to limit the number of jobs spawned at a time
> - Enforce that all jobs enable progress stats tracking so that their status can be stored in ZK

https://github.com/apache/pinot/pull/15990 addresses these issues.

<hr>

> We will need to ensure that we can handle scenarios where a controller dies that had ongoing rebalance jobs. These will have a status in ZK, but when a new controller (or the old controller on start-up) identifies this scenario, it should gracefully handle it (i.e. in this scenario it should start a new job even though in ZK there exists a job for the table with IN_PROGRESS status) -> how to detect failed controller scenario and start job vs. avoid starting job since one is already running and controller is up and healthy?

This is already handled by the periodic [RebalanceChecker](https://github.com/apache/pinot/blob/608f89134e9715fa508f2f800c1920d774fe6e52/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/RebalanceChecker.java#L54) controller job. If a controller dies, the leadership for the tables it was previously a leader for will move to a new controller. This new controller will run the `RebalanceChecker` job periodically (every 5 minutes by default) and will try to detect such failed or stuck rebalances for all the tables it is a leader for. If there is a rebalance job whose ZK metadata indicates that it hasn't been updated for more than `heartbeatTimeoutInMs` (rebalance config - defaults to 1 hour), it will be marked as `ABORTED` and a new rebalance will be triggered for the table by the controller (using the same rebalance config).
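
To make the heartbeat check described above concrete, here is a minimal sketch of the detection step. It is not the actual `RebalanceChecker` code: the `StuckRebalanceSketch` class, the `JobMeta` type, and the `isStuck` / `abortStuckJobs` helpers are hypothetical names made up for illustration, and the ZK metadata is modeled as a plain in-memory object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the Pinot codebase.
class StuckRebalanceSketch {
  enum Status { IN_PROGRESS, DONE, FAILED, ABORTED }

  // Stand-in for the rebalance job metadata a controller would read from ZK.
  static final class JobMeta {
    final String jobId;
    final String tableName;
    Status status;
    final long lastUpdatedMs;       // last time the job wrote its progress stats
    final long heartbeatTimeoutMs;  // taken from the job's rebalance config

    JobMeta(String jobId, String tableName, Status status, long lastUpdatedMs,
        long heartbeatTimeoutMs) {
      this.jobId = jobId;
      this.tableName = tableName;
      this.status = status;
      this.lastUpdatedMs = lastUpdatedMs;
      this.heartbeatTimeoutMs = heartbeatTimeoutMs;
    }
  }

  // A job counts as stuck when it is still IN_PROGRESS but has not been
  // updated for longer than its heartbeat timeout.
  static boolean isStuck(JobMeta job, long nowMs) {
    return job.status == Status.IN_PROGRESS
        && nowMs - job.lastUpdatedMs > job.heartbeatTimeoutMs;
  }

  // Marks every stuck job as ABORTED and returns the tables for which the
  // caller should trigger a new rebalance (reusing the same rebalance config).
  static List<String> abortStuckJobs(List<JobMeta> jobs, long nowMs) {
    List<String> tablesToRetry = new ArrayList<>();
    for (JobMeta job : jobs) {
      if (isStuck(job, nowMs)) {
        job.status = Status.ABORTED;
        tablesToRetry.add(job.tableName);
      }
    }
    return tablesToRetry;
  }
}
```

In this sketch a periodic task would call `abortStuckJobs` with the jobs of the tables the controller currently leads, then start a fresh rebalance for each returned table. A healthy job keeps refreshing its metadata, so it never crosses the timeout and is left alone.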
<hr>

> Today we don't clean up ZK job status ZNodes. Thus they can grow indefinitely and we may start hitting the ZNode size limitations. We should periodically clean up older job statuses

This isn't done periodically today, but there's a hard limit of 100 jobs of each type beyond which older ones will be cleaned up - https://github.com/apache/pinot/blob/c15440466a0032c5f74e55940792fb16cd719760/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/PinotHelixResourceManager.java#L2552-L2558

This isn't ideal, though, and we could maybe make this configurable at least.
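
As a rough illustration of what a configurable limit could look like, the sketch below trims the job entries of one job type down to the newest `maxJobs`. It is not the code linked above: `JobRetentionSketch`, `trimToNewest`, and the `maxJobs` parameter are hypothetical, and the per-type job metadata is simplified to a map from job ID to submission time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and data layout are assumptions for illustration.
class JobRetentionSketch {

  // Returns a new map holding at most maxJobs entries, keeping the jobs with
  // the most recent submission times and dropping the older ones.
  static Map<String, Long> trimToNewest(Map<String, Long> submissionTimeByJobId,
      int maxJobs) {
    if (maxJobs < 0) {
      throw new IllegalArgumentException("maxJobs must be non-negative");
    }
    if (submissionTimeByJobId.size() <= maxJobs) {
      return new LinkedHashMap<>(submissionTimeByJobId);
    }
    List<Map.Entry<String, Long>> entries =
        new ArrayList<>(submissionTimeByJobId.entrySet());
    // Sort newest first so the entries to keep form the head of the list.
    entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
    Map<String, Long> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Long> entry : entries.subList(0, maxJobs)) {
      kept.put(entry.getKey(), entry.getValue());
    }
    return kept;
  }
}
```

Passing `100` as `maxJobs` would mirror the hard limit mentioned above, while reading the value from controller configuration would make it tunable per deployment.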