somandal commented on code in PR #16008: URL: https://github.com/apache/pinot/pull/16008#discussion_r2140607980
########## pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/RebalanceChecker.java: ########## @@ -111,18 +111,35 @@ private synchronized int retryRebalanceTables(Set<String> tableNamesWithType) { @VisibleForTesting void retryRebalanceTable(String tableNameWithType, Map<String, Map<String, String>> allJobMetadata) throws Exception { + // We first check for rebalance jobs that are stuck - i.e., those that are IN_PROGRESS but haven't updated their + // status in ZK within the heartbeat timeout. This could occur if a controller crashes while running a rebalance job + // for instance. These stuck jobs are always aborted here. They will also be retried if it's allowed as per the + // rebalance context. + // // Skip retry for the table if rebalance job is still running or has completed, in specific: // 1) Skip retry if any rebalance job is actively running. Being actively running means the job is at IN_PROGRESS // status, and has updated its status kept in ZK within the heartbeat timeout. It's possible that more than one // rebalance jobs are running for the table, but that's fine with idempotent rebalance algorithm. // 2) Skip retry if the most recently started rebalance job has completed with DONE or NO_OP. It's possible that // jobs started earlier may be still running, but they are ignored here. // - // Otherwise, we can get a list of failed rebalance jobs, i.e. those at FAILED status; or IN_PROGRESS status but - // haven't updated their status kept in ZK within the heartbeat timeout. For those candidate jobs to retry: + // Note that it should be very unlikely to have scenarios where there are more than one rebalance jobs running + // for a table at the same time, or to have a rebalance job that started earlier but completed later than the one + // started most recently since we try to prevent new rebalances from being triggered while a rebalance is in + // progress by checking ZK metadata. Such scenarios can only occur if multiple rebalance jobs are triggered at the + // same time and the second one is started before the first one updates its status in ZK. + // + // If we detect that a retry is required based on the above criteria, we can get a list of failed rebalance jobs, + // i.e. those at FAILED status; or IN_PROGRESS status but haven't updated their status kept in ZK within the + // heartbeat timeout. For those candidate jobs to retry: // 1) Firstly, group them by the original jobIds they retry for so that we can skip those exceeded maxRetry. // 2) For the remaining jobs, we take the one started most recently and retry it with its original configs. // 3) If configured, we can abort the other rebalance jobs for the table by setting their status to FAILED. + + if (hasStuckInProgressJobs(tableNameWithType, allJobMetadata)) { + abortExistingJobs(tableNameWithType, _pinotHelixResourceManager); + } Review Comment: got it, makes sense! let's make sure we test out a bunch of scenarios once this code is in (e.g. kill the controller, have some combination of aborted / in progress jobs / user triggered etc) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org