yashmayya commented on code in PR #16008: URL: https://github.com/apache/pinot/pull/16008#discussion_r2140531538
########## pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/RebalanceChecker.java: ########## @@ -111,18 +111,35 @@ private synchronized int retryRebalanceTables(Set<String> tableNamesWithType) { @VisibleForTesting void retryRebalanceTable(String tableNameWithType, Map<String, Map<String, String>> allJobMetadata) throws Exception { + // We first check for rebalance jobs that are stuck - i.e., those that are IN_PROGRESS but haven't updated their + // status in ZK within the heartbeat timeout. This could occur if a controller crashes while running a rebalance job + // for instance. These stuck jobs are always aborted here. They will also be retried if it's allowed as per the + // rebalance context. + // // Skip retry for the table if rebalance job is still running or has completed, in specific: // 1) Skip retry if any rebalance job is actively running. Being actively running means the job is at IN_PROGRESS // status, and has updated its status kept in ZK within the heartbeat timeout. It's possible that more than one // rebalance jobs are running for the table, but that's fine with idempotent rebalance algorithm. // 2) Skip retry if the most recently started rebalance job has completed with DONE or NO_OP. It's possible that // jobs started earlier may be still running, but they are ignored here. // - // Otherwise, we can get a list of failed rebalance jobs, i.e. those at FAILED status; or IN_PROGRESS status but - // haven't updated their status kept in ZK within the heartbeat timeout. For those candidate jobs to retry: + // Note that it should be very unlikely to have scenarios where there are more than one rebalance jobs running + // for a table at the same time, or to have a rebalance job that started earlier but completed later than the one + // started most recently since we try to prevent new rebalances from being triggered while a rebalance is in + // progress by checking ZK metadata. Such scenarios can only occur if multiple rebalance jobs are triggered at the + // same time and the second one is started before the first one updates its status in ZK. + // + // If we detect that a retry is required based on the above criteria, we can get a list of failed rebalance jobs, + // i.e. those at FAILED status; or IN_PROGRESS status but haven't updated their status kept in ZK within the + // heartbeat timeout. For those candidate jobs to retry: // 1) Firstly, group them by the original jobIds they retry for so that we can skip those exceeded maxRetry. // 2) For the remaining jobs, we take the one started most recently and retry it with its original configs. // 3) If configured, we can abort the other rebalance jobs for the table by setting their status to FAILED. + + if (hasStuckInProgressJobs(tableNameWithType, allJobMetadata)) { + abortExistingJobs(tableNameWithType, _pinotHelixResourceManager); + } Review Comment: > Is it safe to only abort if stuck IN_PROGRESS jobs are there? Earlier we'd do this irrespective right? The `abortExistingJobs` method basically updates all `IN_PROGRESS` rebalance jobs to `ABORTED` for the table. Earlier, it was only called when the `candidateJobs` map wasn't empty, so it was effectively being called only when there are stuck `IN_PROGRESS` jobs (other cases would be no-ops). > and just wondering if there are any corner cases we'll be missing if we move this upfront instead of aborting it only after the delay check? This shouldn't really affect anything and I'd argue it actually improves the situation. The delay should only be for retry attempts, and stuck rebalance jobs should be aborted as soon as we detect that the heartbeat timeout has elapsed. > Also, it should be safe to do this after we fetch and check for candidateJobs existence? that way we can ensure that the metrics TABLE_REBALANCE_FAILURE_DETECTED is updated if there is a need to retry and we aborted jobs This should happen even now, because candidate jobs include `ABORTED` rebalances. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org