somandal commented on PR #15266: URL: https://github.com/apache/pinot/pull/15266#issuecomment-2779453530
Just capturing a scenario here that I have been debugging lately: Scenario: - When we try to update the IdealState in the rebalance code, if the node version changed an exception is thrown and we try to redo the whole rebalance loop - The current code updates the stats prior to the IdealState update (ideal state trigger and next assignment trigger) - If we are unable to update the IdealState we should rollback the stats since they will be recalculated on the next iteration anyways. - The older stats are also susceptible to this problem Impact of not rolling back: the stats can show more progress has been made than the reality for that iteration until the IS update happens. The IS update will reset it correctly, but until then the stats will be incorrect. Also, the above problem is more problematic if bestEfforts=true because we may not wait for EV-IS full convergence before moving to the next step. Due to this, when we loop back after IS update fails, we get into the EV-IS convergence loop again and it looks like we are making progress, but we’re actually making progress on the last step updates that hadn't completed. I’m thinking of adding the above rollback logic in a follow up PR (my plan is to just cache the previous versions of the stats and use that for rollback) to avoid over-complicating this one. Let me know if that is fine with you folks @klsince @raghavyadav01 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org