somandal opened a new pull request, #15468: URL: https://github.com/apache/pinot/pull/15468
When running table rebalance, we need to specify the `minAvailableReplicas` to keep up as part of the rebalance. Internally we track the number of replicas based on the target assignment only which can be problematic for certain scenarios. For strict replica group based assignment, I found that the rebalance can get into an infinite loop for the following conditions: - Starting replication = 1 - Target replication = 3 - minAvailableReplicas = -1 (or 2) [basically higher than the starting replication] In the above scenario, when doing the assignment for the segments, we hit the following condition: ``` if (availableInstances.size() >= minAvailableReplicas) { ... } else { // New assignment cannot be added, use the current instance state map nextAssignment.put(segmentName, currentInstanceStateMap); return currentAvailableInstances; } ``` Since `availableInstances.size()` is always 1 (since we start with just 1 replica), and `minAvailableReplicas` is 2 based on the configuration passed in to rebalance, we never make progress and keep setting the current assignment for such segments. What I noticed is that for a few of the partitions it does move forward initially, because of this block at the start: ``` if (currentAvailableInstances == null) { // First segment assigned to these instances, use the new assignment and update the available instances nextAssignment.put(segmentName, assignment._instanceStateMap); updateNumSegmentsToOffloadMap(numSegmentsToOffloadMap, currentInstanceStateMap.keySet(), k); return availableInstances; } else { ``` This PR fixes the above by mandating that `minAvailableReplicas` is never higher than the starting replication factor, and if detected that it is higher, it is set to the starting replication factor. It also logs a warning when this scenario is detected. Testing: Added a test case to reproduce this scenario and verified that with the fix the test passes cc @Jackie-Jiang @xiangfu0 @klsince -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org