Jackie-Jiang commented on code in PR #15617: URL: https://github.com/apache/pinot/pull/15617#discussion_r2054996050
########## pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/TableRebalancer.java: ########## @@ -479,6 +482,16 @@ private RebalanceResult doRebalance(TableConfig tableConfig, RebalanceConfig reb externalViewStabilizationTimeoutInMs); int expectedVersion = currentIdealState.getRecord().getVersion(); + // Cache segment partition id to avoid ZK reads. Similar behavior as cache used in StrictReplicaGroupAssignment + // NOTE: + // 1. This cache is used for table rebalance only, but not segment assignment. During rebalance, rebalanceTable() + // can be invoked multiple times when the ideal state changes during the rebalance process. + // 2. The cache won't be refreshed when an existing segment is replaced with a segment from a different partition. + // Replacing a segment with a segment from a different partition should not be allowed for upsert table because + // it will cause the segment being served by the wrong servers. If this happens during the table rebalance, + // another rebalance might be needed to fix the assignment. + Object2IntOpenHashMap<String> segmentPartitionIdMap = new Object2IntOpenHashMap<>(); Review Comment: For my understanding, is this introduced to reduce the overhead of computing the partition id for the same segment multiple times? Does this improvement also apply to the current algorithm? ########## pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/TableRebalancer.java: ########## @@ -598,9 +611,12 @@ private RebalanceResult doRebalance(TableConfig tableConfig, RebalanceConfig reb "Rebalance has stopped already before updating the IdealState", instancePartitionsMap, tierToInstancePartitionsMap, targetAssignment, preChecksResult, summaryResult); } + String partitionColumn = TableConfigUtils.getPartitionColumn(tableConfig); Map<String, Map<String, String>> nextAssignment = getNextAssignment(currentAssignment, targetAssignment, minAvailableReplicas, enableStrictReplicaGroup, Review Comment: Seems most of the new arguments are used to calculate the partition id, consider passing a function (lambda) to simplify this ########## pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotTableRestletResource.java: ########## @@ -624,6 +624,13 @@ public RebalanceResult rebalance( + "more servers.") @DefaultValue("false") @QueryParam("lowDiskMode") boolean lowDiskMode, @ApiParam(value = "Whether to use best-efforts to rebalance (not fail the rebalance when the no-downtime " + "contract cannot be achieved)") @DefaultValue("false") @QueryParam("bestEfforts") boolean bestEfforts, + @ApiParam(value = "How many maximum segment adds per server to update in the IdealState in each step. For " + + "non-strict replica group based assignment, this number will be the closest possible without splitting up " + + "a single segment's step's replicas across steps (so some servers may get fewer segments). For strict " + + "replica group based assignment, this is a per-server best effort value since each partition of a replica " + + "group must be moved as a whole and at least one partition in a replica group should be moved. A value of " + + "Integer.MAX_VALUE is used to indicate an unlimited batch size, which is the non-batching behavior.") + @DefaultValue("2147483647") @QueryParam("batchSizePerServer") int batchSizePerServer, Review Comment: Let's make default `0` or `-1` and treat non-positive value as unlimited -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org