Jackie-Jiang commented on code in PR #11953:
URL: https://github.com/apache/pinot/pull/11953#discussion_r1423299335
##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/assignment/instance/InstanceReplicaGroupPartitionSelector.java:
##########
@@ -73,16 +76,65 @@ public void selectInstances(Map<Integer, List<InstanceConfig>> poolToInstanceCon
     Map<Integer, List<Integer>> poolToReplicaGroupIdsMap = new TreeMap<>();
     Map<Integer, Integer> replicaGroupIdToPoolMap = new TreeMap<>();
     Map<Integer, Set<String>> poolToCandidateInstancesMap = new TreeMap<>();
-    for (int replicaId = 0; replicaId < numReplicaGroups; replicaId++) {
-      // Pick one pool for each replica-group based on the table name hash
-      int pool = pools.get((tableNameHash + replicaId) % numPools);
-      poolToReplicaGroupIdsMap.computeIfAbsent(pool, k -> new ArrayList<>()).add(replicaId);
-      replicaGroupIdToPoolMap.put(replicaId, pool);
+    Map<String, Integer> instanceToPoolMap = new HashMap<>();
+    for (Map.Entry<Integer, List<InstanceConfig>> entry : poolToInstanceConfigsMap.entrySet()) {
+      Integer pool = entry.getKey();
+      List<InstanceConfig> instanceConfigsInPool = entry.getValue();
+      Set<String> candidateInstances = poolToCandidateInstancesMap.computeIfAbsent(pool, k -> new LinkedHashSet<>());
+      for (InstanceConfig instanceConfig : instanceConfigsInPool) {
+        String instanceName = instanceConfig.getInstanceName();
+        candidateInstances.add(instanceName);
+        instanceToPoolMap.put(instanceName, pool);
+      }
+    }

-      Set<String> candidateInstances =
-          poolToCandidateInstancesMap.computeIfAbsent(pool, k -> new LinkedHashSet<>());
-      List<InstanceConfig> instanceConfigsInPool = poolToInstanceConfigsMap.get(pool);
-      instanceConfigsInPool.forEach(k -> candidateInstances.add(k.getInstanceName()));
+    if (_minimizeDataMovement && _existingInstancePartitions != null) {
+      // Keep the same pool for the replica group if it's already been used for the table.
+      int existingNumPartitions = _existingInstancePartitions.getNumPartitions();
+      int existingNumReplicaGroups = _existingInstancePartitions.getNumReplicaGroups();
+      int numCommonReplicaGroups = Math.min(numReplicaGroups, existingNumReplicaGroups);

Review Comment:
   What I meant is that we didn't really pick the optimal pool to minimize the data movement. Currently we pick a pool as soon as any existing server shows up in it, even if another pool shares more servers with the replica-group. This works well when servers are never moved from one pool to another, but might not return the best pool otherwise. An algorithm that always gives the best pool is to track the shared-server count for every pool and pick the pool with the most shared servers.

   More importantly, the current algorithm could produce a wrong assignment in the following scenario:

   Existing instance partitions: RG0: [s0, s1], RG1: [s2, s3]
   Pools: P0: [s0, s1, s2], P1: [s3, s4, s5]

   The current algorithm will assign both RGs to P0, but P0 doesn't have enough servers to hold both RGs. To make it behave the same way as the default assignment, we need to track the maximum number of RGs assigned to a pool, and stop assigning RGs to a pool once it reaches that max. I'd suggest adding a test to catch such a scenario.
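   A minimal sketch of the approach described above, kept outside the actual InstanceReplicaGroupPartitionSelector code: count the servers each existing replica-group shares with every pool, pick the pool with the highest overlap, and cap how many replica-groups a pool may take so no pool is overloaded. All names here (PoolSelectionSketch, pickPools, maxReplicaGroupsPerPool) are illustrative, not the PR's implementation:

   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;
   import java.util.TreeMap;

   public class PoolSelectionSketch {

     // Pick a pool for each existing replica-group: prefer the pool sharing the most
     // servers with the replica-group, but never assign more replica-groups to a pool
     // than ceil(numReplicaGroups / numPools), mirroring the balance of the default
     // hash-based assignment.
     static Map<Integer, Integer> pickPools(List<Set<String>> existingReplicaGroups,
         Map<Integer, Set<String>> poolToInstancesMap, int numReplicaGroups) {
       int numPools = poolToInstancesMap.size();
       int maxReplicaGroupsPerPool = (numReplicaGroups + numPools - 1) / numPools;
       Map<Integer, Integer> poolToAssignedCount = new HashMap<>();
       Map<Integer, Integer> replicaGroupIdToPoolMap = new TreeMap<>();
       for (int rgId = 0; rgId < existingReplicaGroups.size(); rgId++) {
         Set<String> rgServers = existingReplicaGroups.get(rgId);
         int bestPool = -1;
         int bestOverlap = -1;
         for (Map.Entry<Integer, Set<String>> entry : poolToInstancesMap.entrySet()) {
           int pool = entry.getKey();
           // Skip pools that already hold their maximum share of replica-groups.
           if (poolToAssignedCount.getOrDefault(pool, 0) >= maxReplicaGroupsPerPool) {
             continue;
           }
           // Count servers shared between this replica-group and the pool.
           int overlap = 0;
           for (String server : rgServers) {
             if (entry.getValue().contains(server)) {
               overlap++;
             }
           }
           if (overlap > bestOverlap) {
             bestOverlap = overlap;
             bestPool = pool;
           }
         }
         // Assumes numReplicaGroups <= numPools * maxReplicaGroupsPerPool,
         // so an open pool is always found.
         poolToAssignedCount.merge(bestPool, 1, Integer::sum);
         replicaGroupIdToPoolMap.put(rgId, bestPool);
       }
       return replicaGroupIdToPoolMap;
     }

     public static void main(String[] args) {
       // The scenario from the comment: RG0: [s0, s1], RG1: [s2, s3];
       // P0: [s0, s1, s2], P1: [s3, s4, s5].
       List<Set<String>> existing = List.of(Set.of("s0", "s1"), Set.of("s2", "s3"));
       Map<Integer, Set<String>> pools =
           Map.of(0, Set.of("s0", "s1", "s2"), 1, Set.of("s3", "s4", "s5"));
       // With the cap of one replica-group per pool, RG0 goes to P0 (2 shared servers)
       // and RG1 falls back to P1, so neither pool is overloaded. Prints {0=0, 1=1}.
       System.out.println(pickPools(existing, pools, 2));
     }
   }

   The main method doubles as a sketch of the suggested regression test: without the per-pool cap, RG1 could also land on P0 (it shares s2 with P0), reproducing the wrong assignment described in the comment.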