Re: [PR] Sort servers in WorkerManager to ensure deterministic workerId <-> server mapping across stages in an MSE query [pinot]

via GitHub Fri, 12 Dec 2025 16:33:43 -0800


Jackie-Jiang commented on code in PR #17342:
URL: https://github.com/apache/pinot/pull/17342#discussion_r2615903697



##########
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java:
##########
@@ -362,6 +362,10 @@ private List<QueryServerInstance> getCandidateServers(DispatchablePlanContext co
     } else {
       candidateServers = getCandidateServersPerTables(context);
     }
+    // Sort to ensure deterministic worker ID assignment across stages.

Review Comment:
   Yeah, this is more for future-proofing.
   If we don't want the singleton case to be deterministic, we can move the sort to the caller side, since the singleton case doesn't need sorting.



##########
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java:
##########
@@ -502,14 +506,27 @@ private void assignWorkersToNonPartitionedLeafFragment(DispatchablePlanMetadata
         metadata.addUnavailableSegments(tableName, routingTable.getUnavailableSegments());
       }
     }
-    int workerId = 0;
+    // Sort server instances to ensure deterministic worker ID assignment.
+    // This is critical for pre-partitioned exchanges where worker ID N on one stage
+    // must map to the same physical server as worker ID N on another stage.
+    List<Map.Entry<ServerInstance, Map<String, List<String>>>> sortedServerInstanceToSegmentsMap =
+        new ArrayList<>(serverInstanceToSegmentsMap.entrySet());
+    sortedServerInstanceToSegmentsMap.sort(Comparator.comparing(entry -> entry.getKey().getInstanceId()));
+
     Map<Integer, QueryServerInstance> workerIdToServerInstanceMap = new HashMap<>();

Review Comment:
   (minor) Not introduced in this PR, but we could pre-size these maps (`Maps.newHashMapWithExpectedSize()`).
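
A minimal sketch of the pre-sizing idea, using only the JDK so it runs without Guava. The helper below is a hypothetical stand-in for `Maps.newHashMapWithExpectedSize()`, not Guava's actual implementation, and the class name and server IDs are made up for illustration. It picks an initial capacity large enough that the expected number of entries fits under `HashMap`'s default 0.75 load factor.

```java
import java.util.HashMap;
import java.util.Map;

final class PreSizedMaps {
  // Hypothetical stand-in for Guava's Maps.newHashMapWithExpectedSize():
  // choose a capacity so that expectedSize entries stay below the default
  // 0.75 load factor and no resize is triggered while filling the map.
  static <K, V> Map<K, V> newHashMapWithExpectedSize(int expectedSize) {
    int initialCapacity = (int) Math.ceil(expectedSize / 0.75);
    return new HashMap<>(Math.max(initialCapacity, 1));
  }

  public static void main(String[] args) {
    // Illustrative server IDs; the real code maps worker IDs to QueryServerInstance.
    String[] servers = {"Server_a", "Server_b", "Server_c"};
    Map<Integer, String> workerIdToServer = newHashMapWithExpectedSize(servers.length);
    for (int workerId = 0; workerId < servers.length; workerId++) {
      workerIdToServer.put(workerId, servers[workerId]);
    }
    System.out.println(workerIdToServer.size());
  }
}
```

In `WorkerManager` the expected size would presumably be the number of entries in the server-to-segments map, since one worker is assigned per server.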



##########
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java:
##########
@@ -502,14 +506,27 @@ private void assignWorkersToNonPartitionedLeafFragment(DispatchablePlanMetadata
         metadata.addUnavailableSegments(tableName, routingTable.getUnavailableSegments());
       }
     }
-    int workerId = 0;
+    // Sort server instances to ensure deterministic worker ID assignment.
+    // This is critical for pre-partitioned exchanges where worker ID N on one stage
+    // must map to the same physical server as worker ID N on another stage.
+    List<Map.Entry<ServerInstance, Map<String, List<String>>>> sortedServerInstanceToSegmentsMap =
+        new ArrayList<>(serverInstanceToSegmentsMap.entrySet());
+    sortedServerInstanceToSegmentsMap.sort(Comparator.comparing(entry -> entry.getKey().getInstanceId()));
+
     Map<Integer, QueryServerInstance> workerIdToServerInstanceMap = new HashMap<>();
     Map<Integer, Map<String, List<String>>> workerIdToSegmentsMap = new HashMap<>();
-    for (Map.Entry<ServerInstance, Map<String, List<String>>> entry : serverInstanceToSegmentsMap.entrySet()) {
-      workerIdToServerInstanceMap.put(workerId, new QueryServerInstance(entry.getKey()));
-      workerIdToSegmentsMap.put(workerId, entry.getValue());
-      workerId++;
+
+    // Assign 1 worker per server
+    for (int workerId = 0; workerId < sortedServerInstanceToSegmentsMap.size(); workerId++) {

Review Comment:
   (nit) Maybe not important for a modern JVM, but I usually cache `sortedServerInstanceToSegmentsMap.size()`. Same for the other place.
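
A self-contained sketch of the pattern under discussion, with the size cached in a local as suggested. It is a simplification under stated assumptions: servers are represented by plain instance ID strings rather than `ServerInstance`, and the class and method names are hypothetical, not Pinot APIs. It shows that sorting the entries before assigning worker IDs gives the same workerId-to-server mapping regardless of the map's iteration order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DeterministicWorkerAssignment {
  // Hypothetical simplification: the key is the server's instance ID string,
  // the value is the list of segments routed to that server.
  static Map<Integer, String> assignWorkerIds(Map<String, List<String>> serverToSegments) {
    List<Map.Entry<String, List<String>>> sorted = new ArrayList<>(serverToSegments.entrySet());
    // Sort by instance ID so the worker ID order does not depend on map iteration order.
    sorted.sort(Map.Entry.comparingByKey());

    // Cache the size once instead of calling size() on every loop iteration.
    int numWorkers = sorted.size();
    Map<Integer, String> workerIdToServer = new HashMap<>();
    // Assign 1 worker per server.
    for (int workerId = 0; workerId < numWorkers; workerId++) {
      workerIdToServer.put(workerId, sorted.get(workerId).getKey());
    }
    return workerIdToServer;
  }

  public static void main(String[] args) {
    // Two stages see the same servers, but in different insertion orders.
    Map<String, List<String>> stageA = new LinkedHashMap<>();
    stageA.put("Server_2", List.of("seg_3"));
    stageA.put("Server_1", List.of("seg_1", "seg_2"));

    Map<String, List<String>> stageB = new LinkedHashMap<>();
    stageB.put("Server_1", List.of("seg_4"));
    stageB.put("Server_2", List.of("seg_5"));

    // Both stages end up with the same workerId -> server mapping.
    System.out.println(assignWorkerIds(stageA).equals(assignWorkerIds(stageB)));
  }
}
```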



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

