Jackie-Jiang commented on PR #8441: URL: https://github.com/apache/pinot/pull/8441#issuecomment-1087919194
We assign the servers in the following steps:
1. Pick the pools for the table based on the tenant and pool config
2. Apply the constraint to the servers if any
3. Map each replica to a server pool
4. Pick servers from the server pool

Currently all steps are deterministic, and it should be very rare to add/remove pools, so it should be okay to move more segments if the pool count is changed. If we can assume the first 3 steps do not change, then the algorithm can be very straightforward: keep each original server if it still exists in the pool, or replace it with a new server if not. This algorithm should also be implemented in a deterministic way.

If we want to solve the corner case of adding/removing pools, we can save the pool id into the instance partitions for each replica, and keep them fixed during the re-assignment.

Some potential problems with the current approach:
1. For a large cluster, there can be hundreds or even more servers for each pool. Storing them in the instance partitions can add overhead, and can be very hard to debug
2. The overall idea is to optimize the server selection to minimize the movement, so the logic should be applied to the server selection step instead of the pool selection step

> IMO there is no hard requirement for a pool id to be mapped 1:1 1:N to a replica id, right? It's just in current strategy of InstanceReplicaGroupPartitionSelector we assign instances to a replica from one pool. But this should not be enforced for future use, especially right now we are implementing selector with FD awareness and it can have instances from multiple pools in 1 replica group. In other words we should not rely on the status-quo of we can "reverse engineering the pool id from replica group id". So I think the pool -> instance mapping should probably be saved.

We don't rely on reverse engineering, but on a deterministic selection algorithm. Storing the pool -> server mapping can be very costly. Processing it can be costly as well.
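To make the keep-or-replace idea for step 4 concrete, here is a rough sketch, assuming steps 1-3 are unchanged so each replica group still maps to the same pool. `MinimalMovementSelector`, `select`, and the server names are hypothetical and are not Pinot's actual API. Picking replacements in sorted order is just one possible way to keep the selection deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Pinot: illustrates minimal-movement server selection.
final class MinimalMovementSelector {
  private MinimalMovementSelector() {}

  /**
   * @param existing    servers currently assigned to the replica group, in slot order
   * @param poolServers servers currently available in the pool mapped to this replica group
   * @param numServers  number of servers the replica group needs
   */
  static List<String> select(List<String> existing, Set<String> poolServers, int numServers) {
    if (numServers > poolServers.size()) {
      throw new IllegalArgumentException("Not enough servers in the pool");
    }
    // Sorted copy so that replacement picks do not depend on the input set's iteration order.
    TreeSet<String> spare = new TreeSet<>(poolServers);
    List<String> result = new ArrayList<>(numServers);
    // Pass 1: keep an existing server in its slot if it is still in the pool.
    for (int i = 0; i < numServers; i++) {
      String current = i < existing.size() ? existing.get(i) : null;
      // remove() returns false for servers that left the pool (or were already kept).
      result.add(current != null && spare.remove(current) ? current : null);
    }
    // Pass 2: fill the empty slots with unused pool servers in sorted order.
    Iterator<String> replacements = spare.iterator();
    for (int i = 0; i < numServers; i++) {
      if (result.get(i) == null) {
        result.set(i, replacements.next());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // s2 has left the pool; s1 and s3 stay in place and only slot 1 changes.
    System.out.println(select(List.of("s1", "s2", "s3"), Set.of("s1", "s3", "s4", "s5"), 3));
    // prints [s1, s4, s3]
  }
}
```

With this shape, only the slots whose server disappeared from the pool get a new server, so segment movement is limited to those slots, and nothing beyond the existing assignment needs to be stored.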
We may store the replica-group -> pool mapping, but not the individual servers.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org