jasperjiaguo commented on PR #8441: URL: https://github.com/apache/pinot/pull/8441#issuecomment-1088042522
> We assign the servers in the following steps:
>
> 1. Pick the pools for the table based on the tenant and pool config
> 2. Apply the constraints to the servers, if any
> 3. Map each replica to a server pool
> 4. Pick servers from the server pool
>
> Currently all steps are deterministic, and it should be very rare to add/remove pools, so it should be okay to move more segments if the pool count is changed. If we can assume the first 3 steps do not change, then the algorithm can be very straightforward: simply keep the original server if it still exists in the pool, or replace it with a new server if not. This algorithm should also be implemented in a deterministic way.
>
> If we want to solve the corner case of adding/removing pools, we can save the pool id into the instance partitions for each replica and keep them fixed during the re-assignment.
>
> Some potential problems with the current approach:
>
> 1. For a large cluster, there can be hundreds or even more servers in each pool. Storing them in the instance partitions adds overhead and can be very hard to debug.
> 2. The overall idea is to optimize the server selection to minimize the movement, so the logic should be applied to the server selection step instead of the pool selection step.
>
> > IMO there is no hard requirement for a pool id to be mapped 1:1 (or 1:N) to a replica id, right? It's just that in the current strategy of `InstanceReplicaGroupPartitionSelector` we assign instances to a replica from one pool. But this should not be enforced for future use, especially since right now we are implementing a selector with FD awareness, and a replica group can then have instances from multiple pools.
> >
> > In other words, we should not rely on the status quo that we can "reverse engineer the pool id from the replica group id". So I think the pool -> instance mapping should probably be saved.
>
> We don't rely on reverse engineering, but on a deterministic selection algorithm. Storing the pool -> server mapping can be very costly, and processing it can be costly as well. We may store the replica-group -> pool mapping, but not the individual servers.

Yes, I agree that ideally we want to rely on the determinism of the selection algorithm to deduce the vacant seats and the pool ids of the down servers, without preserving any state. But IMO storing the RG -> pool mapping is still not enough. Take the following case as an example: 24 instances in 5 pools, with [5, 5, 5, 4, 5] servers in each pool, i.e. [s0-s4, s5-s9, s10-s14, s15-s18, s19-s23]. We want to assign these 24 instances to 3 RGs, and an instance selection strategy gives:

- RG0: [s0-s7]
- RG1: [s8-s15]
- RG2: [s16-s23]

This yields an RG -> pool mapping of:

- RG0 -> pool0, pool1
- RG1 -> pool1, pool2, pool3
- RG2 -> pool3, pool4

Taking out 3 instances, s6 (pool1), s11 (pool2), and s21 (pool4), leaves [5, 4, 4, 4, 4] servers in each pool and [7, 7, 7] servers in each RG. Then it would be very hard for `InstanceTagPoolSelector`/`InstanceReplicaGroupPartitionSelector` to figure out (1) to which pool each lost instance belongs and (2) where the vacant seats are in each pool.
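To make the numbers above concrete, here is a small self-contained Java sketch (not Pinot code; the class and variable names are made up for illustration) that builds the [5, 5, 5, 4, 5] pools, assigns the 24 instances sequentially to 3 RGs of 8, derives the RG -> pool mapping, and then removes s6, s11, s21:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Standalone illustration of the example in this comment, not Pinot's actual selector logic.
public class RgPoolMappingExample {
  public static void main(String[] args) {
    int[] poolSizes = {5, 5, 5, 4, 5};

    // instance -> pool id, following the layout [s0-s4, s5-s9, s10-s14, s15-s18, s19-s23]
    Map<String, Integer> instanceToPool = new LinkedHashMap<>();
    int id = 0;
    for (int pool = 0; pool < poolSizes.length; pool++) {
      for (int i = 0; i < poolSizes[pool]; i++) {
        instanceToPool.put("s" + id++, pool);
      }
    }

    // Sequential assignment into 3 RGs of 8: RG0=[s0-s7], RG1=[s8-s15], RG2=[s16-s23]
    int numReplicaGroups = 3;
    List<String> instances = new ArrayList<>(instanceToPool.keySet());
    int perRg = instances.size() / numReplicaGroups;
    List<List<String>> replicaGroups = new ArrayList<>();
    for (int rg = 0; rg < numReplicaGroups; rg++) {
      replicaGroups.add(new ArrayList<>(instances.subList(rg * perRg, (rg + 1) * perRg)));
    }

    // RG -> pool mapping: RG0 -> {0, 1}, RG1 -> {1, 2, 3}, RG2 -> {3, 4}
    for (int rg = 0; rg < numReplicaGroups; rg++) {
      Set<Integer> pools = new TreeSet<>();
      for (String instance : replicaGroups.get(rg)) {
        pools.add(instanceToPool.get(instance));
      }
      System.out.println("RG" + rg + " -> pools " + pools);
    }

    // Take out s6 (pool1), s11 (pool2), s21 (pool4): pools become [5, 4, 4, 4, 4],
    // every RG drops to 7 instances, and each RG still spans the same set of pools.
    Set<String> removed = Set.of("s6", "s11", "s21");
    for (int rg = 0; rg < numReplicaGroups; rg++) {
      List<String> alive = new ArrayList<>(replicaGroups.get(rg));
      alive.removeAll(removed);
      System.out.println("RG" + rg + " alive count: " + alive.size());
    }
  }
}
```

After the removal, each RG still maps to the same pool set while holding 7 instances, which is why the stored RG -> pool mapping alone cannot tell which pool a vacant seat came from.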