jasperjiaguo commented on PR #8441:
URL: https://github.com/apache/pinot/pull/8441#issuecomment-1088042522

   > We assign the servers in the following steps:
   > 
   > 1. Pick the pools for the table based on the tenant and pool config
   > 2. Apply the constraint to the servers if any
   > 3. Map each replica to a server pool
   > 4. Pick servers from the server pool
   > 
   > Currently all steps are deterministic, and it should be very rare to add/remove pools, so it should be okay to move more segments if the pool count changes. If we can assume the first 3 steps do not change, then the algorithm can be very straightforward: simply keep the original server if it still exists in the pool, or replace it with a new server if not. This algorithm should also be implemented in a deterministic way.
   > 
   > If we want to solve the corner case of adding/removing pools, we can save 
the pool id into the instance partitions for each replica, and keep them fixed 
during the re-assignment.
   > 
   > Some potential problems with the current approach:
   > 
   > 1. For a large cluster, there can be hundreds or even more servers in each pool. Storing them in the instance partitions adds overhead and makes the instance partitions very hard to debug
   > 2. The overall idea is to optimize the server selection to minimize the 
movement, so the logic should be applied to the server selection step instead 
of the pool selection step
   > 
   > > IMO there is no hard requirement for a pool id to be mapped 1:1 or 1:N to a replica id, right? It's just that in the current strategy of InstanceReplicaGroupPartitionSelector we assign instances to a replica from one pool. But this should not be enforced for future use, especially since right now we are implementing a selector with FD awareness, and it can have instances from multiple pools in one replica group.
   > > In other words, we should not rely on the status quo that we can "reverse engineer the pool id from the replica group id". So I think the pool -> instance mapping should probably be saved.
   > 
   > We don't rely on reverse engineering, but on a deterministic selection algorithm. Storing the pool -> server mapping can be very costly, and processing it can be costly as well. We may store the replica-group -> pool mapping, but not the individual servers
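
   As a side note, the minimal-movement idea described in the quote above could be sketched roughly as below (class and method names are hypothetical, not the actual Pinot selector API): keep every original server that still exists in the candidate pool, then fill the vacancies deterministically from the remaining candidates.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, only to illustrate the idea quoted above; not Pinot code.
public class MinimalMovementSelector {

  /**
   * Keeps every existing server that is still present in the candidate pool,
   * then fills the remaining slots from the pool in its (deterministic) order.
   */
  public static List<String> reselect(List<String> existingServers,
      List<String> candidatePool, int numSlots) {
    // LinkedHashSet preserves the pool's deterministic iteration order
    Set<String> remaining = new LinkedHashSet<>(candidatePool);
    List<String> selected = new ArrayList<>(numSlots);

    // Step 1: retain original servers that still exist in the pool
    for (String server : existingServers) {
      if (selected.size() < numSlots && remaining.remove(server)) {
        selected.add(server);
      }
    }

    // Step 2: fill vacancies with the remaining candidates
    for (String server : remaining) {
      if (selected.size() >= numSlots) {
        break;
      }
      selected.add(server);
    }
    return selected;
  }
}
```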
   
   Yes, I agree that ideally we want to rely on the determinism of the selection algorithm to deduce the vacant seats and the pool ids of the down servers, without preserving any state. But IMO storing the RG -> pool mapping is still not enough; take the following case as an example:
   
     24 instances in 5 pools, with [5, 5, 5, 4, 5] servers in each pool, i.e. [s0-s4, s5-s9, s10-s14, s15-s18, s19-s23]
     we want to assign these 24 instances to 3 RGs; an instance selection strategy gives:
     RG0: [s0-s7]
     RG1: [s8-s15]
     RG2: [s16-s23]
   
     this yields an RG -> pool mapping of:
     RG0 -> pool0, pool1
     RG1 -> pool1, pool2, pool3
     RG2 -> pool3, pool4
    
   Taking out one instance from each RG, namely s6 (pool1), s11 (pool2), and s21 (pool4), leaves [5, 4, 4, 4, 4] servers in each pool and [7, 7, 7] servers in each RG.
   Then for InstanceTagPoolSelector/InstanceReplicaGroupPartitionSelector it would be very hard to figure out
   (1) which pool each lost instance belongs to
   (2) where the vacant seats in each pool are
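
   To make the ambiguity concrete, here is a small self-contained sketch (hypothetical demo class, not Pinot code) that reproduces the numbers above: after s6, s11, and s21 go down, the persisted RG -> pool mapping only tells us that each vacancy could be in any of the pools that RG draws from.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class, not Pinot code.
public class RgPoolAmbiguityDemo {

  public static void main(String[] args) {
    // 24 instances in 5 pools: [5, 5, 5, 4, 5] -> s0-s4, s5-s9, s10-s14, s15-s18, s19-s23
    Map<String, Integer> poolOf = new LinkedHashMap<>();
    int[][] poolRanges = {{0, 4}, {5, 9}, {10, 14}, {15, 18}, {19, 23}};
    for (int pool = 0; pool < poolRanges.length; pool++) {
      for (int i = poolRanges[pool][0]; i <= poolRanges[pool][1]; i++) {
        poolOf.put("s" + i, pool);
      }
    }

    // Original assignment: RG0 = s0-s7, RG1 = s8-s15, RG2 = s16-s23
    Map<String, Set<String>> rgMembers = new LinkedHashMap<>();
    rgMembers.put("RG0", range(0, 7));
    rgMembers.put("RG1", range(8, 15));
    rgMembers.put("RG2", range(16, 23));

    // The only persisted state in this scenario: RG -> the pools it draws from
    Map<String, Set<Integer>> rgToPools = new LinkedHashMap<>();
    rgMembers.forEach((rg, members) -> {
      Set<Integer> pools = new TreeSet<>();
      members.forEach(s -> pools.add(poolOf.get(s)));
      rgToPools.put(rg, pools);
    });

    // s6 (pool1), s11 (pool2), and s21 (pool4) go down
    List<String> down = Arrays.asList("s6", "s11", "s21");
    rgMembers.values().forEach(members -> members.removeAll(down));

    // With only the RG -> pool mapping, every pool the RG draws from is a
    // candidate for the vacancy, so the down instance's pool stays ambiguous.
    rgMembers.forEach((rg, survivors) -> System.out.println(
        rg + " (" + survivors.size() + " survivors): vacancy could be in any of pools "
            + rgToPools.get(rg)));
  }

  private static Set<String> range(int from, int to) {
    Set<String> servers = new LinkedHashSet<>();
    for (int i = from; i <= to; i++) {
      servers.add("s" + i);
    }
    return servers;
  }
}
```

   This prints pools {0, 1} for RG0, {1, 2, 3} for RG1, and {3, 4} for RG2, i.e. the stored mapping by itself pins down neither which pool each lost instance belonged to nor where the vacant seat in each pool is.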
   

