J-HowHuang commented on code in PR #16341:
URL: https://github.com/apache/pinot/pull/16341#discussion_r2216647912


##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/DefaultRebalancePreChecker.java:
##########
@@ -352,35 +351,49 @@ private RebalancePreCheckerResult 
checkRebalanceConfig(RebalanceConfig rebalance
     List<String> segmentsToMove = 
SegmentAssignmentUtils.getSegmentsToMove(currentAssignment, targetAssignment);
 
     int numReplicas = Integer.MAX_VALUE;
-    if (rebalanceConfig.isDowntime() || 
PauselessConsumptionUtils.isPauselessEnabled(tableConfig)) {
+    String peerSegmentDownloadScheme = 
tableConfig.getValidationConfig().getPeerSegmentDownloadScheme();
+    if (rebalanceConfig.isDowntime() || peerSegmentDownloadScheme != null) {
       for (String segment : segmentsToMove) {
         numReplicas = Math.min(targetAssignment.get(segment).size(), 
numReplicas);
       }
     }
 
+    // For non-peer download enabled tables, warn if downtime is enabled but 
numReplicas > 1. Should only use
+    // downtime=true for such tables if downtime is indeed acceptable whereas 
for numReplicas = 1, rebalance cannot
+    // be done without downtime
     if (rebalanceConfig.isDowntime()) {
       if (!segmentsToMove.isEmpty() && numReplicas > 1) {
         pass = false;
         warnings.add("Number of replicas (" + numReplicas + ") is greater than 
1, downtime is not recommended.");
       }
     }
 
-    // It was revealed a risk of data loss for pauseless tables during 
rebalance, when downtime=true or
-    // minAvailableReplicas=0 -- If a segment is being moved and has not yet 
uploaded to deep store, premature
-    // deletion could cause irrecoverable data loss. This pre-check added as a 
workaround to warn the potential risk.
-    // TODO: Get to the root cause of the issue and revisit this pre-check.
-    if (PauselessConsumptionUtils.isPauselessEnabled(tableConfig)) {
+    // Peer download enabled tables may have data loss during rebalance, when 
downtime=true or minAvailableReplicas=0.
+    // The scenario plays out as follows:
+    // 1. If the newly built consuming segment is cannot be uploaded to deep 
store, it may set up the download URI
+    //    as an empty string: ""
+    // 2. When this happens, other servers expect to download the segment from 
a peer server that built the segment or
+    //    has a copy of the segment
+    // 3. With downtime rebalance (or if minAvailableReplicas=0), the IS may 
be updated for all the servers of a given
+    //    segment
+    // 4. The above may lead to dropping the existing segments from the 
existing servers without waiting for the newly
+    //    added servers to download the segment from the peer. In this case 
since a deep store copy does not exist,
+    //    there is no way to recover this segment without manually re-building 
it
+    // Thus, to avoid the above data loss scenario, it is not recommended to 
run downtime rebalance for peer download
+    // enabled tables. This pre-check is added to warn of the potential risk.
+    if (peerSegmentDownloadScheme != null) {
       int minAvailableReplica = rebalanceConfig.getMinAvailableReplicas();
       if (minAvailableReplica < 0) {
         minAvailableReplica = numReplicas + minAvailableReplica;
       }
       if (numReplicas == 1) {

Review Comment:
   No. For example if the original replica was 2, and the new replica is now 1, 
it can never be rebalanced without downtime (i.e. have to either set 
`downtime=true` or `minAvailableReplicas=0`). But not vice versa.
   
   If the original was 1 and changed to 2:
   ```
   Current assignment
     server_1: ONLINE
   
   Target assignment
     server_2: ONLINE
     server_3: ONLINE
   ```
   
   ```
   Step 1
     server_1: ONLINE
     server_2: ONLINE
   
   Step 2
     server_2: ONLINE
     server_3: ONLINE
   ```
   
   There's always one instance unmoved and thus available. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to