balodesecurity commented on PR #8295:
URL: https://github.com/apache/hadoop/pull/8295#issuecomment-3990405906
## Docker Integration Test Results
Tested on a Docker cluster with 1 NameNode and 3 DataNodes (default RF=3), running the `balodesecurity/hadoop` HDFS-17722 branch:
```
--- Scenario 1: Clean decommission (RF=2, decom DN2) ---
[PASS] DN2 decommissioned cleanly (RF=2)
--- Scenario 2: HDFS-17722 — RF=3→2 creates EXCESS, then decom DN2 ---
[PASS] DN2 decommissioned with EXCESS replicas present (HDFS-17722 FIX VERIFIED!)
[PASS] All 3 files accessible after decommission
--- Scenario 3: HDFS-17722 on DN3 (variant) ---
[PASS] DN3 decommissioned with EXCESS replicas (HDFS-17722 fix verified on DN3)
--- Scenario 4: Repeated decom/recommission cycles (3 rounds) ---
[PASS] Round 1: DN2 decommissioned + recommissioned (Normal)
[PASS] Round 2: DN2 decommissioned + recommissioned (Normal)
[PASS] Round 3: DN2 decommissioned + recommissioned (Normal)
--- Scenario 5: Data integrity after decommission ---
[PASS] DN2 decommissioned
[PASS] Data integrity OK: content matches
Results: 0 failure(s) — ALL TESTS PASSED
```
**Note on replicating the bug naturally**: In a single-NameNode setup the race does not occur naturally, because the block manager processes the `setrep` deletions before the decommission check runs in the same thread. The bug is specific to the standby NameNode path. The unit tests in `TestDatanodeAdminManagerIsSufficient` exercise `isSufficient()` directly with the exact replica counts that trigger the deadlock; the Docker tests above verify that normal decommission behavior has not regressed.
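The stuck state in Scenario 2 can be sketched with a toy replica-count model. This is a simplified illustration, not the Hadoop code: the function names, the `Block` class, and the exact counting rules are assumptions about the shape of the check, not the real `isSufficient()` implementation. The idea is that a replica queued as EXCESS still holds valid data, so a check that excludes it from the live count can never be satisfied on a standby NameNode where the deletion is never executed:

```python
# Toy model of the HDFS-17722 sufficiency check -- NOT the Hadoop
# implementation; names and counting rules here are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Block:
    expected_replication: int                     # target RF after `setrep`
    replicas: set = field(default_factory=set)    # DataNodes holding a copy
    excess: set = field(default_factory=set)      # replicas queued for deletion


def is_sufficient_buggy(block, decommissioning):
    # Buggy shape of the check: EXCESS replicas are excluded from the live
    # count. On a standby NameNode the EXCESS deletion is never processed,
    # so the count stays below the target forever and decommission stalls.
    live = block.replicas - decommissioning - block.excess
    return len(live) >= block.expected_replication


def is_sufficient_fixed(block, decommissioning):
    # Fixed shape: replicas pending EXCESS deletion still hold valid data,
    # so they count toward sufficiency while the node is decommissioning.
    available = block.replicas - decommissioning
    return len(available) >= block.expected_replication


# Scenario 2 from the Docker run: RF lowered 3 -> 2, the replica on DN3 is
# marked EXCESS but not yet deleted, then DN2 enters decommission.
blk = Block(expected_replication=2,
            replicas={"DN1", "DN2", "DN3"},
            excess={"DN3"})
decom = {"DN2"}

print(is_sufficient_buggy(blk, decom))   # False -> decommission never finishes
print(is_sufficient_fixed(blk, decom))   # True  -> DN2 can decommission
```

With the buggy counting rule only DN1 is considered live (DN2 is decommissioning, DN3 is excess), which is below the target of 2 no matter how long the check retries; counting DN3's still-present copy lets the check pass.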
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]