Yu Ting Su created HBASE-30048:
----------------------------------

             Summary: Regions permanently stuck in OPENING after master restart due to AssignRegionHandler silently skipping OPENED report for already-online region
                 Key: HBASE-30048
                 URL: https://issues.apache.org/jira/browse/HBASE-30048
             Project: HBase
          Issue Type: Bug
          Components: amv2
    Affects Versions: 2.6.2
            Reporter: Yu Ting Su


When the HBase master restarts while region open procedures are in flight, 
regions can become permanently stuck in OPENING state (
Region-In-Transition) with no automatic recovery. The root cause is a race 
condition between master WAL replay and RegionServer 
duplicate procedure handling, with two distinct failure paths depending on 
timing.

The bug mechanism is that during normal operation, the master creates an 
OpenRegionProcedure, dispatches it to a RegionServer, and persists it to the 
procedure WAL in DISPATCH state. On master restart, the new master replays the 
WAL, sets the procedure back to RUNNABLE, and re-
dispatches it to the same RegionServer with the same procedure ID.
Two failure paths exist, depending on whether the original open has completed:
 # Region open still in progress (PID rejection): The RegionServer still has the 
procedure ID in submittedRegionProcedures. It rejects the re-dispatch with 
"already submitted, just ignore it" and sends no report to the new master. If 
the original open eventually completes and reports OPENED, the new master 
receives it and the procedure recovers. However, if the original open fails, 
finishRegionProcedure() is never called on the failure path 
(cleanUpAndReportFailure()), leaving the PID in the map permanently and 
blocking any future retry.
 # Region open already completed (silent drop — primary failure path): The 
original open finished before the master restart. The RegionServer already 
called finishRegionProcedure() (removing the PID) and reported OPENED to the 
old (dead) master. The RS retry loop for reporting OPENED exhausted before the 
new master was ready. When the new master re-dispatches, a new 
AssignRegionHandler is created, but it finds the region already in the online 
regions map and silently returns without reporting OPENED to the new master 
(See AssignRegionHandler.java). The OpenRegionProcedure on the master is 
suspended forever waiting for a report that will never come.
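The PID bookkeeping behind both paths can be sketched as a standalone simulation. This is illustrative code, not HBase's: the method names mirror HRegionServer's submitRegionProcedure()/finishRegionProcedure(), but the map semantics are the point, not the real signatures.

```java
import java.util.concurrent.ConcurrentHashMap;

// Standalone sketch (not HBase code) of the submittedRegionProcedures
// bookkeeping described above.
public class PidDedupSketch {
  // pid -> marker; a pid stays in the map from submit until finish
  private final ConcurrentHashMap<Long, Long> submittedRegionProcedures = new ConcurrentHashMap<>();

  /** Returns true if this pid was not seen before and the handler should run. */
  boolean submitRegionProcedure(long procId) {
    return submittedRegionProcedures.putIfAbsent(procId, procId) == null;
  }

  /** Called on the success path only; the failure path never calls this. */
  void finishRegionProcedure(long procId) {
    submittedRegionProcedures.remove(procId);
  }

  public static void main(String[] args) {
    PidDedupSketch rs = new PidDedupSketch();

    // Path 1: the open is still in flight, so the new master's re-dispatch
    // of the same pid is silently dropped.
    System.out.println(rs.submitRegionProcedure(42L)); // true  (first dispatch)
    System.out.println(rs.submitRegionProcedure(42L)); // false (re-dispatch ignored)
    // If the open then fails without finishRegionProcedure(), pid 42 stays
    // in the map forever and every later retry with that pid is dropped too.

    // Path 2: the open completed and finishRegionProcedure() ran before the
    // master restart, so the re-dispatch is accepted; the stall then happens
    // inside AssignRegionHandler, which sees the region already online and
    // returns without reporting OPENED.
    rs.finishRegionProcedure(42L);
    System.out.println(rs.submitRegionProcedure(42L)); // true (accepted again)
  }
}
```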

I reviewed the code; the root cause is in AssignRegionHandler.java:

 

 
{code:java}
Region onlineRegion = rs.getRegion(encodedName);
if (onlineRegion != null) {
    LOG.warn("Received OPEN for {} which is already online", regionName);
    // Just follow the old behavior, do we need to call reportRegionStateTransition?
    // Maybe not? For normal case, it could happen that the rpc call to schedule
    // this handler is succeeded, but before returning to master the connection is
    // broken. And when master tries again, we have already finished the opening.
    // For this case we do not need to call reportRegionStateTransition any more.
    return;
}
{code}
 

 

The code assumes the master already knows the region is open. This holds when 
the same master retries, but breaks after a master failover — the new master 
never received the OPENED report.

A contributing factor is that OpenRegionProcedure returns false from 
storeInDispatchedQueue(), so the dispatcher has no memory of having already 
sent it across restarts.

So, the design tension is that the master and RegionServer make opposite 
assumptions:
 * Master: "The RegionServer will always report back, so I don't need to track 
dispatch"
 * RegionServer: "If the region is already online, the master must already know"

Neither is wrong in isolation. The bug lives in the interaction between master 
WAL replay and RegionServer state, with no handshake to reconcile after a 
master restart.

Here's the race condition window:
 # The window spans from the old master's death to the new master completing 
initialization. In reproduction, this was ~6 minutes 40 seconds. Any region 
whose open completed and whose RS exhausted its report-OPENED retries during 
this window is permanently stuck.
 # Fast opens are disproportionately affected — slow opens (still in progress 
when the new master comes up) recover naturally because their completion 
reports to the current (new) master.
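The timing argument above can be written down as a small predicate. This is a hedged sketch: the 400s window and the example timestamps are illustrative numbers from this reproduction, not HBase constants.

```java
// Hedged sketch of the stuck-region timing predicate; all numbers are
// illustrative values from this reproduction, not HBase constants.
public class StuckWindowSketch {
  /** A region is stuck iff its open finished before the new master was ready
   *  and its RS gave up reporting OPENED while no live master could hear it. */
  static boolean permanentlyStuck(long openDoneAt, long reportRetriesExhaustedAt,
      long oldMasterDiedAt, long newMasterReadyAt) {
    return openDoneAt < newMasterReadyAt
      && reportRetriesExhaustedAt > oldMasterDiedAt
      && reportRetriesExhaustedAt < newMasterReadyAt;
  }

  public static void main(String[] args) {
    long died = 0, ready = 400; // ~6m40s failover window, in seconds
    // Fast open: done at t=10s, report retries exhausted at t=120s -> stuck.
    System.out.println(permanentlyStuck(10, 120, died, ready));  // true
    // Slow open: completes at t=450s, after the new master is up -> recovers.
    System.out.println(permanentlyStuck(450, 570, died, ready)); // false
  }
}
```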

 

I reproduced the issue as follows:
 # Create a cluster with many regions undergoing assignment (e.g., 
disable/re-enable a large table)
 # Kill the master process while RegionServers are actively opening regions
 # Allow the new master to start and complete initialization
 # Observe regions permanently stuck in OPENING state

 

 
{code:bash}
# Create a table with 50k regions. It took ~2h 40m to complete the region assignment in my testing env.
export nsplits=50000
export TABLE_BUG=usertable
echo "create '${TABLE_BUG}', {NAME => 'cf1'}, {NAME => 'cf2'}, {NAME => 'cf3'}, {NAME => 'cf4'}, {NAME => 'cf5'}, {SPLITS => (1..$nsplits).map {|i| \"user#{10000+i*(99999-10000)/$nsplits}\"}}" | sudo -u hbase hbase shell -n

# Load data via YCSB to generate WALs across all regions; WAL replay on open slows down OpenRegionProcedure
echo "Loading data to generate WAL..."
cd /tmp
[ ! -d ycsb-0.17.0 ] && curl -O --location https://github.com/brianfrankcooper/YCSB/releases/download/0.17.0/ycsb-0.17.0.tar.gz && tar xfvz ycsb-0.17.0.tar.gz
cd ycsb-0.17.0

# The data size grows by ~500GB every 4 hours; after 12 hours there will be ~1.5TB, though compactions may shrink the total.
for cf in cf1 cf2 cf3 cf4 cf5; do
  nohup bash -c '
  bin/ycsb.sh load hbase20 \
    -p table='"${TABLE_BUG}"' -p columnfamily='"${cf}"' \
    -p recordcount=50000000 \
    -p fieldcount=1000 \
    -p fieldlength=5000 \
    -P workloads/workloada -threads 50
  ' > /tmp/ycsb_load-${cf}.log 2>&1 &
done

REGION_COUNT=$(echo "list_regions '${TABLE_BUG}'" | sudo -u hbase hbase shell -n 2>/dev/null | grep -cP '[a-f0-9]{32}')
echo "Table ${TABLE_BUG} has ${REGION_COUNT} regions"

# Disable table — creates CloseRegionProcedures for all regions
echo "Disable ${TABLE_BUG}..."
echo "disable '${TABLE_BUG}'" | sudo -u hbase hbase shell -n &
DISABLE_PID=$!

sleep 5

# Wait for CloseRegion procedures to be dispatched, then kill master
while true; do
  line=$(grep "Initialized subprocedures" /var/log/hbase/hbase-hbase-master-*.log 2>/dev/null | grep CloseRegionProcedure | tail -1)
  if [ -n "$line" ]; then
    ts=$(echo "$line" | grep -oP "^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}" | tr "T" " ")
    echo $ts
    if [ -n "$ts" ]; then
      log_epoch=$(date -d "$ts" +%s 2>/dev/null)
      now_epoch=$(date +%s)
      diff=$((now_epoch - log_epoch))
      if [ "$diff" -ge 0 ] && [ "$diff" -le 10 ]; then
        echo "$(date): CloseRegion dispatched ${diff}s ago — killing HMaster"
        master_pid=$(pgrep -f proc_master)
        [ -n "$master_pid" ] && sudo kill -9 $master_pid
        break
      fi
    fi
  fi
  sleep 1
done
wait $DISABLE_PID 2>/dev/null

echo "Waiting 90s for master to restart..."
sleep 90

sudo systemctl stop hbase-master
sudo -u hdfs hdfs dfs -mkdir -p /user/hbase/WAL/MasterData/backup-WALs/
sudo -u hdfs hdfs dfs -cp /user/hbase/WAL/MasterData/WALs/ /user/hbase/WAL/MasterData/backup-WALs/
sudo -u hdfs hdfs dfs -rm /user/hbase/WAL/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals/
aws s3 cp s3://my/S3/path/hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/proc/ s3://my/S3/path/backup-hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/proc/ --recursive
aws s3 rm s3://my/S3/path/hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/proc/ --recursive
sudo -u hdfs hdfs dfs -ls /user/hbase/WAL/MasterData/WALs/ | grep dead
echo "deleteall /hbase/master" | sudo -u hbase hbase zkcli
sudo systemctl start hbase-master

sudo -u hbase hbase hbck -j /usr/lib/hbase-operator-tools/hbase-hbck2.jar setTableState usertable DISABLED

# Now enable — creates OpenRegionProcedures; some PIDs may collide with in-flight ones
echo "Enabling ${TABLE_BUG}..."
echo "enable '${TABLE_BUG}'" | sudo -u hbase hbase shell -n &
ENABLE_PID=$!

# Wait for OpenRegion procedures to be dispatched, then kill master again
while true; do
  line=$(grep "Initialized subprocedures" /var/log/hbase/hbase-hbase-master-*.log 2>/dev/null | grep OpenRegionProcedure | tail -1)
  if [ -n "$line" ]; then
    ts=$(echo "$line" | grep -oP "^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}" | tr "T" " ")
    echo $ts
    if [ -n "$ts" ]; then
      log_epoch=$(date -d "$ts" +%s 2>/dev/null)
      now_epoch=$(date +%s)
      diff=$((now_epoch - log_epoch))
      if [ "$diff" -ge 0 ] && [ "$diff" -le 10 ]; then
        echo "$(date): OpenRegion dispatched ${diff}s ago — killing HMaster"
        master_pid=$(pgrep -f proc_master)
        [ -n "$master_pid" ] && sudo kill -9 $master_pid
        break
      fi
    fi
  fi
  sleep 1
done
wait $ENABLE_PID 2>/dev/null
{code}
 

 

From the reproduction, I observed the following impact:
 * 11,340 regions permanently stuck in OPENING out of ~41,500 in transition
 * Balancer completely blocked
 * STUCK RIT count plateaued for 12+ minutes with zero recovery
 * 99.8% of stuck regions confirmed "already online" on the RegionServer
 * Zero FAILED_OPEN events — all opens succeeded, reports were lost
 * Requires manual intervention (HBCK2 assigns or another master restart) to 
clear

 

I fixed the issue by updating RSRpcServices.java and AssignRegionHandler.java as follows. The issue went away after deploying the fix.

 
{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
index 0e08c7dc17..8aeba1f90f 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
@@ -3961,6 +3961,17 @@ public class RSRpcServices implements HBaseRPCErrorHandler, AdminService.Blockin
       if (regionServer.submitRegionProcedure(procId)) {
         regionServer.getExecutorService().submit(AssignRegionHandler.create(regionServer,
           regionInfo, procId, tableDesc, masterSystemTime, initiatingMasterActiveTime));
+      } else {
+        // SIM V2117255161: Even though this PID was already submitted or executed, we still need
+        // to check if the region is already online and report OPENED to the master. After a
+        // master restart, the new master re-dispatches procedures from the WAL. If the region
+        // was opened and reported to the old (dead) master, the new master doesn't know.
+        // Submit the handler anyway so it can detect the "already online" case and report back.
+        LOG.info("Submitting AssignRegionHandler for duplicate pid={}, region={} to handle "
+          + "potential already-online region after master restart",
+          procId, regionInfo.getRegionNameAsString());
+        regionServer.getExecutorService().submit(AssignRegionHandler.create(regionServer,
+          regionInfo, procId, tableDesc, masterSystemTime, initiatingMasterActiveTime));
       }
     }
   }
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
index d61f2de30f..395a94b1ef 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
@@ -106,12 +106,26 @@ public class AssignRegionHandler extends EventHandler {
     String regionName = regionInfo.getRegionNameAsString();
     Region onlineRegion = rs.getRegion(encodedName);
     if (onlineRegion != null) {
-      LOG.warn("Received OPEN for {} which is already online", regionName);
-      // Just follow the old behavior, do we need to call reportRegionStateTransition? Maybe not?
-      // For normal case, it could happen that the rpc call to schedule this handler is succeeded,
-      // but before returning to master the connection is broken. And when master tries again, we
-      // have already finished the opening. For this case we do not need to call
-      // reportRegionStateTransition any more.
+      LOG.warn("Received OPEN for {} which is already online, pid={}, "
+        + "initiatingMasterActiveTime={}", regionName, openProcId, initiatingMasterActiveTime);
+      // SIM V2117255161: After a master restart, the new master re-dispatches open procedures
+      // recovered from the WAL. If the region was already opened and reported OPENED to the
+      // old (dead) master, the new master has no record of it. We must report OPENED again
+      // so the new master can complete the procedure. Without this, the region is stuck in
+      // OPENING state forever (permanent RIT).
+      long openSeqNum = onlineRegion.getMaxFlushedSeqId();
+      if (openSeqNum == HConstants.NO_SEQNUM) {
+        openSeqNum = 0;
+      }
+      LOG.info("Reporting OPENED for already-online region {} to master, pid={}, openSeqNum={}",
+        regionName, openProcId, openSeqNum);
+      if (
+        !rs.reportRegionStateTransition(new RegionStateTransitionContext(TransitionCode.OPENED,
+          openSeqNum, openProcId, masterSystemTime, regionInfo, initiatingMasterActiveTime))
+      ) {
+        LOG.warn("Failed to report OPENED for already-online region {} to master, pid={}",
+          regionName, openProcId);
+      }
       return;
     }
     Boolean previous = rs.getRegionsInTransitionInRS().putIfAbsent(encodedNameBytes, Boolean.TRUE);
{code}
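To make the intended post-fix behavior concrete, here is a minimal standalone sketch of the already-online branch. RegionServerLike and handleAlreadyOnline() are hypothetical names for illustration; only the NO_SEQNUM-to-0 fallback and the report-back mirror the patch above.

```java
// Standalone sketch of the fixed already-online branch; the interface and
// method names here are hypothetical, not the real HBase types.
public class FixSketch {
  interface RegionServerLike {
    boolean reportRegionStateTransition(long openSeqNum, long procId);
  }

  /** Instead of returning silently, report OPENED for an already-online region. */
  static boolean handleAlreadyOnline(RegionServerLike rs, long maxFlushedSeqId, long procId) {
    // HConstants.NO_SEQNUM is -1 in HBase; never report a negative seqnum.
    long openSeqNum = maxFlushedSeqId < 0 ? 0 : maxFlushedSeqId;
    return rs.reportRegionStateTransition(openSeqNum, procId);
  }

  public static void main(String[] args) {
    final long[] heard = new long[2]; // what the "master" receives
    RegionServerLike rs = (seq, pid) -> { heard[0] = seq; heard[1] = pid; return true; };
    boolean ok = handleAlreadyOnline(rs, -1L, 42L); // region never flushed
    System.out.println(ok + " seq=" + heard[0] + " pid=" + heard[1]); // true seq=0 pid=42
  }
}
```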



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
