Yu Ting Su created HBASE-30048:
----------------------------------
Summary: Regions permanently stuck in OPENING after master restart
due to AssignRegionHandler silently skipping OPENED report for already-online
region
Key: HBASE-30048
URL: https://issues.apache.org/jira/browse/HBASE-30048
Project: HBase
Issue Type: Bug
Components: amv2
Affects Versions: 2.6.2
Reporter: Yu Ting Su
When the HBase master restarts while region open procedures are in flight,
regions can become permanently stuck in OPENING state (
Region-In-Transition) with no automatic recovery. The root cause is a race
condition between master WAL replay and RegionServer
duplicate procedure handling, with two distinct failure paths depending on
timing.
The bug mechanism is that during normal operation, the master creates an
OpenRegionProcedure, dispatches it to a RegionServer, and persists it to the
procedure WAL in DISPATCH state. On master restart, the new master replays the
WAL, sets the procedure back to RUNNABLE, and re-
dispatches it to the same RegionServer with the same procedure ID.
Two failure paths exist depending on whether the original open has completed:
# Region open still in progress (PID rejection): The RegionServer still has the
procedure ID in submittedRegionProcedures. It rejects the re-dispatch with
"already submitted, just ignore it" and sends no report to the new master. If
the original open eventually completes and reports OPENED, the new master
receives it and the procedure recovers. However, if the original open fails,
finishRegionProcedure() is never called on the failure path
(cleanUpAndReportFailure()), leaving the PID in the map permanently and
blocking any future retry.
# Region open already completed (silent drop — primary failure path): The
original open finished before the master restart. The RegionServer already
called finishRegionProcedure() (removing the PID) and reported OPENED to the
old (dead) master. The RS retry loop for reporting OPENED exhausted before the
new master was ready. When the new master re-dispatches, a new
AssignRegionHandler is created, but it finds the region already in the online
regions map and silently returns without reporting OPENED to the new master
(See AssignRegionHandler.java). The OpenRegionProcedure on the master is
suspended forever waiting for a report that will never come.
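To make the PID bookkeeping in failure path 1 concrete, here is a minimal stand-alone sketch. The method names mirror the ones discussed above, but the class and implementation are simplified illustrations, not HBase's actual code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: submitRegionProcedure() must be paired with finishRegionProcedure(),
// otherwise the PID stays in the set and every later dispatch with the same
// PID is rejected, which is the permanent-block described in failure path 1.
public class PidTrackingSketch {
  private final Set<Long> submittedRegionProcedures = ConcurrentHashMap.newKeySet();

  /** Returns true only on the first submission of this PID (the dedup check). */
  boolean submitRegionProcedure(long procId) {
    return submittedRegionProcedures.add(procId);
  }

  /** Called on the success path; the bug is that the failure path skips this. */
  void finishRegionProcedure(long procId) {
    submittedRegionProcedures.remove(procId);
  }

  public static void main(String[] args) {
    PidTrackingSketch rs = new PidTrackingSketch();
    System.out.println(rs.submitRegionProcedure(100L)); // true: first dispatch accepted
    System.out.println(rs.submitRegionProcedure(100L)); // false: re-dispatch rejected
    // If the failure path never calls finishRegionProcedure(100L), any future
    // retry with the same PID is rejected forever:
    System.out.println(rs.submitRegionProcedure(100L)); // false: stuck
  }
}
```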
I reviewed the code; the root cause is in AssignRegionHandler.java:
{code:java}
Region onlineRegion = rs.getRegion(encodedName);
if (onlineRegion != null) {
  LOG.warn("Received OPEN for {} which is already online", regionName);
  // Just follow the old behavior, do we need to call reportRegionStateTransition?
  // Maybe not? For normal case, it could happen that the rpc call to schedule
  // this handler is succeeded, but before returning to master the connection is
  // broken. And when master tries again, we have already finished the opening.
  // For this case we do not need to call reportRegionStateTransition any more.
  return;
}
{code}
The code assumes the master already knows the region is open. This holds when
the same master retries, but breaks after a master failover — the new master
never received the OPENED report.
A contributing factor is that OpenRegionProcedure returns false from
storeInDispatchedQueue(), so the dispatcher has no memory of having already
sent it across restarts.
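What the missing dispatch memory means in practice can be shown with a minimal sketch (the class is hypothetical; only the storeInDispatchedQueue() behavior it illustrates comes from the ticket). Because the dispatched set is in-memory only, a restarted master rebuilds it empty and sends the same PID again:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a dispatcher whose "already dispatched" memory is not persisted,
// which is the effect of storeInDispatchedQueue() returning false.
public class DispatchMemorySketch {
  private final Set<Long> dispatched = new HashSet<>(); // lost on restart

  /** Returns true if this PID has not been dispatched by THIS master instance. */
  boolean shouldDispatch(long procId) {
    return dispatched.add(procId);
  }

  public static void main(String[] args) {
    DispatchMemorySketch oldMaster = new DispatchMemorySketch();
    System.out.println(oldMaster.shouldDispatch(100L)); // true: first dispatch

    // Master restart: WAL replay re-creates the procedure, but the dispatched
    // set starts empty, so the same PID goes out a second time.
    DispatchMemorySketch newMaster = new DispatchMemorySketch();
    System.out.println(newMaster.shouldDispatch(100L)); // true: duplicate dispatch
  }
}
```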
The design tension is that the master and RegionServer make opposite
assumptions:
* Master: "The RegionServer will always report back, so I don't need to track
dispatch"
* RegionServer: "If the region is already online, the master must already know"
Neither is wrong in isolation. The bug lives in the interaction between master
WAL replay and RegionServer state, with no handshake to reconcile after a
master restart.
Here's the race condition window.
# The window spans from the old master's death to the new master completing
initialization. In reproduction, this was ~6 minutes 40 seconds. Any region
whose open completed and whose RS exhausted its report-OPENED retries during
this window is permanently stuck.
# Fast opens are disproportionately affected — slow opens (still in progress
when the new master comes up) recover naturally because their completion
reports to the current (new) master.
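The timing can be made concrete with a hedged sketch of a bounded report-OPENED retry loop on the RS side; the retry counts and intervals below are illustrative, not HBase's actual configuration. If the new master only becomes ready after the retry budget is spent, the report is lost for good:

```java
// Sketch: an RS report loop with a fixed retry budget. Reports that would
// land inside the master-restart window are simply dropped.
public class ReportRetrySketch {
  /**
   * @param maxRetries      report attempts before giving up
   * @param retryIntervalMs simulated wait between attempts
   * @param masterReadyAtMs time (ms after the old master died) when the new
   *                        master can accept reports
   * @return true if some attempt landed after the master was ready
   */
  static boolean reportOpened(int maxRetries, long retryIntervalMs, long masterReadyAtMs) {
    long now = 0;
    for (int i = 0; i < maxRetries; i++) {
      if (now >= masterReadyAtMs) {
        return true; // report accepted by the new master
      }
      now += retryIntervalMs; // attempt failed; back off and try again
    }
    return false; // retries exhausted before the new master was ready
  }

  public static void main(String[] args) {
    // A ~400s window (the ~6m40s observed above) against 10 retries spaced
    // 10s apart (100s of budget): the report is lost.
    System.out.println(reportOpened(10, 10_000, 400_000)); // false
    // A master back within the budget receives the report.
    System.out.println(reportOpened(10, 10_000, 50_000));  // true
  }
}
```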
I reproduced the issue as follows:
# Create a cluster with many regions undergoing assignment (e.g.,
disable/re-enable a large table)
# Kill the master process while RegionServers are actively opening regions
# Allow the new master to start and complete initialization
# Observe regions permanently stuck in OPENING state
{code:bash}
# Create a table with 50k regions. It will take 2hr 40min to complete the
# region assignment in my testing env.
export nsplits=50000
export TABLE_BUG=usertable
echo "create '${TABLE_BUG}', {NAME => 'cf1'}, {NAME => 'cf2'}, {NAME => 'cf3'}, {NAME => 'cf4'}, {NAME => 'cf5'}, {SPLITS => (1..$nsplits).map {|i| \"user#{10000+i*(99999-10000)/$nsplits}\"}}" | sudo -u hbase hbase shell -n

# Load data via YCSB to generate WAL across all regions; WAL replay on open
# slows down OpenRegionProcedure
echo "Loading data to generate WAL..."
cd /tmp
[ ! -d ycsb-0.17.0 ] && curl -O --location https://github.com/brianfrankcooper/YCSB/releases/download/0.17.0/ycsb-0.17.0.tar.gz && tar xfvz ycsb-0.17.0.tar.gz
cd ycsb-0.17.0

# The data size will grow ~500GB every 4 hours. If you leave it running for
# 12 hours, there will be 1.5TB of data, though compactions may shrink the
# total data size.
for cf in cf1 cf2 cf3 cf4 cf5; do
  nohup bash -c '
    bin/ycsb.sh load hbase20 \
      -p table='"${TABLE_BUG}"' -p columnfamily='"${cf}"' \
      -p recordcount=50000000 \
      -p fieldcount=1000 \
      -p fieldlength=5000 \
      -P workloads/workloada -threads 50
  ' > /tmp/ycsb_load-${cf}.log 2>&1 &
done

REGION_COUNT=$(echo "list_regions '${TABLE_BUG}'" | sudo -u hbase hbase shell -n 2>/dev/null | grep -cP '[a-f0-9]{32}')
echo "Table ${TABLE_BUG} has ${REGION_COUNT} regions"

# Disable table: creates CloseRegionProcedures for all regions
echo "Disable ${TABLE_BUG}..."
echo "disable '${TABLE_BUG}'" | sudo -u hbase hbase shell -n &
DISABLE_PID=$!
sleep 5

# Wait for CloseRegion procedures to be dispatched, then kill master
while true; do
  line=$(grep "Initialized subprocedures" /var/log/hbase/hbase-hbase-master-*.log 2>/dev/null | grep CloseRegionProcedure | tail -1)
  if [ -n "$line" ]; then
    ts=$(echo "$line" | grep -oP "^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}" | tr "T" " ")
    echo $ts
    if [ -n "$ts" ]; then
      log_epoch=$(date -d "$ts" +%s 2>/dev/null)
      now_epoch=$(date +%s)
      diff=$((now_epoch - log_epoch))
      if [ "$diff" -ge 0 ] && [ "$diff" -le 10 ]; then
        echo "$(date): CloseRegion dispatched ${diff}s ago; killing HMaster"
        master_pid=$(pgrep -f proc_master)
        [ -n "$master_pid" ] && sudo kill -9 $master_pid
        break
      fi
    fi
  fi
  sleep 1
done
wait $DISABLE_PID 2>/dev/null

echo "Waiting 90s for master to restart..."
sleep 90
sudo systemctl stop hbase-master
sudo -u hdfs hdfs dfs -mkdir -p /user/hbase/WAL/MasterData/backup-WALs/
sudo -u hdfs hdfs dfs -cp /user/hbase/WAL/MasterData/WALs/ /user/hbase/WAL/MasterData/backup-WALs/
sudo -u hdfs hdfs dfs -rm /user/hbase/WAL/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/recovered.wals/
aws s3 cp s3://my/S3/path/hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/proc/ s3://my/S3/path/backup-hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/proc/ --recursive
aws s3 rm s3://my/S3/path/hbase/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/proc/ --recursive
sudo -u hdfs hdfs dfs -ls /user/hbase/WAL/MasterData/WALs/ | grep dead
echo "deleteall /hbase/master" | sudo -u hbase hbase zkcli
sudo systemctl start hbase-master
sudo -u hbase hbase hbck -j /usr/lib/hbase-operator-tools/hbase-hbck2.jar setTableState usertable DISABLED

# Now enable: creates OpenRegionProcedures; some PIDs may collide with
# in-flight ones
echo "Enabling ${TABLE_BUG}..."
echo "enable '${TABLE_BUG}'" | sudo -u hbase hbase shell -n &
ENABLE_PID=$!

# Wait for OpenRegion procedures to be dispatched, then kill master again
while true; do
  line=$(grep "Initialized subprocedures" /var/log/hbase/hbase-hbase-master-*.log 2>/dev/null | grep OpenRegionProcedure | tail -1)
  if [ -n "$line" ]; then
    ts=$(echo "$line" | grep -oP "^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}" | tr "T" " ")
    echo $ts
    if [ -n "$ts" ]; then
      log_epoch=$(date -d "$ts" +%s 2>/dev/null)
      now_epoch=$(date +%s)
      diff=$((now_epoch - log_epoch))
      if [ "$diff" -ge 0 ] && [ "$diff" -le 10 ]; then
        echo "$(date): OpenRegion dispatched ${diff}s ago; killing HMaster"
        master_pid=$(pgrep -f proc_master)
        [ -n "$master_pid" ] && sudo kill -9 $master_pid
        break
      fi
    fi
  fi
  sleep 1
done
wait $ENABLE_PID 2>/dev/null
{code}
From the reproduction, I observed the following impact:
* 11,340 regions permanently stuck in OPENING out of ~41,500 in transition
* Balancer completely blocked
* STUCK RIT count plateaued for 12+ minutes with zero recovery
* 99.8% of stuck regions confirmed "already online" on the RegionServer
* Zero FAILED_OPEN events — all opens succeeded, reports were lost
* Requires manual intervention (HBCK2 assigns or another master restart) to
clear
I fixed the issue by updating RSRpcServices.java and AssignRegionHandler.java
as follows. The issue went away after deploying the fix.
{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
index 0e08c7dc17..8aeba1f90f 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
@@ -3961,6 +3961,17 @@ public class RSRpcServices implements HBaseRPCErrorHandler, AdminService.Blockin
     if (regionServer.submitRegionProcedure(procId)) {
       regionServer.getExecutorService().submit(AssignRegionHandler.create(regionServer,
         regionInfo, procId, tableDesc, masterSystemTime, initiatingMasterActiveTime));
+    } else {
+      // SIM V2117255161: Even though this PID was already submitted or executed, we still need
+      // to check if the region is already online and report OPENED to the master. After a
+      // master restart, the new master re-dispatches procedures from the WAL. If the region
+      // was opened and reported to the old (dead) master, the new master doesn't know.
+      // Submit the handler anyway so it can detect the "already online" case and report back.
+      LOG.info("Submitting AssignRegionHandler for duplicate pid={}, region={} to handle "
+        + "potential already-online region after master restart",
+        procId, regionInfo.getRegionNameAsString());
+      regionServer.getExecutorService().submit(AssignRegionHandler.create(regionServer,
+        regionInfo, procId, tableDesc, masterSystemTime, initiatingMasterActiveTime));
     }
   }
 }
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
index d61f2de30f..395a94b1ef 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/AssignRegionHandler.java
@@ -106,12 +106,26 @@ public class AssignRegionHandler extends EventHandler {
     String regionName = regionInfo.getRegionNameAsString();
     Region onlineRegion = rs.getRegion(encodedName);
     if (onlineRegion != null) {
-      LOG.warn("Received OPEN for {} which is already online", regionName);
-      // Just follow the old behavior, do we need to call reportRegionStateTransition? Maybe not?
-      // For normal case, it could happen that the rpc call to schedule this handler is succeeded,
-      // but before returning to master the connection is broken. And when master tries again, we
-      // have already finished the opening. For this case we do not need to call
-      // reportRegionStateTransition any more.
+      LOG.warn("Received OPEN for {} which is already online, pid={}, "
+        + "initiatingMasterActiveTime={}", regionName, openProcId, initiatingMasterActiveTime);
+      // SIM V2117255161: After a master restart, the new master re-dispatches open procedures
+      // recovered from the WAL. If the region was already opened and reported OPENED to the
+      // old (dead) master, the new master has no record of it. We must report OPENED again
+      // so the new master can complete the procedure. Without this, the region is stuck in
+      // OPENING state forever (permanent RIT).
+      long openSeqNum = onlineRegion.getMaxFlushedSeqId();
+      if (openSeqNum == HConstants.NO_SEQNUM) {
+        openSeqNum = 0;
+      }
+      LOG.info("Reporting OPENED for already-online region {} to master, pid={}, openSeqNum={}",
+        regionName, openProcId, openSeqNum);
+      if (
+        !rs.reportRegionStateTransition(new RegionStateTransitionContext(TransitionCode.OPENED,
+          openSeqNum, openProcId, masterSystemTime, regionInfo, initiatingMasterActiveTime))
+      ) {
+        LOG.warn("Failed to report OPENED for already-online region {} to master, pid={}",
+          regionName, openProcId);
+      }
       return;
     }
     Boolean previous =
       rs.getRegionsInTransitionInRS().putIfAbsent(encodedNameBytes, Boolean.TRUE);
{code}
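To illustrate the before/after behavior of the fix, here is a self-contained simulation (all names are hypothetical, not HBase classes) of the master-restart interaction. Only the fixed branch, which reports OPENED even for an already-online region, lets the re-created procedure complete:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the interaction: a restarted master re-dispatches an open for a
// region the RS already holds; the flag toggles old vs fixed RS behavior.
public class MasterFailoverSketch {
  enum ProcState { OPENING, OPENED }

  final Map<Long, ProcState> masterProcs = new HashMap<>(); // new master's view
  final boolean reportOnAlreadyOnline; // false = old behavior, true = fixed

  MasterFailoverSketch(boolean fixed) { this.reportOnAlreadyOnline = fixed; }

  /** RS side: the region is already online when the duplicate dispatch arrives. */
  void rsHandleOpen(long procId) {
    boolean alreadyOnline = true; // open completed before the master restart
    if (alreadyOnline) {
      if (reportOnAlreadyOnline) {
        masterProcs.put(procId, ProcState.OPENED); // fixed: report OPENED again
      }
      return; // old behavior: silent return, the master keeps waiting
    }
  }

  ProcState run() {
    long procId = 100L;
    masterProcs.put(procId, ProcState.OPENING); // WAL replay re-creates the procedure
    rsHandleOpen(procId);                       // re-dispatch to the RS
    return masterProcs.get(procId);
  }

  public static void main(String[] args) {
    System.out.println(new MasterFailoverSketch(false).run()); // OPENING (stuck RIT)
    System.out.println(new MasterFailoverSketch(true).run());  // OPENED (recovered)
  }
}
```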
--
This message was sent by Atlassian Jira
(v8.20.10#820010)