[
https://issues.apache.org/jira/browse/HBASE-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhiwen Deng updated HBASE-29364:
--------------------------------
Description:
We have encountered multiple cases where regions were opened on RegionServers
(RS) that had already been offlined. Only recently did we discover a potential
cause; the details are as follows:
Our HDFS storage usage reached a critical level, which left the HBase Master
and RegionServers unable to write, so they aborted. We manually deleted some
data and HDFS recovered, but the hbck report then showed that some regions were
open on already-offline RegionServers, leaving those regions unable to serve.
We eventually used HBCK2 to assign these regions and the problem was resolved.
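For reference, the manual recovery used the HBCK2 {{assigns}} command, roughly as below (a sketch only; the HBCK2 jar path/version is a placeholder for whatever is deployed):
{code:bash}
# Re-assign the stuck region by its encoded region name via HBCK2.
# The jar path below is a placeholder.
hbase hbck -j ./hbase-hbck2.jar assigns 19f709990ad65ce3d51ddeaf29acf436
{code}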
h3. The Problem
Here is the analysis of the region transitions for one specific region,
19f709990ad65ce3d51ddeaf29acf436:
2025-05-21, 05:48:11: The region was assigned to
{+}rs-hostname,20700,1747777624803{+}, but due to some anomalies it could not
be opened on the target RS. The RS finally reported the open result
(FAILED_OPEN) to the Master:
{code:java}
2025-05-21,05:48:11,646 INFO
[RpcServer.priority.RWQ.Fifo.write.handler=2,queue=0,port=20600]
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase: Received
report from rs-hostname,20700,1747777624803, transitionCode=FAILED_OPEN,
seqId=-1, regionNode=state=OPENING, location=rs-hostname,20700,1747777624803,
table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, proc=pid=78499,
ppid=78034, state=RUNNABLE;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure {code}
2025-05-21, 05:48:27: rs-hostname,20700,1747777624803 went offline and its ServerCrashProcedure completed:
{code:java}
2025-05-21, 05:48:27,981 INFO [KeepAlivePEWorker-65]
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=78671,
state=SUCCESS; ServerCrashProcedure server=rs-hostname,20700,1747777624803,
splitWal=true, meta=false in 16.0720 sec {code}
2025-05-21, 05:49:12: The RS hosting the hbase:meta table also failed, making
the meta table unavailable and leaving the above region stuck in RIT
(Region-In-Transition):
{code:java}
2025-05-21, 05:49:12,312 WARN [ProcExecTimeout]
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
Region-In-Transition state=OPENING, location=rs-hostname,20700,1747777624803,
table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 {code}
Due to the HDFS failure, the Master aborted as well. The new active Master then
resumed the previously incomplete procedures:
{code:java}
2025-05-21, 06:02:38,423 INFO [master/master-hostname:20600:becomeActiveMaster]
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
for pid=78034, ppid=77973,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
TransitRegionStateProcedure table=test:xxx,
region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN
2025-05-21, 06:02:38,572 INFO [master/master-hostname:becomeActiveMaster]
org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach pid=78034,
ppid=77973, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
TransitRegionStateProcedure table=test:xxx,
region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN to state=OFFLINE,
location=null, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 to
restore RIT {code}
When the Master fails over, the new Master rewrites the region's state in the
meta table based on the procedure's persisted state:
{code:java}
2025-05-21, 06:07:52,433 INFO [master/master-hostname:20600:becomeActiveMaster]
org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta
entry region=19f709990ad65ce3d51ddeaf29acf436, regionState=OPENING,
lastHost=rs-hostname-last,20700,1747776391310,
regionLocation=rs-hostname,20700,1747777624803, openSeqNum=174628702
2025-05-21, 06:07:52,433 WARN [master/master-hostname:20600:becomeActiveMaster]
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure: Received report
OPENED transition from rs-hostname,20700,1747777624803 for state=OPENING,
location=rs-hostname,20700,1747777624803, table=test:xxx,
region=19f709990ad65ce3d51ddeaf29acf436, pid=78499 but the new openSeqNum -1 is
less than the current one 174628702, ignoring...
{code}
I reviewed the relevant code and found that at this point the region's state in
the Master's memory is changed to OPEN, and as the state of
RegionRemoteProcedureBase is restored, that state is later persisted to the
meta table:
{code:java}
void stateLoaded(AssignmentManager am, RegionStateNode regionNode) {
  if (state == RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED) {
    try {
      restoreSucceedState(am, regionNode, seqId);
    } catch (IOException e) {
      // should not happen as we are just restoring the state
      throw new AssertionError(e);
    }
  }
}

@Override
protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
    long openSeqNum) throws IOException {
  if (regionNode.getState() == State.OPEN) {
    // should have already been persisted, ignore
    return;
  }
  // reached even when the RS reported FAILED_OPEN, so the in-memory state becomes OPEN
  regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
}{code}
Therefore, a failed open was persisted to the meta table as OPEN, and because
the ServerCrashProcedure for that RS had already completed, the region was
never processed again:
{code:java}
2025-05-21, 06:07:53,138 INFO [PEWorker-56]
org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=78034 updating
hbase:meta row=19f709990ad65ce3d51ddeaf29acf436, regionState=OPEN,
repBarrier=174628702, openSeqNum=174628702,
regionLocation=rs-hostname,20700,1747777624803 {code}
h3. How to fix
We can follow the same logic used when the Master does not fail over: if the
open fails, the region's state is left unchanged, which prevents the sequence
above from occurring.
{code:java}
@Override
protected void updateTransitionWithoutPersistingToMeta(MasterProcedureEnv env,
    RegionStateNode regionNode, TransitionCode transitionCode, long openSeqNum)
    throws IOException {
  if (transitionCode == TransitionCode.OPENED) {
    regionOpenedWithoutPersistingToMeta(env.getAssignmentManager(), regionNode, transitionCode,
      openSeqNum);
  } else {
    assert transitionCode == TransitionCode.FAILED_OPEN;
    // will not persist to meta if giveUp is false
    env.getAssignmentManager().regionFailedOpen(regionNode, false);
  }
} {code}
So, we only need to add a check in restoreSucceedState:
{code:java}
@Override
protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
    long openSeqNum) throws IOException {
  if (regionNode.getState() == State.OPEN) {
    // should have already been persisted, ignore
    return;
  }
  // only restore the in-memory state to OPEN if the RS actually reported OPENED;
  // otherwise leave the region state alone, or the region may end up marked open
  // on an expired region server.
  if (super.transitionCode == TransitionCode.OPENED) {
    regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
  }
} {code}
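An alternative, sketched below under stated assumptions, would be to mirror updateTransitionWithoutPersistingToMeta and handle FAILED_OPEN explicitly during restore as well; whether regionFailedOpen is safe to call from the state-restore path is an assumption that would need verification:
{code:java}
// Sketch only, mirroring the non-failover path above.
// Assumption: am.regionFailedOpen(regionNode, false) is safe to invoke while the
// procedure state is being restored.
@Override
protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
    long openSeqNum) throws IOException {
  if (regionNode.getState() == State.OPEN) {
    // already persisted, nothing to do
    return;
  }
  if (super.transitionCode == TransitionCode.OPENED) {
    regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
  } else {
    // will not persist to meta since giveUp is false
    am.regionFailedOpen(regionNode, false);
  }
} {code}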
I would like to know whether this fix is OK; if so, I will submit a patch.
> Region will be opened in unknown regionserver when master is changed & rs
> crashed
> ---------------------------------------------------------------------------------
>
> Key: HBASE-29364
> URL: https://issues.apache.org/jira/browse/HBASE-29364
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.3.0
> Reporter: Zhiwen Deng
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)