ConfX created HBASE-29806:
-----------------------------
Summary: master procedure executor fail restart due to NPE in
RegionRemoteProcedureBase.afterReplay()
Key: HBASE-29806
URL: https://issues.apache.org/jira/browse/HBASE-29806
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 2.6.4, 2.6.3
Reporter: ConfX
h2. Summary
`RegionRemoteProcedureBase.afterReplay()` does not handle the case where the
parent `TransitRegionStateProcedure` has already been completed
(finished/rolled back). This results in a `NullPointerException` when
attempting to attach the remote procedure to a non-existent parent during
procedure executor restart.
This causes master procedure executor to fail during restart.
h2. Root Cause
*File:*
`hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java`
*Lines 277-280 (getParent method):*
{code:java}
private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
return (TransitRegionStateProcedure)
env.getMasterServices().getMasterProcedureExecutor()
.getProcedure(getParentProcId());
} {code}
*Lines 391-393 (afterReplay method):*
{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
getParent(env).attachRemoteProc(this); // NPE if parent is null
} {code}
h2. Why the Bug Occurs
1. *Procedure Loading Behavior:* When `ProcedureExecutor` loads procedures
after restart:
- Finished procedures go to the `completed` map
- Unfinished procedures go to the `procedures` map
2. *getProcedure Only Checks Active Procedures:* The `getProcedure(procId)`
method at `ProcedureExecutor.java:1191-1192` only checks the `procedures` map:
{code:java}
public Procedure<TEnvironment> getProcedure(final long procId) {
return procedures.get(procId); // Does not check 'completed' map
}{code}
3. *Race Condition During Rollback:* During rollback of a
`ServerCrashProcedure`:
- The parent `TransitRegionStateProcedure` may complete/finish (rollback
successfully)
- Its child `RegionRemoteProcedureBase` may still be pending in the store
- On restart, the parent is in `completed` map, child is in `procedures` map
- `afterReplay()` is called on the child, but `getParent()` returns `null`
4. *Missing Null Check:* The `afterReplay()` method directly calls
`getParent(env).attachRemoteProc(this)` without checking if `getParent()`
returns null.
h2. StackTrace
{code:java}
java.lang.NullPointerException
at
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.getParent(RegionRemoteProcedureBase.java:279)
at
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.afterReplay(RegionRemoteProcedureBase.java:392)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.lambda$pushProceduresAfterLoad$5(ProcedureExecutor.java:529)
at java.util.ArrayList.forEach(ArrayList.java:1259)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.pushProceduresAfterLoad(ProcedureExecutor.java:528)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:612)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:77)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.load(ProcedureExecutor.java:351)
at
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:284)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:342)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:665)
{code}
h2. Proposed Fix
*Option 1: Null Check in afterReplay()*
{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
TransitRegionStateProcedure parent = getParent(env);
if (parent != null) {
parent.attachRemoteProc(this);
} else {
// Parent procedure has already completed - this child should be cleaned up
LOG.warn("Parent procedure {} not found for {}. Parent may have been
completed/rolled back.",
getParentProcId(), this);
// Consider setting this procedure's state to allow cleanup
}
} {code}
*Option 2: Lookup in Both Maps*
Modify `getParent()` to also check the completed procedures:
{code:java}
private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
Procedure<?> parent = env.getMasterServices().getMasterProcedureExecutor()
.getProcedure(getParentProcId());
if (parent == null) {
// Check if parent is in completed procedures
parent = env.getMasterServices().getMasterProcedureExecutor()
.getResult(getParentProcId());
}
return (TransitRegionStateProcedure) parent;
} {code}
**
**
--
This message was sent by Atlassian Jira
(v8.20.10#820010)