[ 
https://issues.apache.org/jira/browse/HBASE-29806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ConfX updated HBASE-29806:
--------------------------
    Description: 
h2. Summary

`RegionRemoteProcedureBase.afterReplay()` does not handle the case where the 
parent `TransitRegionStateProcedure` has already been completed 
(finished/rolled back). This results in a `NullPointerException` when 
attempting to attach the remote procedure to a non-existent parent during 
procedure executor restart.
This causes master procedure executor to fail during restart.
 
h2. Root Cause

*File:* 
`hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java`
 
*Lines 277-280 (getParent method):*
{code:java}
private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
  return (TransitRegionStateProcedure) 
env.getMasterServices().getMasterProcedureExecutor()
    .getProcedure(getParentProcId());
} {code}
*Lines 391-393 (afterReplay method):*
{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
  getParent(env).attachRemoteProc(this);  // NPE if parent is null
} {code}
h2. Why the Bug Occurs

1. *Procedure Loading Behavior:* When `ProcedureExecutor` loads procedures 
after restart:
   - Finished procedures go to the `completed` map
   - Unfinished procedures go to the `procedures` map

2. *getProcedure Only Checks Active Procedures:* The `getProcedure(procId)` 
method at `ProcedureExecutor.java:1191-1192` only checks the `procedures` map:
{code:java}
   public Procedure<TEnvironment> getProcedure(final long procId) {
     return procedures.get(procId);  // Does not check 'completed' map
   }{code}
3. *Race Condition During Rollback:* During rollback of a 
`ServerCrashProcedure`:
   - The parent `TransitRegionStateProcedure` may complete/finish (rollback 
successfully)
   - Its child `RegionRemoteProcedureBase` may still be pending in the store
   - On restart, the parent is in `completed` map, child is in `procedures` map
   - `afterReplay()` is called on the child, but `getParent()` returns `null`

4. *Missing Null Check:* The `afterReplay()` method directly calls 
`getParent(env).attachRemoteProc(this)` without checking if `getParent()` 
returns null.

 
h2. StackTrace
{code:java}
java.lang.NullPointerException
    at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.getParent(RegionRemoteProcedureBase.java:279)
    at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.afterReplay(RegionRemoteProcedureBase.java:392)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.lambda$pushProceduresAfterLoad$5(ProcedureExecutor.java:529)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.pushProceduresAfterLoad(ProcedureExecutor.java:528)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:612)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:77)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.load(ProcedureExecutor.java:351)
    at 
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:284)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:342)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:665)
 {code}
h2. Proposed Fix

*Option 1: Null Check in afterReplay()*
{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
  TransitRegionStateProcedure parent = getParent(env);
  if (parent != null) {
    parent.attachRemoteProc(this);
  } else {
    // Parent procedure has already completed - this child should be cleaned up
    LOG.warn("Parent procedure {} not found for {}. Parent may have been 
completed/rolled back.",
      getParentProcId(), this);
    // Consider setting this procedure's state to allow cleanup
  }
} {code}
*Option 2: Lookup in Both Maps*
Modify `getParent()` to also check the completed procedures:
{code:java}
private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
  Procedure<?> parent = env.getMasterServices().getMasterProcedureExecutor()
    .getProcedure(getParentProcId());
  if (parent == null) {
    // Check if parent is in completed procedures
    parent = env.getMasterServices().getMasterProcedureExecutor()
      .getResult(getParentProcId());
  }
  return (TransitRegionStateProcedure) parent;
} {code}
 

  was:
h2. Summary
`RegionRemoteProcedureBase.afterReplay()` does not handle the case where the 
parent `TransitRegionStateProcedure` has already been completed 
(finished/rolled back). This results in a `NullPointerException` when 
attempting to attach the remote procedure to a non-existent parent during 
procedure executor restart.
This causes master procedure executor to fail during restart.
 
h2. Root Cause
*File:* 
`hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java`
 
*Lines 277-280 (getParent method):*
{code:java}
private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
  return (TransitRegionStateProcedure) 
env.getMasterServices().getMasterProcedureExecutor()
    .getProcedure(getParentProcId());
} {code}
*Lines 391-393 (afterReplay method):*
{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
  getParent(env).attachRemoteProc(this);  // NPE if parent is null
} {code}
h2. Why the Bug Occurs

1. *Procedure Loading Behavior:* When `ProcedureExecutor` loads procedures 
after restart:
   - Finished procedures go to the `completed` map
   - Unfinished procedures go to the `procedures` map

2. *getProcedure Only Checks Active Procedures:* The `getProcedure(procId)` 
method at `ProcedureExecutor.java:1191-1192` only checks the `procedures` map:
{code:java}
   public Procedure<TEnvironment> getProcedure(final long procId) {
     return procedures.get(procId);  // Does not check 'completed' map
   }{code}
3. *Race Condition During Rollback:* During rollback of a 
`ServerCrashProcedure`:
   - The parent `TransitRegionStateProcedure` may complete/finish (rollback 
successfully)
   - Its child `RegionRemoteProcedureBase` may still be pending in the store
   - On restart, the parent is in `completed` map, child is in `procedures` map
   - `afterReplay()` is called on the child, but `getParent()` returns `null`

4. *Missing Null Check:* The `afterReplay()` method directly calls 
`getParent(env).attachRemoteProc(this)` without checking if `getParent()` 
returns null.

 
h2. StackTrace
{code:java}
java.lang.NullPointerException
    at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.getParent(RegionRemoteProcedureBase.java:279)
    at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.afterReplay(RegionRemoteProcedureBase.java:392)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.lambda$pushProceduresAfterLoad$5(ProcedureExecutor.java:529)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.pushProceduresAfterLoad(ProcedureExecutor.java:528)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:612)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:77)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.load(ProcedureExecutor.java:351)
    at 
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:284)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:342)
    at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:665)
 {code}
h2. Proposed Fix
*Option 1: Null Check in afterReplay()*
{code:java}
@Override
protected void afterReplay(MasterProcedureEnv env) {
  TransitRegionStateProcedure parent = getParent(env);
  if (parent != null) {
    parent.attachRemoteProc(this);
  } else {
    // Parent procedure has already completed - this child should be cleaned up
    LOG.warn("Parent procedure {} not found for {}. Parent may have been 
completed/rolled back.",
      getParentProcId(), this);
    // Consider setting this procedure's state to allow cleanup
  }
} {code}
*Option 2: Lookup in Both Maps*
Modify `getParent()` to also check the completed procedures:
{code:java}
private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
  Procedure<?> parent = env.getMasterServices().getMasterProcedureExecutor()
    .getProcedure(getParentProcId());
  if (parent == null) {
    // Check if parent is in completed procedures
    parent = env.getMasterServices().getMasterProcedureExecutor()
      .getResult(getParentProcId());
  }
  return (TransitRegionStateProcedure) parent;
} {code}
**
 ** 


> master procedure executor fail restart due to NPE in 
> RegionRemoteProcedureBase.afterReplay()
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29806
>                 URL: https://issues.apache.org/jira/browse/HBASE-29806
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.6.3, 2.6.4
>            Reporter: ConfX
>            Priority: Major
>         Attachments: afterReplay-NPE.patch
>
>
> h2. Summary
> `RegionRemoteProcedureBase.afterReplay()` does not handle the case where the 
> parent `TransitRegionStateProcedure` has already been completed 
> (finished/rolled back). This results in a `NullPointerException` when 
> attempting to attach the remote procedure to a non-existent parent during 
> procedure executor restart.
> This causes master procedure executor to fail during restart.
>  
> h2. Root Cause
> *File:* 
> `hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java`
>  
> *Lines 277-280 (getParent method):*
> {code:java}
> private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
>   return (TransitRegionStateProcedure) 
> env.getMasterServices().getMasterProcedureExecutor()
>     .getProcedure(getParentProcId());
> } {code}
> *Lines 391-393 (afterReplay method):*
> {code:java}
> @Override
> protected void afterReplay(MasterProcedureEnv env) {
>   getParent(env).attachRemoteProc(this);  // NPE if parent is null
> } {code}
> h2. Why the Bug Occurs
> 1. *Procedure Loading Behavior:* When `ProcedureExecutor` loads procedures 
> after restart:
>    - Finished procedures go to the `completed` map
>    - Unfinished procedures go to the `procedures` map
> 2. *getProcedure Only Checks Active Procedures:* The `getProcedure(procId)` 
> method at `ProcedureExecutor.java:1191-1192` only checks the `procedures` map:
> {code:java}
>    public Procedure<TEnvironment> getProcedure(final long procId) {
>      return procedures.get(procId);  // Does not check 'completed' map
>    }{code}
> 3. *Race Condition During Rollback:* During rollback of a 
> `ServerCrashProcedure`:
>    - The parent `TransitRegionStateProcedure` may complete/finish (rollback 
> successfully)
>    - Its child `RegionRemoteProcedureBase` may still be pending in the store
>    - On restart, the parent is in `completed` map, child is in `procedures` 
> map
>    - `afterReplay()` is called on the child, but `getParent()` returns `null`
> 4. *Missing Null Check:* The `afterReplay()` method directly calls 
> `getParent(env).attachRemoteProc(this)` without checking if `getParent()` 
> returns null.
>  
> h2. StackTrace
> {code:java}
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.getParent(RegionRemoteProcedureBase.java:279)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.afterReplay(RegionRemoteProcedureBase.java:392)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.lambda$pushProceduresAfterLoad$5(ProcedureExecutor.java:529)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.pushProceduresAfterLoad(ProcedureExecutor.java:528)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:612)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:77)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.load(ProcedureExecutor.java:351)
>     at 
> org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:284)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:342)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:665)
>  {code}
> h2. Proposed Fix
> *Option 1: Null Check in afterReplay()*
> {code:java}
> @Override
> protected void afterReplay(MasterProcedureEnv env) {
>   TransitRegionStateProcedure parent = getParent(env);
>   if (parent != null) {
>     parent.attachRemoteProc(this);
>   } else {
>     // Parent procedure has already completed - this child should be cleaned 
> up
>     LOG.warn("Parent procedure {} not found for {}. Parent may have been 
> completed/rolled back.",
>       getParentProcId(), this);
>     // Consider setting this procedure's state to allow cleanup
>   }
> } {code}
> *Option 2: Lookup in Both Maps*
> Modify `getParent()` to also check the completed procedures:
> {code:java}
> private TransitRegionStateProcedure getParent(MasterProcedureEnv env) {
>   Procedure<?> parent = env.getMasterServices().getMasterProcedureExecutor()
>     .getProcedure(getParentProcId());
>   if (parent == null) {
>     // Check if parent is in completed procedures
>     parent = env.getMasterServices().getMasterProcedureExecutor()
>       .getResult(getParentProcId());
>   }
>   return (TransitRegionStateProcedure) parent;
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to