ConfX created HBASE-29805:
-----------------------------

             Summary: NullPointerException in AbstractRpcClient.createAddr 
after a RegionServer restart
                 Key: HBASE-29805
                 URL: https://issues.apache.org/jira/browse/HBASE-29805
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 2.6.4, 2.6.3
            Reporter: ConfX
         Attachments: createAddr-NPE.patch

h2. Summary
A `NullPointerException` occurs in `AbstractRpcClient.createAddr()` when 
attempting to split a region after a RegionServer restart. The root cause is a 
missing null check for `node.getRegionLocation()` in 
`SplitTableRegionProcedure.checkSplittable()`.
h2. Failure Stacktrace
{code:java}
java.lang.NullPointerException
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.createAddr(AbstractRpcClient.java:459)
 at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.createBlockingRpcChannel(AbstractRpcClient.java:538)
 at 
org.apache.hadoop.hbase.client.ConnectionImplementation.lambda$getAdmin$7(ConnectionImplementation.java:1440)
 at 
org.apache.hadoop.hbase.util.ConcurrentMapUtils.computeIfAbsentEx(ConcurrentMapUtils.java:51)
 at 
org.apache.hadoop.hbase.client.ConnectionImplementation.getAdmin(ConnectionImplementation.java:1438)
 at 
org.apache.hadoop.hbase.client.ServerConnectionUtils$ShortCircuitingClusterConnection.getAdmin(ServerConnectionUtils.java:88)
 at 
org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.getRegionInfoResponse(AssignmentManagerUtil.java:76)
 at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkSplittable(SplitTableRegionProcedure.java:220)
 at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:137){code}
 

Root Cause
**File:** 
`hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java`
{code:java}
if (node != null) {
  try {
    GetRegionInfoResponse response;
    if (!hasBestSplitRow()) {
      LOG.info(
        "{} splitKey isn't explicitly specified, will try to find a best split 
key from RS {}",
        node.getRegionInfo().getRegionNameAsString(), node.getRegionLocation());
      response = AssignmentManagerUtil.getRegionInfoResponse(env, 
node.getRegionLocation(),  // <-- BUG: node.getRegionLocation() can be null
        node.getRegionInfo(), true);
      bestSplitRow =
        response.hasBestSplitRow() ? response.getBestSplitRow().toByteArray() : 
null;
    } else {
      response = AssignmentManagerUtil.getRegionInfoResponse(env, 
node.getRegionLocation(),  // <-- BUG: node.getRegionLocation() can be null
        node.getRegionInfo(), false);
    } {code}
h2. Why This Happens

1. *`checkOnline()` is insufficient:* Before `checkSplittable()` is called, 
`checkOnline(env, regionToSplit)` is invoked (line 119). However, 
`checkOnline()` only verifies that the region state is `OPEN`:
{code:java}
   // RegionStateNode.checkOnline()
   public void checkOnline() throws DoNotRetryRegionException {
     RegionInfo ri = getRegionInfo();
     State s = state;
     if (s != State.OPEN) {
       throw new DoNotRetryRegionException(ri.getEncodedName() + " is not OPEN; 
state=" + s);
     }
     // ... other checks but NO check for getRegionLocation() != null
   } {code}
2. *Region can be OPEN with null location:* After a RegionServer restart, a 
region can temporarily be in the OPEN state (or transition to OPEN) before a 
server location is assigned. During this window, `node.getRegionLocation()` 
returns `null`.
3. *Null propagation:* The null `ServerName` is passed down:
   - `AssignmentManagerUtil.getRegionInfoResponse(env, null, ...)` (line 76)
   - `env.getMasterServices().getClusterConnection().getAdmin(null)` 
(ConnectionImplementation.java:1440)
   - `this.rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)` 
(AbstractRpcClient.java:538)
   - `createAddr(null)` which does `null.getHostname()` causing NPE 
(AbstractRpcClient.java:459)
 
 
 
h2. Call Flow to NPE
{code:java}
SplitTableRegionProcedure.<init>() [line 137]
  -> checkSplittable() [line 220]
    -> AssignmentManagerUtil.getRegionInfoResponse(env, 
node.getRegionLocation(), ...) [line 76]
      -> 
env.getMasterServices().getClusterConnection().getAdmin(regionLocation) [null 
is passed]
        -> ConnectionImplementation.getAdmin(null)
          -> rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)
            -> createAddr(null)
              -> null.getHostname()  // NPE! {code}
This bug is triggered when:
 # A RegionServer is restarted (gracefully or forcefully)
 # A split operation is initiated for a region that was hosted on that 
RegionServer
 # The split is initiated before the region is fully re-assigned to a server

 
h2. Impact
- Split operations can fail with an uninformative NPE instead of a meaningful 
error message
- Users cannot split regions during or immediately after RegionServer restarts
- The NPE provides no indication of the actual problem (region not assigned)
 
h2. Suggested Fix
**Option 1: Add explicit null check and throw meaningful exception**
{code:java}
private void checkSplittable(final MasterProcedureEnv env, final RegionInfo 
regionToSplit)
  throws IOException {
  // ... existing code ...  RegionStateNode node =
    
env.getAssignmentManager().getRegionStates().getRegionStateNode(getParentRegion());
  IOException splittableCheckIOE = null;
  boolean splittable = false;
  if (node != null) {
    // ADD THIS NULL CHECK:
    ServerName regionLocation = node.getRegionLocation();
    if (regionLocation == null) {
      throw new DoNotRetryIOException(
        "Region " + regionToSplit.getShortNameToLog() +
        " is not assigned to any server. Cannot check splittability.");
    }    try {
      GetRegionInfoResponse response;
      if (!hasBestSplitRow()) {
        LOG.info(
          "{} splitKey isn't explicitly specified, will try to find a best 
split key from RS {}",
          node.getRegionInfo().getRegionNameAsString(), regionLocation);
        response = AssignmentManagerUtil.getRegionInfoResponse(env, 
regionLocation,
          node.getRegionInfo(), true); {code}
**Option 2: Enhance `checkOnline()` to also verify location is assigned**
{code:java}
// In RegionStateNode.java
public void checkOnline() throws DoNotRetryRegionException {
  RegionInfo ri = getRegionInfo();
  State s = state;
  if (s != State.OPEN) {
    throw new DoNotRetryRegionException(ri.getEncodedName() + " is not OPEN; 
state=" + s);
  }
  // ADD THIS CHECK:
  if (getRegionLocation() == null) {
    throw new DoNotRetryRegionException(
      ri.getEncodedName() + " is OPEN but not assigned to any server");
  }
  // ... rest of existing checks
} {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to