[ 
https://issues.apache.org/jira/browse/HBASE-29805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050017#comment-18050017
 ] 

Duo Zhang commented on HBASE-29805:
-----------------------------------

We use github pull request to submit patch, please open a PR there.

Thanks.

> NullPointerException in AbstractRpcClient.createAddr after a RegionServer 
> restart
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-29805
>                 URL: https://issues.apache.org/jira/browse/HBASE-29805
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 2.6.3, 2.6.4
>            Reporter: ConfX
>            Priority: Major
>         Attachments: createAddr-NPE.patch
>
>
> h2. Summary
> A `NullPointerException` occurs in `AbstractRpcClient.createAddr()` when 
> attempting to split a region after a RegionServer restart. The root cause is 
> a missing null check for `node.getRegionLocation()` in 
> `SplitTableRegionProcedure.checkSplittable()`.
> h2. Failure Stacktrace
> {code:java}
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.createAddr(AbstractRpcClient.java:459)
>  at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.createBlockingRpcChannel(AbstractRpcClient.java:538)
>  at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.lambda$getAdmin$7(ConnectionImplementation.java:1440)
>  at 
> org.apache.hadoop.hbase.util.ConcurrentMapUtils.computeIfAbsentEx(ConcurrentMapUtils.java:51)
>  at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.getAdmin(ConnectionImplementation.java:1438)
>  at 
> org.apache.hadoop.hbase.client.ServerConnectionUtils$ShortCircuitingClusterConnection.getAdmin(ServerConnectionUtils.java:88)
>  at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.getRegionInfoResponse(AssignmentManagerUtil.java:76)
>  at 
> org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkSplittable(SplitTableRegionProcedure.java:220)
>  at 
> org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:137){code}
>  
> Root Cause
> **File:** 
> `hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java`
> {code:java}
> if (node != null) {
>   try {
>     GetRegionInfoResponse response;
>     if (!hasBestSplitRow()) {
>       LOG.info(
>         "{} splitKey isn't explicitly specified, will try to find a best 
> split key from RS {}",
>         node.getRegionInfo().getRegionNameAsString(), 
> node.getRegionLocation());
>       response = AssignmentManagerUtil.getRegionInfoResponse(env, 
> node.getRegionLocation(),  // <-- BUG: node.getRegionLocation() can be null
>         node.getRegionInfo(), true);
>       bestSplitRow =
>         response.hasBestSplitRow() ? response.getBestSplitRow().toByteArray() 
> : null;
>     } else {
>       response = AssignmentManagerUtil.getRegionInfoResponse(env, 
> node.getRegionLocation(),  // <-- BUG: node.getRegionLocation() can be null
>         node.getRegionInfo(), false);
>     } {code}
> h2. Why This Happens
> 1. *`checkOnline()` is insufficient:* Before `checkSplittable()` is called, 
> `checkOnline(env, regionToSplit)` is invoked (line 119). However, 
> `checkOnline()` only verifies that the region state is `OPEN`:
> {code:java}
>    // RegionStateNode.checkOnline()
>    public void checkOnline() throws DoNotRetryRegionException {
>      RegionInfo ri = getRegionInfo();
>      State s = state;
>      if (s != State.OPEN) {
>        throw new DoNotRetryRegionException(ri.getEncodedName() + " is not 
> OPEN; state=" + s);
>      }
>      // ... other checks but NO check for getRegionLocation() != null
>    } {code}
> 2. *Region can be OPEN with null location:* After a RegionServer restart, a 
> region can temporarily be in the OPEN state (or transition to OPEN) before a 
> server location is assigned. During this window, `node.getRegionLocation()` 
> returns `null`.
> 3. *Null propagation:* The null `ServerName` is passed down:
>    - `AssignmentManagerUtil.getRegionInfoResponse(env, null, ...)` (line 76)
>    - `env.getMasterServices().getClusterConnection().getAdmin(null)` 
> (ConnectionImplementation.java:1440)
>    - `this.rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)` 
> (AbstractRpcClient.java:538)
>    - `createAddr(null)` which does `null.getHostname()` causing NPE 
> (AbstractRpcClient.java:459)
>  
>  
>  
> h2. Call Flow to NPE
> {code:java}
> SplitTableRegionProcedure.<init>() [line 137]
>   -> checkSplittable() [line 220]
>     -> AssignmentManagerUtil.getRegionInfoResponse(env, 
> node.getRegionLocation(), ...) [line 76]
>       -> 
> env.getMasterServices().getClusterConnection().getAdmin(regionLocation) [null 
> is passed]
>         -> ConnectionImplementation.getAdmin(null)
>           -> rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)
>             -> createAddr(null)
>               -> null.getHostname()  // NPE! {code}
> This bug is triggered when:
>  # A RegionServer is restarted (gracefully or forcefully)
>  # A split operation is initiated for a region that was hosted on that 
> RegionServer
>  # The split is initiated before the region is fully re-assigned to a server
>  
> h2. Impact
> - Split operations can fail with an uninformative NPE instead of a meaningful 
> error message
> - Users cannot split regions during or immediately after RegionServer restarts
> - The NPE provides no indication of the actual problem (region not assigned)
>  
> h2. Suggested Fix
> **Option 1: Add explicit null check and throw meaningful exception**
> {code:java}
> private void checkSplittable(final MasterProcedureEnv env, final RegionInfo 
> regionToSplit)
>   throws IOException {
>   // ... existing code ...  RegionStateNode node =
>     
> env.getAssignmentManager().getRegionStates().getRegionStateNode(getParentRegion());
>   IOException splittableCheckIOE = null;
>   boolean splittable = false;
>   if (node != null) {
>     // ADD THIS NULL CHECK:
>     ServerName regionLocation = node.getRegionLocation();
>     if (regionLocation == null) {
>       throw new DoNotRetryIOException(
>         "Region " + regionToSplit.getShortNameToLog() +
>         " is not assigned to any server. Cannot check splittability.");
>     }    try {
>       GetRegionInfoResponse response;
>       if (!hasBestSplitRow()) {
>         LOG.info(
>           "{} splitKey isn't explicitly specified, will try to find a best 
> split key from RS {}",
>           node.getRegionInfo().getRegionNameAsString(), regionLocation);
>         response = AssignmentManagerUtil.getRegionInfoResponse(env, 
> regionLocation,
>           node.getRegionInfo(), true); {code}
> **Option 2: Enhance `checkOnline()` to also verify location is assigned**
> {code:java}
> // In RegionStateNode.java
> public void checkOnline() throws DoNotRetryRegionException {
>   RegionInfo ri = getRegionInfo();
>   State s = state;
>   if (s != State.OPEN) {
>     throw new DoNotRetryRegionException(ri.getEncodedName() + " is not OPEN; 
> state=" + s);
>   }
>   // ADD THIS CHECK:
>   if (getRegionLocation() == null) {
>     throw new DoNotRetryRegionException(
>       ri.getEncodedName() + " is OPEN but not assigned to any server");
>   }
>   // ... rest of existing checks
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to