[
https://issues.apache.org/jira/browse/HBASE-29805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050017#comment-18050017
]
Duo Zhang commented on HBASE-29805:
-----------------------------------
We use github pull request to submit patch, please open a PR there.
Thanks.
> NullPointerException in AbstractRpcClient.createAddr after a RegionServer
> restart
> ---------------------------------------------------------------------------------
>
> Key: HBASE-29805
> URL: https://issues.apache.org/jira/browse/HBASE-29805
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 2.6.3, 2.6.4
> Reporter: ConfX
> Priority: Major
> Attachments: createAddr-NPE.patch
>
>
> h2. Summary
> A `NullPointerException` occurs in `AbstractRpcClient.createAddr()` when
> attempting to split a region after a RegionServer restart. The root cause is
> a missing null check for `node.getRegionLocation()` in
> `SplitTableRegionProcedure.checkSplittable()`.
> h2. Failure Stacktrace
> {code:java}
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.createAddr(AbstractRpcClient.java:459)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.createBlockingRpcChannel(AbstractRpcClient.java:538)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.lambda$getAdmin$7(ConnectionImplementation.java:1440)
> at
> org.apache.hadoop.hbase.util.ConcurrentMapUtils.computeIfAbsentEx(ConcurrentMapUtils.java:51)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.getAdmin(ConnectionImplementation.java:1438)
> at
> org.apache.hadoop.hbase.client.ServerConnectionUtils$ShortCircuitingClusterConnection.getAdmin(ServerConnectionUtils.java:88)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.getRegionInfoResponse(AssignmentManagerUtil.java:76)
> at
> org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkSplittable(SplitTableRegionProcedure.java:220)
> at
> org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:137){code}
>
> Root Cause
> **File:**
> `hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java`
> {code:java}
> if (node != null) {
> try {
> GetRegionInfoResponse response;
> if (!hasBestSplitRow()) {
> LOG.info(
> "{} splitKey isn't explicitly specified, will try to find a best
> split key from RS {}",
> node.getRegionInfo().getRegionNameAsString(),
> node.getRegionLocation());
> response = AssignmentManagerUtil.getRegionInfoResponse(env,
> node.getRegionLocation(), // <-- BUG: node.getRegionLocation() can be null
> node.getRegionInfo(), true);
> bestSplitRow =
> response.hasBestSplitRow() ? response.getBestSplitRow().toByteArray()
> : null;
> } else {
> response = AssignmentManagerUtil.getRegionInfoResponse(env,
> node.getRegionLocation(), // <-- BUG: node.getRegionLocation() can be null
> node.getRegionInfo(), false);
> } {code}
> h2. Why This Happens
> 1. *`checkOnline()` is insufficient:* Before `checkSplittable()` is called,
> `checkOnline(env, regionToSplit)` is invoked (line 119). However,
> `checkOnline()` only verifies that the region state is `OPEN`:
> {code:java}
> // RegionStateNode.checkOnline()
> public void checkOnline() throws DoNotRetryRegionException {
> RegionInfo ri = getRegionInfo();
> State s = state;
> if (s != State.OPEN) {
> throw new DoNotRetryRegionException(ri.getEncodedName() + " is not
> OPEN; state=" + s);
> }
> // ... other checks but NO check for getRegionLocation() != null
> } {code}
> 2. *Region can be OPEN with null location:* After a RegionServer restart, a
> region can temporarily be in the OPEN state (or transition to OPEN) before a
> server location is assigned. During this window, `node.getRegionLocation()`
> returns `null`.
> 3. *Null propagation:* The null `ServerName` is passed down:
> - `AssignmentManagerUtil.getRegionInfoResponse(env, null, ...)` (line 76)
> - `env.getMasterServices().getClusterConnection().getAdmin(null)`
> (ConnectionImplementation.java:1440)
> - `this.rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)`
> (AbstractRpcClient.java:538)
> - `createAddr(null)` which does `null.getHostname()` causing NPE
> (AbstractRpcClient.java:459)
>
>
>
> h2. Call Flow to NPE
> {code:java}
> SplitTableRegionProcedure.<init>() [line 137]
> -> checkSplittable() [line 220]
> -> AssignmentManagerUtil.getRegionInfoResponse(env,
> node.getRegionLocation(), ...) [line 76]
> ->
> env.getMasterServices().getClusterConnection().getAdmin(regionLocation) [null
> is passed]
> -> ConnectionImplementation.getAdmin(null)
> -> rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)
> -> createAddr(null)
> -> null.getHostname() // NPE! {code}
> This bug is triggered when:
> # A RegionServer is restarted (gracefully or forcefully)
> # A split operation is initiated for a region that was hosted on that
> RegionServer
> # The split is initiated before the region is fully re-assigned to a server
>
> h2. Impact
> - Split operations can fail with an uninformative NPE instead of a meaningful
> error message
> - Users cannot split regions during or immediately after RegionServer restarts
> - The NPE provides no indication of the actual problem (region not assigned)
>
> h2. Suggested Fix
> **Option 1: Add explicit null check and throw meaningful exception**
> {code:java}
> private void checkSplittable(final MasterProcedureEnv env, final RegionInfo
> regionToSplit)
> throws IOException {
> // ... existing code ... RegionStateNode node =
>
> env.getAssignmentManager().getRegionStates().getRegionStateNode(getParentRegion());
> IOException splittableCheckIOE = null;
> boolean splittable = false;
> if (node != null) {
> // ADD THIS NULL CHECK:
> ServerName regionLocation = node.getRegionLocation();
> if (regionLocation == null) {
> throw new DoNotRetryIOException(
> "Region " + regionToSplit.getShortNameToLog() +
> " is not assigned to any server. Cannot check splittability.");
> } try {
> GetRegionInfoResponse response;
> if (!hasBestSplitRow()) {
> LOG.info(
> "{} splitKey isn't explicitly specified, will try to find a best
> split key from RS {}",
> node.getRegionInfo().getRegionNameAsString(), regionLocation);
> response = AssignmentManagerUtil.getRegionInfoResponse(env,
> regionLocation,
> node.getRegionInfo(), true); {code}
> **Option 2: Enhance `checkOnline()` to also verify location is assigned**
> {code:java}
> // In RegionStateNode.java
> public void checkOnline() throws DoNotRetryRegionException {
> RegionInfo ri = getRegionInfo();
> State s = state;
> if (s != State.OPEN) {
> throw new DoNotRetryRegionException(ri.getEncodedName() + " is not OPEN;
> state=" + s);
> }
> // ADD THIS CHECK:
> if (getRegionLocation() == null) {
> throw new DoNotRetryRegionException(
> ri.getEncodedName() + " is OPEN but not assigned to any server");
> }
> // ... rest of existing checks
> } {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)