ConfX created HBASE-29805:
-----------------------------
Summary: NullPointerException in AbstractRpcClient.createAddr
after a RegionServer restart
Key: HBASE-29805
URL: https://issues.apache.org/jira/browse/HBASE-29805
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 2.6.4, 2.6.3
Reporter: ConfX
Attachments: createAddr-NPE.patch
h2. Summary
A `NullPointerException` occurs in `AbstractRpcClient.createAddr()` when
attempting to split a region after a RegionServer restart. The root cause is a
missing null check for `node.getRegionLocation()` in
`SplitTableRegionProcedure.checkSplittable()`.
h2. Failure Stacktrace
{code:java}
java.lang.NullPointerException
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.createAddr(AbstractRpcClient.java:459)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.createBlockingRpcChannel(AbstractRpcClient.java:538)
at
org.apache.hadoop.hbase.client.ConnectionImplementation.lambda$getAdmin$7(ConnectionImplementation.java:1440)
at
org.apache.hadoop.hbase.util.ConcurrentMapUtils.computeIfAbsentEx(ConcurrentMapUtils.java:51)
at
org.apache.hadoop.hbase.client.ConnectionImplementation.getAdmin(ConnectionImplementation.java:1438)
at
org.apache.hadoop.hbase.client.ServerConnectionUtils$ShortCircuitingClusterConnection.getAdmin(ServerConnectionUtils.java:88)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.getRegionInfoResponse(AssignmentManagerUtil.java:76)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkSplittable(SplitTableRegionProcedure.java:220)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:137){code}
h2. Root Cause
**File:**
`hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java`
{code:java}
if (node != null) {
  try {
    GetRegionInfoResponse response;
    if (!hasBestSplitRow()) {
      LOG.info(
        "{} splitKey isn't explicitly specified, will try to find a best split key from RS {}",
        node.getRegionInfo().getRegionNameAsString(), node.getRegionLocation());
      response = AssignmentManagerUtil.getRegionInfoResponse(env,
        node.getRegionLocation(), // <-- BUG: node.getRegionLocation() can be null
        node.getRegionInfo(), true);
      bestSplitRow =
        response.hasBestSplitRow() ? response.getBestSplitRow().toByteArray() : null;
    } else {
      response = AssignmentManagerUtil.getRegionInfoResponse(env,
        node.getRegionLocation(), // <-- BUG: node.getRegionLocation() can be null
        node.getRegionInfo(), false);
    }
    // ... rest of checkSplittable() elided ...
{code}
h2. Why This Happens
1. *`checkOnline()` is insufficient:* Before `checkSplittable()` is called,
`checkOnline(env, regionToSplit)` is invoked (line 119). However,
`checkOnline()` verifies that the region state is `OPEN` but never checks that a region location is set:
{code:java}
// RegionStateNode.checkOnline()
public void checkOnline() throws DoNotRetryRegionException {
  RegionInfo ri = getRegionInfo();
  State s = state;
  if (s != State.OPEN) {
    throw new DoNotRetryRegionException(ri.getEncodedName() + " is not OPEN; state=" + s);
  }
  // ... other checks but NO check for getRegionLocation() != null
}
{code}
2. *Region can be OPEN with null location:* After a RegionServer restart, a
region can temporarily be in the OPEN state (or transition to OPEN) before a
server location is assigned. During this window, `node.getRegionLocation()`
returns `null`.
3. *Null propagation:* The null `ServerName` is passed down:
- `AssignmentManagerUtil.getRegionInfoResponse(env, null, ...)` (line 76)
- `env.getMasterServices().getClusterConnection().getAdmin(null)`
(ConnectionImplementation.java:1440)
- `this.rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)`
(AbstractRpcClient.java:538)
- `createAddr(null)`, which calls `getHostname()` on the null `ServerName` and throws the NPE
(AbstractRpcClient.java:459)
h2. Call Flow to NPE
{code:java}
SplitTableRegionProcedure.<init>() [line 137]
  -> checkSplittable() [line 220]
    -> AssignmentManagerUtil.getRegionInfoResponse(env, node.getRegionLocation(), ...) [line 76]
      -> env.getMasterServices().getClusterConnection().getAdmin(regionLocation) [null is passed]
        -> ConnectionImplementation.getAdmin(null)
          -> rpcClient.createBlockingRpcChannel(null, user, rpcTimeout)
            -> createAddr(null)
              -> null.getHostname() // NPE!
{code}
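The propagation can be modelled in miniature to show why the NPE surfaces so far from its origin. The following is a minimal, self-contained sketch, not HBase code: `ServerName` is a hypothetical stand-in class, and the four static methods only borrow the names of the real layers to mirror the chain. Because every intermediate layer forwards the argument without touching it, the exception is raised in the deepest frame.

```java
public class NullPropagationDemo {
    /** Hypothetical stand-in for HBase's ServerName; only the fields used here. */
    public static final class ServerName {
        private final String hostname;
        private final int port;

        public ServerName(String hostname, int port) {
            this.hostname = hostname;
            this.port = port;
        }

        public String getHostname() { return hostname; }
        public int getPort() { return port; }
    }

    // Stand-in for createAddr(): the first place the argument is dereferenced.
    public static String createAddr(ServerName sn) {
        return sn.getHostname() + ":" + sn.getPort(); // throws NPE when sn is null
    }

    // Stand-ins for the intermediate layers, which forward the argument untouched.
    public static String createBlockingRpcChannel(ServerName sn) { return createAddr(sn); }
    public static String getAdmin(ServerName sn) { return createBlockingRpcChannel(sn); }
    public static String getRegionInfoResponse(ServerName sn) { return getAdmin(sn); }

    /** Returns the method name of the NPE's top stack frame, or null if the call succeeds. */
    public static String failingFrame(ServerName location) {
        try {
            getRegionInfoResponse(location);
            return null;
        } catch (NullPointerException e) {
            return e.getStackTrace()[0].getMethodName();
        }
    }

    public static void main(String[] args) {
        System.out.println("null location fails in: " + failingFrame(null));
        System.out.println("valid location fails in: "
            + failingFrame(new ServerName("rs1.example.com", 16020)));
    }
}
```

In this model the null enters at `getRegionInfoResponse` but the failing frame is `createAddr`, three calls deeper, which matches the shape of the stack trace reported above.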
This bug is triggered when:
# A RegionServer is restarted (gracefully or forcefully)
# A split operation is initiated for a region that was hosted on that
RegionServer
# The split is initiated before the region is fully re-assigned to a server
h2. Impact
- Split operations fail with an uninformative NPE that gives no indication of the actual problem (the region is not assigned to a server)
- Split requests issued during or immediately after a RegionServer restart can fail until the region is re-assigned
h2. Suggested Fix
**Option 1: Add explicit null check and throw meaningful exception**
{code:java}
private void checkSplittable(final MasterProcedureEnv env, final RegionInfo regionToSplit)
  throws IOException {
  // ... existing code ...
  RegionStateNode node =
    env.getAssignmentManager().getRegionStates().getRegionStateNode(getParentRegion());
  IOException splittableCheckIOE = null;
  boolean splittable = false;
  if (node != null) {
    // ADD THIS NULL CHECK:
    ServerName regionLocation = node.getRegionLocation();
    if (regionLocation == null) {
      throw new DoNotRetryIOException(
        "Region " + regionToSplit.getShortNameToLog() +
        " is not assigned to any server. Cannot check splittability.");
    }
    try {
      GetRegionInfoResponse response;
      if (!hasBestSplitRow()) {
        LOG.info(
          "{} splitKey isn't explicitly specified, will try to find a best split key from RS {}",
          node.getRegionInfo().getRegionNameAsString(), regionLocation);
        response = AssignmentManagerUtil.getRegionInfoResponse(env,
          regionLocation,
          node.getRegionInfo(), true);
        // ... rest of method unchanged ...
{code}
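The fail-fast idea behind Option 1 can be tried in isolation. The sketch below is self-contained and makes several assumptions: `ServerName` and `RegionNode` are hypothetical stand-ins for the HBase types, `askRegionServer` is a placeholder for the real RPC, and a plain `IOException` is used in place of `DoNotRetryIOException` so the example compiles without HBase on the classpath.

```java
import java.io.IOException;

public class SplitGuardSketch {
    /** Hypothetical stand-in for HBase's ServerName. */
    public static final class ServerName {
        private final String hostname;

        public ServerName(String hostname) { this.hostname = hostname; }

        public String getHostname() { return hostname; }
    }

    /** Hypothetical stand-in for RegionStateNode; only name and location matter here. */
    public static final class RegionNode {
        private final String regionName;
        private final ServerName location; // null while the region is unassigned

        public RegionNode(String regionName, ServerName location) {
            this.regionName = regionName;
            this.location = location;
        }

        public String getRegionName() { return regionName; }
        public ServerName getLocation() { return location; }
    }

    // Placeholder for the RPC to the RegionServer; dereferences the location.
    static boolean askRegionServer(ServerName location) {
        return !location.getHostname().isEmpty();
    }

    /** Reads the location once, rejects null with a clear message, then proceeds. */
    public static boolean checkSplittable(RegionNode node) throws IOException {
        ServerName location = node.getLocation();
        if (location == null) {
            throw new IOException("Region " + node.getRegionName()
                + " is not assigned to any server. Cannot check splittability.");
        }
        return askRegionServer(location);
    }

    /** Demo helper: returns the outcome or the failure message as a string. */
    public static String describe(RegionNode node) {
        try {
            return "splittable=" + checkSplittable(node);
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new RegionNode("r1", null)));
        System.out.println(describe(new RegionNode("r2", new ServerName("rs1.example.com"))));
    }
}
```

Reading the location into a local variable once, as the suggested patch does, also avoids a second `getRegionLocation()` call returning a different value between the check and the use.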
**Option 2: Enhance `checkOnline()` to also verify location is assigned**
{code:java}
// In RegionStateNode.java
public void checkOnline() throws DoNotRetryRegionException {
  RegionInfo ri = getRegionInfo();
  State s = state;
  if (s != State.OPEN) {
    throw new DoNotRetryRegionException(ri.getEncodedName() + " is not OPEN; state=" + s);
  }
  // ADD THIS CHECK:
  if (getRegionLocation() == null) {
    throw new DoNotRetryRegionException(
      ri.getEncodedName() + " is OPEN but not assigned to any server");
  }
  // ... rest of existing checks
}
{code}
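Option 2 can likewise be exercised outside HBase. This is a minimal sketch under stated assumptions: `RegionNode`, `State`, and `RegionNotOnlineException` are hypothetical stand-ins for `RegionStateNode`, its state enum, and `DoNotRetryRegionException`, and the location is modelled as a plain string. It illustrates that an `OPEN` region without a location is rejected by the same check that rejects a non-`OPEN` region.

```java
public class CheckOnlineSketch {
    public enum State { OPENING, OPEN, CLOSING, CLOSED }

    /** Hypothetical stand-in for DoNotRetryRegionException. */
    public static final class RegionNotOnlineException extends Exception {
        public RegionNotOnlineException(String message) { super(message); }
    }

    /** Hypothetical stand-in for RegionStateNode. */
    public static final class RegionNode {
        private final String encodedName;
        private final State state;
        private final String regionLocation; // "host:port", or null when unassigned

        public RegionNode(String encodedName, State state, String regionLocation) {
            this.encodedName = encodedName;
            this.state = state;
            this.regionLocation = regionLocation;
        }

        /** Online means OPEN and assigned to a server. */
        public void checkOnline() throws RegionNotOnlineException {
            if (state != State.OPEN) {
                throw new RegionNotOnlineException(
                    encodedName + " is not OPEN; state=" + state);
            }
            if (regionLocation == null) {
                throw new RegionNotOnlineException(
                    encodedName + " is OPEN but not assigned to any server");
            }
        }
    }

    /** Demo helper: returns "online" or the rejection message. */
    public static String verdict(State state, String regionLocation) {
        try {
            new RegionNode("abc123", state, regionLocation).checkOnline();
            return "online";
        } catch (RegionNotOnlineException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(verdict(State.OPEN, "rs1.example.com:16020"));
        System.out.println(verdict(State.OPEN, null));
        System.out.println(verdict(State.CLOSED, null));
    }
}
```

Compared with Option 1, placing the check here would cover every caller of `checkOnline()`, not only the split path; whether all of those callers should reject an unassigned `OPEN` region is a question for the HBase maintainers.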