[
https://issues.apache.org/jira/browse/HBASE-29804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ConfX updated HBASE-29804:
--------------------------
Attachment: serverAdded-NPE.patch
> NullPointerException in WorkerAssigner.serverAdded during Master Shutdown and
> Restart
> -------------------------------------------------------------------------------------
>
> Key: HBASE-29804
> URL: https://issues.apache.org/jira/browse/HBASE-29804
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 2.6.3, 2.6.4
> Reporter: ConfX
> Priority: Major
> Attachments: serverAdded-NPE.patch
>
>
> h2. Summary
>
> `WorkerAssigner.serverAdded()` throws a NullPointerException when called during
> or after Master shutdown because it dereferences the result of
> `master.getMasterProcedureExecutor()` without a null check, even though the
> executor is explicitly set to null during the shutdown process.
> h2. Affected Component
> - *File:*
> `hbase-server/src/main/java/org/apache/hadoop/hbase/master/WorkerAssigner.java`
> - *Method:* `serverAdded(ServerName worker)` at line 83
> h2. Root Cause Analysis
> h3. The Bug
>
> The `WorkerAssigner` class implements `ServerListener` and registers itself
> with `ServerManager` to receive server event notifications. In the
> `serverAdded()` callback method, it accesses a chain of objects without null
> checks:
> {code:java}
> @Override
> public void serverAdded(ServerName worker) {
>   this.wake(master.getMasterProcedureExecutor().getEnvironment().getProcedureScheduler());
> }{code}
> h3. Why NPE Occurs
> During Master shutdown, the `procedureExecutor` is explicitly set to `null`
> in `HMaster.stopProcedureExecutor()`:
> {code:java}
> // HMaster.java:1849-1856
> private void stopProcedureExecutor() {
>   if (procedureExecutor != null) {
>     configurationManager.deregisterObserver(procedureExecutor.getEnvironment());
>     procedureExecutor.getEnvironment().getRemoteDispatcher().stop();
>     procedureExecutor.stop();
>     procedureExecutor.join();
>     procedureExecutor = null; // <-- Set to null here
>   }
>   // ...
> }{code}
> However, `WorkerAssigner` is *never unregistered* from `ServerManager` during
> shutdown. If a `serverAdded()` event fires during or after shutdown (while
> `WorkerAssigner` is still registered as a listener), the callback throws an
> NPE because `getMasterProcedureExecutor()` returns `null`.
> h3. Call Chain When Failure Occurs
> 1. `ServerManager.regionServerReport()` (line 295)
> 2. `ServerManager.checkAndRecordNewServer()` (line 377)
> 3. `ServerListener.serverAdded()` callback for all registered listeners
> 4. `WorkerAssigner.serverAdded()` (line 83) - *NPE occurs here*
> h3. Stacktrace
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.master.WorkerAssigner.serverAdded(WorkerAssigner.java:83)
>   at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:377)
>   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:295)
>   at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:573)
> {code}
> h3. Potential Fix
> h4. Option 1: Add Null Check in serverAdded() (Recommended)
>
> {code:java}
> @Override
> public void serverAdded(ServerName worker) {
>   ProcedureExecutor<MasterProcedureEnv> executor = master.getMasterProcedureExecutor();
>   if (executor != null && executor.getEnvironment() != null) {
>     this.wake(executor.getEnvironment().getProcedureScheduler());
>   }
> } {code}
> h4. Option 2: Unregister WorkerAssigner During Shutdown
>
> Add a `close()` or `stop()` method to `WorkerAssigner` that unregisters it
> from `ServerManager`, and call it during master shutdown in `SplitWALManager`
> and `SnapshotManager`.
>
> {code:java}
> public void stop() {
>   ServerManager sm = this.master.getServerManager();
>   if (sm != null) {
>     sm.unregisterListener(this);
>   }
> } {code}
> h4. Option 3: Both (Most Robust)
> Implement both the null check (defensive programming) AND proper
> unregistration (proper lifecycle management).
>
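The failure mode and both proposed fixes can be modelled outside HBase with a small, self-contained sketch. Every type below (`Master`, `Executor`, `Listener`, `UnguardedAssigner`, `GuardedAssigner`) is a hypothetical, heavily simplified stand-in and not the real HBase class; the sketch only assumes the behaviour described in the report, namely that the executor reference becomes null on shutdown while the listener stays registered.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-ins only; none of these are the real HBase classes.
class ListenerShutdownSketch {

  /** Simplified stand-in for ServerListener. */
  interface Listener {
    void serverAdded(String serverName);
  }

  /** Simplified stand-in for the procedure executor; it only counts wake-ups. */
  static class Executor {
    int wakeups = 0;
    void wake() { wakeups++; }
  }

  /** Stand-in combining the roles of HMaster and ServerManager. */
  static class Master {
    private volatile Executor executor = new Executor();
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    Executor getExecutor() { return executor; }
    void registerListener(Listener l) { listeners.add(l); }
    boolean unregisterListener(Listener l) { return listeners.remove(l); }

    /** Mirrors the reported behaviour: the reference is nulled on shutdown. */
    void stopExecutor() { executor = null; }

    /** Notifies every listener that is still registered. */
    void reportNewServer(String serverName) {
      for (Listener l : listeners) {
        l.serverAdded(serverName);
      }
    }
  }

  /** Dereferences the executor without a null check, like the reported code. */
  static class UnguardedAssigner implements Listener {
    private final Master master;
    UnguardedAssigner(Master master) { this.master = master; }
    @Override
    public void serverAdded(String serverName) {
      master.getExecutor().wake();
    }
  }

  /** Option 1 (null check) plus Option 2 (explicit unregistration). */
  static class GuardedAssigner implements Listener {
    private final Master master;
    int calls = 0;    // callbacks received
    int skipped = 0;  // callbacks ignored because the executor was gone
    GuardedAssigner(Master master) { this.master = master; }
    @Override
    public void serverAdded(String serverName) {
      calls++;
      Executor e = master.getExecutor(); // read once, then test the local copy
      if (e != null) {
        e.wake();
      } else {
        skipped++;
      }
    }
    void stop() { master.unregisterListener(this); }
  }

  static boolean unguardedThrowsAfterShutdown() {
    Master m = new Master();
    m.registerListener(new UnguardedAssigner(m));
    m.stopExecutor();
    try {
      m.reportNewServer("rs1");
      return false;
    } catch (NullPointerException expected) {
      return true;
    }
  }

  static int guardedSkipsAfterShutdown() {
    Master m = new Master();
    GuardedAssigner a = new GuardedAssigner(m);
    m.registerListener(a);
    m.stopExecutor();
    m.reportNewServer("rs1");
    return a.skipped;
  }

  static int callbacksAfterUnregister() {
    Master m = new Master();
    GuardedAssigner a = new GuardedAssigner(m);
    m.registerListener(a);
    a.stop();
    m.stopExecutor();
    m.reportNewServer("rs1");
    return a.calls;
  }

  public static void main(String[] args) {
    System.out.println("unguarded throws NPE: " + unguardedThrowsAfterShutdown());
    System.out.println("guarded skipped callbacks: " + guardedSkipsAfterShutdown());
    System.out.println("callbacks after unregister: " + callbacksAfterUnregister());
  }
}
```

Reading the executor into a local variable before the null check matters in this model: testing the field and then calling the getter a second time would leave a window in which a concurrent shutdown could null the reference between the two reads.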