[
https://issues.apache.org/jira/browse/HBASE-28659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani reassigned HBASE-28659:
------------------------------------
Assignee: Swarali Joshi
> NPE in hmaster (setServerState function)
> ----------------------------------------
>
> Key: HBASE-28659
> URL: https://issues.apache.org/jira/browse/HBASE-28659
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 3.0.0-beta-2
> Reporter: Ke Han
> Assignee: Swarali Joshi
> Priority: Major
> Attachments: hbase--master-d16bb50815b7.log
>
>
> I hit an NPE on the master node after migrating data from 2.5.8.
> {code:java}
> [ERROR LOG]
> executionId = LrmpjV32
> ConfigIdx = test9
> Node02024-05-11T10:45:57,896 ERROR [PEWorker-15]
> procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=48,
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure
> hregion1,16020,1715424228375, splitWal=true, meta=true
> java.lang.NullPointerException: null
> at
> org.apache.hadoop.hbase.master.assignment.RegionStates.setServerState(RegionStates.java:409)
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
> at
> org.apache.hadoop.hbase.master.assignment.RegionStates.logSplitting(RegionStates.java:435)
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:226)
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
> 2024-05-11T10:45:57,918 ERROR [PEWorker-15] procedure2.ProcedureExecutor:
> Root Procedure pid=48, state=FAILED:SERVER_CRASH_SPLIT_LOGS, hasLock=true,
> exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime
> exception: pid=48, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true;
> ServerCrashProcedure hregion1,16020,1715424228375, splitWal=true,
> meta=true:java.lang.NullPointerException; ServerCrashProcedure
> hregion1,16020,1715424228375, splitWal=true, meta=true does not support
> rollback but the execution failed and try to rollback, code bug?
> org.apache.hadoop.hbase.procedure2.RemoteProcedureException:
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1826)
> ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1484)
> ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:80)
> ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT] {code}
> h1. Reproduce
> This bug cannot be reproduced deterministically.
> I upgraded an HBase cluster from 2.5.8 to 3.0.0 (commit: 516c89e8597fb) with 4
> nodes (1 HM, 2 RS, 1 HDFS).
> h1. Root Cause
> From the stack trace, the bug is in the setServerState function.
> {code:java}
> // hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
> private void setServerState(ServerName serverName, ServerState state) {
>   ServerStateNode serverNode = getServerNode(serverName);
>   synchronized (serverNode) { // NPE here when serverNode is null
>     serverNode.setState(state);
>   }
> } {code}
> serverNode can sometimes be null, and there is no null check before it is used.
> {code:java}
> /** Returns Pertinent ServerStateNode or NULL if none found (Do not make modifications). */
> public ServerStateNode getServerNode(final ServerName serverName) {
>   return serverMap.get(serverName);
> } {code}
> A potential fix could be adding a null check on serverNode. However, how the
> master gets into this buggy state is unclear.
> I am running the workloads that could trigger the bug multiple times to see
> if I can find more information.
> I have attached the error log.
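
The null check suggested in the report could look roughly like the sketch below. It is a minimal, self-contained illustration, not the actual HBase patch: RegionStatesSketch, the trimmed-down ServerState enum, the String server name, addServer, and the boolean return value are all stand-ins invented for this example. Whether logging and skipping is the right recovery (rather than creating the missing node or failing the procedure) is an open question, since the report does not establish how the node went missing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for the HBase types; these are NOT the real classes.
enum ServerState { ONLINE, CRASHED, SPLITTING, OFFLINE }

final class ServerStateNode {
  private volatile ServerState state = ServerState.ONLINE;

  void setState(ServerState state) { this.state = state; }

  ServerState getState() { return state; }
}

final class RegionStatesSketch {
  // Hypothetical: keyed by a plain String instead of HBase's ServerName.
  private final Map<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();

  // Registers a node for the server if none exists yet.
  void addServer(String serverName) {
    serverMap.putIfAbsent(serverName, new ServerStateNode());
  }

  // Returns the node for the server, or null if none is registered.
  ServerStateNode getServerNode(String serverName) {
    return serverMap.get(serverName);
  }

  // Returns false (and logs) instead of throwing an NPE when the node is missing.
  boolean setServerState(String serverName, ServerState state) {
    ServerStateNode serverNode = getServerNode(serverName);
    if (serverNode == null) {
      System.err.println("No ServerStateNode for " + serverName
          + "; skipping transition to " + state);
      return false;
    }
    synchronized (serverNode) {
      serverNode.setState(state);
    }
    return true;
  }
}
```

With this guard, a caller such as logSplitting would see a false return for an unknown server rather than an uncaught NullPointerException in the procedure worker, which at least makes the missing-node case visible in the log.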