[ https://issues.apache.org/jira/browse/HBASE-28951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Umesh Kumar Kumawat updated HBASE-28951:
----------------------------------------
Description:
When a worker RS gets aborted after the SplitWALRemoteProcedure has been dispatched, RegionServerTracker takes care of it and [aborts the pending operation|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java#L160] on the aborting region server as part of [expireServer|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionServerTracker.java#L172]. This lets the parent procedure, SplitWALProcedure, choose another worker RS, but the aborting RS is still splitting the same WAL. While creating the recovered edits, both region servers therefore try to write the same file; whichever RS starts writing the file later deletes the file already created by the other, which causes the failures (an illustrative sketch of the race is included at the end of this description).

h4. Logs

Region server tracker marking the remote procedure as failed:
{code:java}
2024-10-01 23:02:32,274 WARN [RegionServerTracker-0] procedure.SplitWALRemoteProcedure - Sent hdfs://hbase1a/hbase/WALs/regionserver-33X<cluster>,XXXXX,1727362162836-splitting/regionserver-33X<cluster>%2CXXXXX%2C1727362162836.1727822221172 to wrong server regionserver-283X<cluster>,XXXXX,1727420096936, try another org.apache.hadoop.hbase.DoNotRetryIOException: server not online regionserver-283X<cluster>,XXXXX,1727420096936 at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:163) at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:61) at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:417) at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:201) at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:176) at org.apache.hadoop.hbase.master.ServerManager.lambda$expireServer$2(ServerManager.java:576) at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:576) at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:530) at org.apache.hadoop.hbase.master.RegionServerTracker.processAsActiveMaster(RegionServerTracker.java:172) at org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750){code}
{code:java}
2024-10-01 23:02:32,340 INFO [PEWorker-21] procedure2.ProcedureExecutor - Finished pid=122448609, ppid=122448595, state=SUCCESS; SplitWALRemoteProcedure regionserver-33X<cluster>,XXXXX%2C1727362162836.1727822221172, worker=regionserver-283X<cluster>,XXXXX,1727420096936 in 54.0500 sec{code}
The parent SplitWALProcedure will then create another RemoteProcedure for this WAL:
{code:java}
2024-10-01 23:02:32,726 WARN [PEWorker-17] procedure.SplitWALProcedure - Failed to split wal hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33X<cluster>,XXXXX%2C1727362162836.1727822221172 by server regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936, retry...{code}
{code:java}
2024-10-01 23:02:39,414 INFO [PEWorker-28] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=122452821, ppid=122448595, state=RUNNABLE; SplitWALRemoteProcedure regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172, worker=regionserver-323.regionserver.hbase.<cluster>,XXXXX,1727308912906}]{code}
Splitting is still in progress on the dying RS:
{code:java}
2024-10-01 23:02:45,652 INFO [G_REPLAY_OPS-regionserver/regionserver-283:XXXXX-0] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172, size=128.1 M (134313407bytes){code}
rs-323 creating the recovered edits:
{code:java}
2024-10-01 23:02:42,876 INFO [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2] monitor.StreamSlowMonitor - New stream slow monitor 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
{code:java}
2024-10-01 23:02:43,171 INFO [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2] wal.RecoveredEditsOutputSink - Creating recovered edits writer path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
rs-283 deletes the above file and creates it again:
{code:java}
2024-10-01 23:02:50,520 WARN [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] wal.RecoveredEditsOutputSink - Found old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp, length=0{code}
{code:java}
2024-10-01 23:02:50,794 INFO [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] monitor.StreamSlowMonitor - New stream slow monitor 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
{code:java}
2024-10-01 23:02:51,135 INFO [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] wal.RecoveredEditsOutputSink - Creating recovered edits writer path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
Now rs-323 starts failing:
{code:java}
2024-10-01 23:03:02,137 WARN [Thread-1081409] hdfs.DataStreamer - DataStreamer Exception java.io.FileNotFoundException: File does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33<cluster>%2C60020%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease.
Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88) at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091) at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717) Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33<cluster>%2C60020%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease. 
Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567) at org.apache.hadoop.ipc.Client.call(Client.java:1513) at org.apache.hadoop.ipc.Client.call(Client.java:1410) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) at com.sun.proxy.$Proxy18.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495) at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495) at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) at com.sun.proxy.$Proxy19.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361) at com.sun.proxy.$Proxy20.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361) at com.sun.proxy.$Proxy20.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088) ... 3 more {code} {code:java} 2024-10-01 23:03:02,143 ERROR [split-log-closeStream-pool-1] wal.RecoveredEditsOutputSink - Could not close recovered edits at hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp java.io.FileNotFoundException: File does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88) at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091) at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717) Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567) at org.apache.hadoop.ipc.Client.call(Client.java:1513) at org.apache.hadoop.ipc.Client.call(Client.java:1410) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) at com.sun.proxy.$Proxy18.addBlock(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495) at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495) at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) at com.sun.proxy.$Proxy19.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361) at com.sun.proxy.$Proxy20.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361) at com.sun.proxy.$Proxy20.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088) {code}
One more point worth noting: the aborting RS starts splitting a little late. One such case is shown below. rs-219 is the aborting RS and rs-216 is the second worker. The aborting RS started splitting about 50 seconds after the second worker, even though the aborting RS was the one that received the RPC request first.
{code:java}
2024-10-01 23:02:15,499 INFO [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024, size=92 (92bytes){code}
{code:java}
2024-10-01 23:03:05,793 INFO [G_REPLAY_OPS-regionserver/regionserver-219:XXXXX-1] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024, size=93.4 M (97950842bytes){code}
{code:java}
2024-10-01 23:03:15,405 INFO [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024, size=93.4 M (97950842bytes){code}
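To make the race easier to follow, below is a minimal, self-contained sketch of the collision on the recovered-edits temp file. This is not the actual WALSplitter/RecoveredEditsOutputSink code: the helper recoveredEditsTempPath and the literal paths are made up for illustration; only the exists/delete/re-create pattern mirrors the "Found old edits file ... Deleting" behaviour in the logs above.
{code:java}
// Hypothetical illustration of the race described above; NOT HBase's real implementation.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecoveredEditsRaceSketch {

  // Both workers derive the temp file name from the same inputs (region dir,
  // first sequence id, source WAL name), so the paths are identical.
  static Path recoveredEditsTempPath(String regionDir, long seqId, String walName) {
    return new Path(regionDir + "/recovered.edits/"
        + String.format("%019d", seqId) + "-" + walName + ".temp");
  }

  // Roughly what each split worker does when it opens the writer for a region:
  // if an old (empty) temp file exists, delete it and create a fresh one.
  static FSDataOutputStream openWriter(FileSystem fs, Path temp) throws Exception {
    if (fs.exists(temp)) {
      // The worker that arrives later deletes the file the other worker is
      // still writing, which invalidates that worker's HDFS lease.
      fs.delete(temp, false);
    }
    return fs.create(temp);
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path temp = recoveredEditsTempPath(
        "/hbase/data/default/T/region1", 7468971L, "wal.1727822221172");

    // First worker (whichever RS reaches this file first) opens the writer.
    FSDataOutputStream first = openWriter(fs, temp);

    // Second worker starts later, finds the "old" file, deletes it and
    // re-creates it under its own lease.
    FSDataOutputStream second = openWriter(fs, temp);

    // The first worker's stream now points at a deleted inode; its next
    // block allocation or close fails with
    // FileNotFoundException: "File does not exist ... [Lease ...]".
    first.writeBytes("edit");
    first.close();

    second.writeBytes("edit");
    second.close();
  }
}
{code}
Either ordering of the two workers is possible; in the logs above rs-323 created the file first, the aborting rs-283 then deleted and re-created it, and rs-323's subsequent writes and close failed with FileNotFoundException.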
> WAL Split Delays Due to Concurrent WAL Splitting During worker RS Abort
> -----------------------------------------------------------------------
>
>                 Key: HBASE-28951
>                 URL: https://issues.apache.org/jira/browse/HBASE-28951
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.8
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Major
>              Labels: pull-request-available
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
    at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
    at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
{code}

One more point worth noting: the aborting RS starts splitting a little late. One such case is shown below, where rs-219 is the aborting server and rs-216 is the second worker. The aborting RS started splitting about 50 seconds after the second worker, even though the aborting RS was the one that received the RPC request first.
{code:java}
2024-10-01 23:02:15,499 INFO [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024, size=92 (92bytes){code}
{code:java}
2024-10-01 23:03:05,793 INFO [G_REPLAY_OPS-regionserver/regionserver-219:XXXXX-1] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024, size=93.4 M (97950842bytes){code}
{code:java}
2024-10-01 23:03:15,405 INFO [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024, size=93.4 M (97950842bytes){code}
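To make the failure mode easier to follow, here is a minimal, hypothetical standalone sketch. This is plain HDFS client code, not the actual WALSplitter/RecoveredEditsOutputSink implementation, and the namenode URI, temp path, block size, and write sizes are invented for illustration. Two independent DFS clients stand in for the two split workers: both target the same recovered-edits temp path, and the client that starts later re-creates the path with overwrite enabled, removing the inode the earlier client is still streaming to, so the earlier client's next addBlock/close fails with a "File does not exist ... (inode ...)" error like the traces above.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical repro sketch only: two writers racing on the same recovered.edits temp path.
public class RecoveredEditsOverwriteRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI hdfs = URI.create("hdfs://localhost:8020"); // assumed test namenode

    // Both "workers" end up with the identical temp file name for the same WAL,
    // which is why the paths collide in the first place (name is made up here).
    Path tempEdits =
        new Path("/tmp/recovered.edits/0000000000000000001-wal.1727823118024.temp");

    // newInstance() bypasses the FileSystem cache, so each writer has its own
    // DFSClient (its own lease holder), like the two region servers do.
    FileSystem earlyWorker = FileSystem.newInstance(hdfs, conf);
    FileSystem lateWorker = FileSystem.newInstance(hdfs, conf);

    long blockSize = 1L << 20; // 1 MB blocks so a second addBlock happens quickly
    FSDataOutputStream out1 =
        earlyWorker.create(tempEdits, true, 4096, (short) 1, blockSize);
    out1.write(new byte[512 * 1024]);
    out1.hflush(); // first worker is mid-write with an open lease

    // ~50 seconds later in the real incident: the late worker re-creates the
    // same path with overwrite=true, which deletes the first writer's file.
    FSDataOutputStream out2 =
        lateWorker.create(tempEdits, true, 4096, (short) 1, blockSize);

    // The first writer now fails the next time it must talk to the NameNode
    // (addBlock for the next block, or completeFile on close). This is expected
    // to throw a FileNotFoundException-style error, as in the logs above.
    out1.write(new byte[2 * 1024 * 1024]);
    out1.close();

    out2.close();
    earlyWorker.close();
    lateWorker.close();
  }
}
{code}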