[ https://issues.apache.org/jira/browse/HBASE-28951 ]
ASF GitHub Bot updated HBASE-28951:
-----------------------------------
    Labels: pull-request-available  (was: )

WAL Split Delays Due to Concurrent WAL Splitting During Worker RS Abort
------------------------------------------------------------------------

                Key: HBASE-28951
                URL: https://issues.apache.org/jira/browse/HBASE-28951
            Project: HBase
         Issue Type: Bug
   Affects Versions: 2.5.8
           Reporter: Umesh Kumar Kumawat
           Assignee: Umesh Kumar Kumawat
           Priority: Major
             Labels: pull-request-available

When a worker RS is aborted after a SplitWALRemoteProcedure has already been dispatched to it, RegionServerTracker handles the expiration and [aborts the pending operation|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java#L160] on the aborting region server as part of [expireServer|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionServerTracker.java#L172].

This lets the parent procedure, SplitWALProcedure, choose another worker RS, but the aborting RS is still splitting the WAL. While creating the recovered edits, both workers write to the same file: the RS that reaches the file later deletes the file written by the earlier one, which makes the earlier writer fail.
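The collision is visible in the file names in the logs below: the recovered-edits temp name is derived purely from the first sequence id and the WAL file name, with no per-worker component, so two concurrent splitters of the same WAL target an identical path. A minimal, hypothetical sketch of that naming pattern (illustrative Java only, not the actual RecoveredEditsOutputSink code; class, method names, and the encoding here are simplified):
{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Hypothetical sketch of why two concurrent splitters collide: the
 * recovered-edits temp file name is a pure function of the region,
 * the first sequence id, and the WAL name, with no per-worker part.
 * Names and layout are illustrative, not HBase's actual code.
 */
public class RecoveredEditsPathSketch {
  static Path tempEditsPath(Path regionDir, long firstSeqId, String walName) {
    // e.g. .../recovered.edits/0000000000007468971-<encoded-wal-name>.temp
    String name = String.format("%019d-%s.temp", firstSeqId, walName.replace(",", "%2C"));
    return regionDir.resolve("recovered.edits").resolve(name);
  }

  public static void main(String[] args) {
    Path regionDir = Paths.get("/hbase/data/default/T/d3be13a8187ff35746fff1def4f4dba4");
    String wal = "rs-33,60020,1727362162836.1727822221172"; // placeholder WAL name
    // Two workers splitting the same WAL compute the same path:
    System.out.println(tempEditsPath(regionDir, 7468971L, wal));
    System.out.println(tempEditsPath(regionDir, 7468971L, wal));
  }
}
{code}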
h4. Logs

RegionServerTracker marks the remote procedure as failed:
{code:java}
2024-10-01 23:02:32,274 WARN [RegionServerTracker-0] procedure.SplitWALRemoteProcedure - Sent hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172 to wrong server regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936, try another
org.apache.hadoop.hbase.DoNotRetryIOException: server not online regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936
    at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:163)
    at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:61)
    at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:417)
    at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:201)
    at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:176)
    at org.apache.hadoop.hbase.master.ServerManager.lambda$expireServer$2(ServerManager.java:576)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:576)
    at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:530)
    at org.apache.hadoop.hbase.master.RegionServerTracker.processAsActiveMaster(RegionServerTracker.java:172)
    at org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:206)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
{code:java}
2024-10-01 23:02:32,340 INFO [PEWorker-21] procedure2.ProcedureExecutor - Finished pid=122448609, ppid=122448595, state=SUCCESS; SplitWALRemoteProcedure regionserver-33.regionserver.hbase.<cluster>,XXXXX%2C1727362162836.1727822221172, worker=regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936 in 54.0500 sec
{code}
The parent SplitWALProcedure then creates another SplitWALRemoteProcedure for the same WAL:
{code:java}
2024-10-01 23:02:32,726 WARN [PEWorker-17] procedure.SplitWALProcedure - Failed to split wal hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>,XXXXX%2C1727362162836.1727822221172 by server regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936, retry...
{code}
{code:java}
2024-10-01 23:02:39,414 INFO [PEWorker-28] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=122452821, ppid=122448595, state=RUNNABLE; SplitWALRemoteProcedure regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172, worker=regionserver-323.regionserver.hbase.<cluster>,XXXXX,1727308912906}]
{code}
Meanwhile, splitting is still in progress on the dying RS:
{code:java}
2024-10-01 23:02:45,652 INFO [G_REPLAY_OPS-regionserver/regionserver-283:XXXXX-0] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172, size=128.1 M (134313407bytes)
{code}
rs-323 creates the recovered edits:
{code:java}
2024-10-01 23:02:42,876 INFO [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2] monitor.StreamSlowMonitor - New stream slow monitor 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
{code}
{code:java}
2024-10-01 23:02:43,171 INFO [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2] wal.RecoveredEditsOutputSink - Creating recovered edits writer path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
{code}
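The sequence that follows in the next log entries (delete by rs-283, then failures on rs-323) is the classic HDFS pattern where a second client deletes and recreates a path that a first client still holds open for write. A hedged, standalone repro sketch, assuming a reachable HDFS (the fs.defaultFS URI, path, and sizes are placeholders; the two DFS clients stand in for the two region servers):
{code:java}
// Hedged repro of the failure mode seen below: writer A opens a file,
// writer B deletes and recreates the same path, then A's next block
// allocation typically fails with FileNotFoundException because A's
// lease now points at a deleted inode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LeaseCollisionRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder
    Path p = new Path("/tmp/recovered-edits-collision.temp");

    FileSystem fsA = FileSystem.newInstance(conf); // plays "rs-323"
    FileSystem fsB = FileSystem.newInstance(conf); // plays "rs-283"

    FSDataOutputStream a = fsA.create(p, true);    // worker A starts writing
    a.write(new byte[1024]);
    a.hflush();

    fsB.delete(p, false);                          // worker B removes A's file...
    FSDataOutputStream b = fsB.create(p, true);    // ...and recreates the path

    // A's stream is now backed by a deleted inode; forcing a new block
    // (or close()) fails with FileNotFoundException on addBlock.
    byte[] chunk = new byte[1 << 20];
    for (int i = 0; i < 129; i++) {
      a.write(chunk);                              // exceed one 128 MB block
    }
    a.close();                                     // expected to throw here
    b.close();
    fsA.close();
    fsB.close();
  }
}
{code}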
rs-283 deletes the file created above and creates it again:
{code:java}
2024-10-01 23:02:50,520 WARN [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] wal.RecoveredEditsOutputSink - Found old edits file. It could be the result of a previous failed split attempt. Deleting hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp, length=0
{code}
{code:java}
2024-10-01 23:02:50,794 INFO [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] monitor.StreamSlowMonitor - New stream slow monitor 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
{code}
{code:java}
2024-10-01 23:02:51,135 INFO [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] wal.RecoveredEditsOutputSink - Creating recovered edits writer path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
{code}
Now rs-323 starts failing:
{code:java}
2024-10-01 23:03:02,137 WARN [Thread-1081409] hdfs.DataStreamer - DataStreamer Exception
java.io.FileNotFoundException: File does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.hbase1a.hbase.core2.aws-prod5-uswest2.aws.sfdc.is%2C60020%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.hbase1a.hbase.core2.aws-prod5-uswest2.aws.sfdc.is%2C60020%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
    at org.apache.hadoop.ipc.Client.call(Client.java:1513)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
    at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495)
    at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
    at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
    at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
    ... 3 more
{code}
{code:java}
2024-10-01 23:03:02,143 ERROR [split-log-closeStream-pool-1] wal.RecoveredEditsOutputSink - Could not close recovered edits at hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
java.io.FileNotFoundException: File does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1, pending creates: 21]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
    at org.apache.hadoop.ipc.Client.call(Client.java:1513)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
    at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495)
    at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
    at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
    at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
{code}

One more detail worth noting: the aborting RS starts splitting noticeably late. One such case is below, where rs-219 is the aborting server and rs-216 is the second worker. The aborting RS started splitting about 50 seconds after the second worker, even though the aborting RS was the one that received the RPC request first.
{code:java}
2024-10-01 23:02:15,499 INFO [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<cluster>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<cluster>%2CXXXXX%2C1727347097348.1727823118024, size=92 (92bytes)
{code}
{code:java}
2024-10-01 23:03:05,793 INFO [G_REPLAY_OPS-regionserver/regionserver-219:XXXXX-1] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<cluster>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<cluster>%2CXXXXX%2C1727347097348.1727823118024, size=93.4 M (97950842bytes)
{code}
{code:java}
2024-10-01 23:03:15,405 INFO [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - Splitting hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<cluster>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<cluster>%2CXXXXX%2C1727347097348.1727823118024, size=93.4 M (97950842bytes)
{code}
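For illustration of one possible direction (hypothetical only; not necessarily the approach taken in the linked pull request): if each worker wrote to a worker-unique temp name, a late splitter could not delete an open file belonging to an earlier one, and the workers would race only on the final atomic rename. All names in this sketch are invented:
{code:java}
import java.nio.file.*;

/**
 * Hypothetical sketch only: a worker-unique temp name means a second
 * splitter never deletes the first splitter's open stream; both race
 * only on the final rename to the canonical edits file name.
 * Class and method names here are invented for illustration.
 */
public class WorkerUniqueTempName {
  static Path workerTempPath(Path editsDir, long seqId, String walName, String workerName) {
    // Unlike the deterministic name in the logs above, this includes the worker.
    String name = String.format("%019d-%s.%s.temp", seqId, walName, workerName);
    return editsDir.resolve(name);
  }

  static void commit(Path temp, Path finalPath) throws java.io.IOException {
    // On POSIX filesystems, an atomic rename replaces the target without
    // ever destroying a file another worker still holds open for write.
    Files.move(temp, finalPath, StandardCopyOption.ATOMIC_MOVE);
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("recovered.edits");
    Path t1 = workerTempPath(dir, 7468971L, "wal-1727822221172", "rs-323");
    Path t2 = workerTempPath(dir, 7468971L, "wal-1727822221172", "rs-283");
    Files.writeString(t1, "edits from rs-323"); // distinct paths: no collision
    Files.writeString(t2, "edits from rs-283");
    commit(t2, dir.resolve("0000000000007468971")); // last committer wins
    System.out.println(Files.readString(dir.resolve("0000000000007468971")));
  }
}
{code}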