[ 
https://issues.apache.org/jira/browse/HBASE-28951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897075#comment-17897075
 ] 

Viraj Jasani edited comment on HBASE-28951 at 11/11/24 5:26 AM:
----------------------------------------------------------------

[~umesh9414] when exactly was the server abort logged relative to WAL 
splitting on the aborting server: before that server started splitting the 
WAL, or after?

If you could add the server abort error log, we can compare the timestamps of 
these events.
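
As a rough illustration of that comparison, here is a minimal Java sketch that parses the leading timestamp of two log lines and prints the gap between them. The class and method names (LogTimestampDiff, parseTimestamp, millisBetween) are hypothetical, and the timestamp layout is assumed from the log lines quoted in the issue description; the two sample lines are abbreviated copies of the RegionServerTracker warning and the rs-283 WALSplitter line from that description.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of HBase.
final class LogTimestampDiff {
    // Assumed layout of the leading timestamp, e.g. "2024-10-01 23:02:32,274".
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // The timestamp occupies the first 23 characters of each log line.
    static LocalDateTime parseTimestamp(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), FMT);
    }

    // Milliseconds from the first event to the second (negative if reversed).
    static long millisBetween(String earlier, String later) {
        return Duration.between(parseTimestamp(earlier), parseTimestamp(later)).toMillis();
    }

    public static void main(String[] args) {
        // Abbreviated log lines; only the timestamps matter here.
        String masterExpire = "2024-10-01 23:02:32,274 WARN [RegionServerTracker-0] ...";
        String splitStart = "2024-10-01 23:02:45,652 INFO [G_REPLAY_OPS-...] wal.WALSplitter - Splitting ...";
        System.out.println(millisBetween(masterExpire, splitStart) + " ms"); // prints "13378 ms"
    }
}
```

The same helper could be pointed at the server abort log line once it is available, to see whether the abort was logged before or after the split started.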


> WAL Split Delays Due to Concurrent WAL Splitting During worker RS Abort
> -----------------------------------------------------------------------
>
>                 Key: HBASE-28951
>                 URL: https://issues.apache.org/jira/browse/HBASE-28951
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.8
>            Reporter: Umesh Kumar Kumawat
>            Priority: Major
>
> When a worker RS is aborted after the SplitWALRemoteProcedure has been 
> dispatched, RegionServerTracker handles it and [aborts the pending 
> Operation|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java#L160]
>  on the aborting region server as part of 
> [expireServer|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionServerTracker.java#L172].
>  
> This lets the parent procedure, SplitWALProcedure, choose another worker 
> RS, but the aborting RS is still splitting the same WAL. While creating the 
> recovered edits, both servers try to write the same file. Whichever RS 
> starts later on a given file deletes the file the other RS created, which 
> causes the failures shown below. 
> h4. Logs
> RegionServerTracker marks the remote procedure as failed
> {code:java}
> 2024-10-01 23:02:32,274 WARN [RegionServerTracker-0] 
> procedure.SplitWALRemoteProcedure - Sent 
> hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172
>  to wrong server 
> regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936, try another
> org.apache.hadoop.hbase.DoNotRetryIOException: server not online 
> regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936
> at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:163)
> at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:61)
> at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:417)
> at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:201)
> at 
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:176)
> at 
> org.apache.hadoop.hbase.master.ServerManager.lambda$expireServer$2(ServerManager.java:576)
> at 
> java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at 
> org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:576)
> at 
> org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:530)
> at 
> org.apache.hadoop.hbase.master.RegionServerTracker.processAsActiveMaster(RegionServerTracker.java:172)
> at 
> org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:206)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750){code}
> {code:java}
> 2024-10-01 23:02:32,340 INFO [PEWorker-21] procedure2.ProcedureExecutor - 
> Finished pid=122448609, ppid=122448595, state=SUCCESS; 
> SplitWALRemoteProcedure 
> regionserver-33.regionserver.hbase.<cluster>,XXXXX%2C1727362162836.1727822221172,
>  worker=regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936 in 
> 54.0500 sec{code}
> The parent SplitWALProcedure then creates another SplitWALRemoteProcedure for this WAL 
> {code:java}
> 2024-10-01 23:02:32,726 WARN [PEWorker-17] procedure.SplitWALProcedure - 
> Failed to split wal 
> hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>,XXXXX%2C1727362162836.1727822221172
>  by server regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936, 
> retry...{code}
> {code:java}
> 2024-10-01 23:02:39,414 INFO [PEWorker-28] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[{pid=122452821, ppid=122448595, state=RUNNABLE; 
> SplitWALRemoteProcedure 
> regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172,
>  
> worker=regionserver-323.regionserver.hbase.<cluster>,XXXXX,1727308912906}]{code}
> Splitting is still in progress on the dying RS (rs-283) 
> {code:java}
> 2024-10-01 23:02:45,652 INFO 
> [G_REPLAY_OPS-regionserver/regionserver-283:XXXXX-0] wal.WALSplitter - 
> Splitting 
> hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172,
>  size=128.1 M (134313407bytes){code}
> rs-323 creates the recovered edits file
> {code:java}
> 2024-10-01 23:02:42,876 INFO 
> [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2] 
> monitor.StreamSlowMonitor - New stream slow monitor 
> 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> {code:java}
> 2024-10-01 23:02:43,171 INFO 
> [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2] 
> wal.RecoveredEditsOutputSink - Creating recovered edits writer 
> path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> rs-283 deletes the above file and creates it again 
> {code:java}
> 2024-10-01 23:02:50,520 WARN 
> [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] 
> wal.RecoveredEditsOutputSink - Found old edits file. It could be the result 
> of a previous failed split attempt. Deleting 
> hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp,
>  length=0{code}
> {code:java}
> 2024-10-01 23:02:50,794 INFO 
> [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] 
> monitor.StreamSlowMonitor - New stream slow monitor 
> 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> {code:java}
> 2024-10-01 23:02:51,135 INFO  
> [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2] 
> wal.RecoveredEditsOutputSink - Creating recovered edits writer 
> path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> Now rs-323 starts failing 
> {code:java}
> 2024-10-01 23:03:02,137 WARN  [Thread-1081409] hdfs.DataStreamer - 
> DataStreamer Exception
> java.io.FileNotFoundException: File does not exist: 
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.hbase1a.hbase.core2.aws-prod5-uswest2.aws.sfdc.is%2C60020%2C1727362162836.1727822221172.temp
>  (inode 1440741238) [Lease.  Holder: DFSClient_NONMAPREDUCE_-2039838105_1, 
> pending creates: 21]
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>     at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>     at 
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091)
>     at 
> org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939)
>     at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.hbase1a.hbase.core2.aws-prod5-uswest2.aws.sfdc.is%2C60020%2C1727362162836.1727822221172.temp
>  (inode 1440741238) [Lease.  Holder: DFSClient_NONMAPREDUCE_-2039838105_1, 
> pending creates: 21]
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1513)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>     at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495)
>     at 
> org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495)
>     at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>     at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
>     at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
>     at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
>     ... 3 more
> {code}
> {code:java}
> 2024-10-01 23:03:02,143 ERROR [split-log-closeStream-pool-1] 
> wal.RecoveredEditsOutputSink - Could not close recovered edits at 
> hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
> java.io.FileNotFoundException: File does not exist: 
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
>  (inode 1440741238) [Lease.  Holder: DFSClient_NONMAPREDUCE_-2039838105_1, 
> pending creates: 21]
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>     at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>     at 
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091)
>     at 
> org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939)
>     at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734)
>     at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File 
> does not exist: 
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
>  (inode 1440741238) [Lease.  Holder: DFSClient_NONMAPREDUCE_-2039838105_1, 
> pending creates: 21]
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1513)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>     at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495)
>     at 
> org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495)
>     at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>     at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
>     at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
>     at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088) 
> {code}
>  
> One more thing to note: the aborting RS starts splitting a little late. One 
> such case is shown below, where rs-219 is the aborting RS and rs-216 is the 
> second worker. The aborting RS started splitting about 50 seconds after the 
> second worker, even though the aborting RS was the one that received the RPC 
> request first.
> {code:java}
> 2024-10-01 23:02:15,499 INFO 
> [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - 
> Splitting 
> hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024,
>  size=92 (92bytes){code}
> {code:java}
> 2024-10-01 23:03:05,793 INFO 
> [G_REPLAY_OPS-regionserver/regionserver-219:XXXXX-1] wal.WALSplitter - 
> Splitting 
> hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024,
>  size=93.4 M (97950842bytes){code}
> {code:java}
> 2024-10-01 23:03:15,405 INFO 
> [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter - 
> Splitting 
> hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024,
>  size=93.4 M (97950842bytes){code}
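
The failure sequence in the quoted description (rs-323 creates the temp file, rs-283 deletes and recreates it, rs-323 then fails with FileNotFoundException) can be approximated with a small toy model. This is only a sketch under assumptions: ToyNamespace is a hypothetical in-memory stand-in, not the HDFS or HBase API, and it assumes a writer can extend a file only while the path still maps to the inode that writer created, which is how the "(inode 1440741238)" error in the logs reads.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the race; not HDFS or HBase code.
final class RecoveredEditsRaceSketch {
    /** In-memory stand-in for a namespace: path -> inode id. */
    static final class ToyNamespace {
        private final Map<String, Long> inodes = new HashMap<>();
        private long nextInode = 1;

        /** Models "Found old edits file ... Deleting": remove any existing file, then create. */
        long deleteIfExistsAndCreate(String path) {
            inodes.remove(path);
            long inode = nextInode++;
            inodes.put(path, inode);
            return inode;
        }

        /** Assumed rule: a writer may extend the file only while the path maps to its inode. */
        void addBlock(String path, long inode) throws FileNotFoundException {
            Long current = inodes.get(path);
            if (current == null || current != inode) {
                throw new FileNotFoundException(
                    "File does not exist: " + path + " (inode " + inode + ")");
            }
        }
    }

    /** Replays the order of events from the logs; returns true if the first writer fails. */
    static boolean firstWriterFailsAfterLateWriter() {
        ToyNamespace ns = new ToyNamespace();
        String temp = "recovered.edits/0000000000007468971-example.temp"; // illustrative path
        long newWorker = ns.deleteIfExistsAndCreate(temp);      // rs-323 creates the temp file
        long abortingWorker = ns.deleteIfExistsAndCreate(temp); // rs-283 deletes and recreates it
        try {
            ns.addBlock(temp, abortingWorker); // the late writer still owns the current inode
            ns.addBlock(temp, newWorker);      // the first writer's inode is gone
            return false;
        } catch (FileNotFoundException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("first writer fails: " + firstWriterFailsAfterLateWriter());
    }
}
```

In this model the writer that arrives second always wins, which matches the description: the new worker's split fails even though it was the one the master intended to do the work.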



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
