[
https://issues.apache.org/jira/browse/HBASE-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008109#comment-18008109
]
Umesh Kumar Kumawat commented on HBASE-29110:
---------------------------------------------
I think this is the stack trace -
{noformat}
2024-12-05 14:55:55,213 ERROR [split-log-closeStream-pool-1]
wal.RecoveredEditsOutputSink - Could not close recovered edits at
hdfs://hbase33a/hbase/data/default/tsdb-uid/6433937152984a5031c15d94b6f0a507/recovered.edits/0000000000072643127-regionserver-53.XXX%2C60020%2C1730886070174.1733408631854.tempjava.io.InterruptedIOException:
No ack received after 25s and a timeout of 25s
at
org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938)
at
org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:778)
at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:888)
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:847)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:70)
at
org.apache.hadoop.hbase.wal.AbstractRecoveredEditsOutputSink.closeRecoveredEditsWriter(AbstractRecoveredEditsOutputSink.java:79)
at
org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink.lambda$closeWriters$0(RecoveredEditsOutputSink.java:109)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{noformat}
Version that we were using was a little different from now. I think it was like
[this|https://github.com/apache/hbase/blame/fd0803a96b153e5e60504ee54d155ade4deb17cc/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/RecoveredEditsOutputSink.java#L105-L115]
. At least HBASE-28569 is introduced after that.
> HMaster SplitWalProcedure stays RUNNABLE state infinitely if Worker
> Regionserver doesn't reports back SplitWALRemoteProcedure RPC response to
> HMaster
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29110
> URL: https://issues.apache.org/jira/browse/HBASE-29110
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 3.0.0-beta-1, 2.6.1, 2.5.10
> Reporter: Nirdosh Kumar Yadav
> Assignee: Umesh Kumar Kumawat
> Priority: Minor
>
> In an HBase cluster, we encountered a scenario where the RegionServer crash
> procedure (SCP) took for more than 3 hours to complete. The incident was
> triggered by temporary network unavailability in the HBase cluster. Upon
> debugging, we found that the SCP was stuck due to the child
> SplitWALProcedure, which was waiting for the completion of the SplitWALRemote
> procedure by the RegionServer worker.
> While executing, the SplitWALRemote procedure encountered an unknown
> exception. The logs show an error message: "hdfs.DataStreamer - No ack
> received," indicating an issue with the RegionServer's connection to the
> DataNode. After this error, the thread either became stuck or terminated, as
> no related logs were available.
> During this period, inconsistent regions were reported. Worker RS was healthy
> and didn't reported any signs of abort or unhealthy. All procedures were
> restarted and completed successfully after the Active HMaster service was
> bounced.
> Here if we can setTimeout(<backoff>) for retrying the SplitWAL procedure,
> Procedure will wake up and status can be checked paused procedure
> Related Bug: HBASE-29041 on RegionServer component.
> Related logs:
> [HMASTER-4]
> 2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor -
> Initialized subprocedures=[
> {pid=6003288, ppid=6002575, state=RUNNABLE;
> SplitWALRemoteProcedureregionserver-53.regionserver.hbase.hbasexxxxxx.
> %2Cxxxxx%2C1730886070174.1733410178680,
> worker=regionserver-15.regionserver.xxxxx,1730878238028}
> ]
> [RS-15]
> 2024-12-05 14:55:11,461 DEBUG
> [iority.RWQ.Fifo.read.handler=83,queue=2,port=xxxx]
> regionserver.RSRpcServices - Executing remote procedure
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0]
> hdfs.DataStreamer - No ack received, took 25002ms (threshold=25000ms). File
> being written:
> /hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbasexxa.xxxxx.is%2Cxxxxx%2C1730886070174.1733410178680.temp,
> block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451,
> Write pipeline datanodes:
> [DatanodeInfoWithStorage[10.60.52.107:xxxxx,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD],
>
> DatanodeInfoWithStorage[10.60.75.52:xxxxxx,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].
> [HMASTER-1]
> 2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster]
> store.ProcedureTree -Procedure Procedure(pid=6003288, ppid=6002575,
> class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack
> ids=[3592]
> [HMASTER-1]
> 2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster]
> procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575,
> state=RUNNABLE; SplitWALRemoteProcedure
> regionserver-53.regionserver.hbase.xxxxxx.is%2C60020%2C1730886070174.1733410178680,
>
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxx.is,xxxxx,1730878238028
> [RS-15]
> 2024-12-05 18:11:42,769 DEBUG
> [iority.RWQ.Fifo.read.handler=79,queue=7,port=60020]
> regionserver.RSRpcServices - Executing remote procedure
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 18:11:48,247 DEBUG
> [_REPLAY_OPS-regionserver/regionserver-15:xxxxx-192]
> regionserver.RemoteProcedureResultReporter - Successfully complete execution
> of pid=6003288
> [HMASTER-1]
> 2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor -
> Finished pid=6002575, ppid=6000775, state=SUCCESS;
> SplitWALProcedureregionserver-53.regionserver.hbase.hbasexxa.xxxxxxx.is%2Cxxxxx%2C1730886070174.1733410178680,
>
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxxx,xxxxxx,1730878238028
> in 3 hrs, 16 mins, 52.806 sec
--
This message was sent by Atlassian Jira
(v8.20.10#820010)