[ 
https://issues.apache.org/jira/browse/HBASE-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008109#comment-18008109
 ] 

Umesh Kumar Kumawat commented on HBASE-29110:
---------------------------------------------

I think this is the stack trace - 

{noformat}
2024-12-05 14:55:55,213 ERROR [split-log-closeStream-pool-1] 
wal.RecoveredEditsOutputSink - Could not close recovered edits at 
hdfs://hbase33a/hbase/data/default/tsdb-uid/6433937152984a5031c15d94b6f0a507/recovered.edits/0000000000072643127-regionserver-53.XXX%2C60020%2C1730886070174.1733408631854.tempjava.io.InterruptedIOException:
 No ack received after 25s and a timeout of 25s
 at 
org.apache.hadoop.hdfs.DataStreamer.waitForAckedSeqno(DataStreamer.java:938) 
 at 
org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:778)
 at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:888)
 at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:847)
 at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
 at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:70)
 at 
org.apache.hadoop.hbase.wal.AbstractRecoveredEditsOutputSink.closeRecoveredEditsWriter(AbstractRecoveredEditsOutputSink.java:79)
 at 
org.apache.hadoop.hbase.wal.RecoveredEditsOutputSink.lambda$closeWriters$0(RecoveredEditsOutputSink.java:109)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:750)
{noformat}

Version that we were using was a little different from now. I think it was like 
[this|https://github.com/apache/hbase/blame/fd0803a96b153e5e60504ee54d155ade4deb17cc/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/RecoveredEditsOutputSink.java#L105-L115]
 . At least HBASE-28569 is introduced after that. 

> HMaster SplitWalProcedure stays RUNNABLE state infinitely if Worker 
> Regionserver doesn't reports back SplitWALRemoteProcedure RPC response to 
> HMaster
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29110
>                 URL: https://issues.apache.org/jira/browse/HBASE-29110
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 3.0.0-beta-1, 2.6.1, 2.5.10
>            Reporter: Nirdosh Kumar Yadav
>            Assignee: Umesh Kumar Kumawat
>            Priority: Minor
>
> In an HBase cluster, we encountered a scenario where the RegionServer crash 
> procedure (SCP) took for more than 3 hours to complete. The incident was 
> triggered by temporary network unavailability in the HBase cluster. Upon 
> debugging, we found that the SCP was stuck due to the child 
> SplitWALProcedure, which was waiting for the completion of the SplitWALRemote 
> procedure by the RegionServer worker.
> While executing, the SplitWALRemote procedure encountered an unknown 
> exception. The logs show an error message: "hdfs.DataStreamer - No ack 
> received," indicating an issue with the RegionServer's connection to the 
> DataNode. After this error, the thread either became stuck or terminated, as 
> no related logs were available.
> During this period, inconsistent regions were reported. Worker RS was healthy 
> and didn't reported any signs of abort or unhealthy. All procedures were 
> restarted and completed successfully after the Active HMaster service was 
> bounced.
> Here if we can setTimeout(<backoff>) for retrying the SplitWAL procedure, 
> Procedure will wake up and status can be checked paused procedure
> Related Bug: HBASE-29041 on RegionServer component.
> Related logs:
> [HMASTER-4]
> 2024-12-05 14:55:11,264 INFO [PEWorker-41] procedure2.ProcedureExecutor - 
> Initialized subprocedures=[
> {pid=6003288, ppid=6002575, state=RUNNABLE; 
> SplitWALRemoteProcedureregionserver-53.regionserver.hbase.hbasexxxxxx. 
> %2Cxxxxx%2C1730886070174.1733410178680, 
> worker=regionserver-15.regionserver.xxxxx,1730878238028}
> ]
> [RS-15]
> 2024-12-05 14:55:11,461 DEBUG 
> [iority.RWQ.Fifo.read.handler=83,queue=2,port=xxxx] 
> regionserver.RSRpcServices - Executing remote procedure 
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 14:55:54,689 ERROR [split-log-closeStream-pool-0] 
> hdfs.DataStreamer - No ack received, took 25002ms (threshold=25000ms). File 
> being written: 
> /hbase/data/default/tsdb/c997c5f8dd36481dcd3ebb9b79a35b51/recovered.edits/0000000000539451088-regionserver-53.regionserver.hbase.hbasexxa.xxxxx.is%2Cxxxxx%2C1730886070174.1733410178680.temp,
>  block: BP-1745262640-10.60.130.13-1712173738392:blk_1330710217_257120451, 
> Write pipeline datanodes: 
> [DatanodeInfoWithStorage[10.60.52.107:xxxxx,DS-f2b7ba1a-68b5-433a-9fe8-99315a172098,SSD],
>  
> DatanodeInfoWithStorage[10.60.75.52:xxxxxx,DS-93a433be-972f-4457-92ae-dd07288e41b5,SSD]].
> [HMASTER-1]
> 2024-12-05 18:11:42,036 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster] 
> store.ProcedureTree -Procedure Procedure(pid=6003288, ppid=6002575, 
> class=org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure) stack 
> ids=[3592]
> [HMASTER-1]
> 2024-12-05 18:11:42,214 DEBUG [master/hmaster-1:xxxxx:becomeActiveMaster] 
> procedure2.ProcedureExecutor - Loading pid=6003288, ppid=6002575, 
> state=RUNNABLE; SplitWALRemoteProcedure 
> regionserver-53.regionserver.hbase.xxxxxx.is%2C60020%2C1730886070174.1733410178680,
>  
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxx.is,xxxxx,1730878238028
> [RS-15]
> 2024-12-05 18:11:42,769 DEBUG 
> [iority.RWQ.Fifo.read.handler=79,queue=7,port=60020] 
> regionserver.RSRpcServices - Executing remote procedure 
> classorg.apache.hadoop.hbase.regionserver.SplitWALCallable, pid=6003288
> [RS-15]
> 2024-12-05 18:11:48,247 DEBUG 
> [_REPLAY_OPS-regionserver/regionserver-15:xxxxx-192] 
> regionserver.RemoteProcedureResultReporter - Successfully complete execution 
> of pid=6003288
> [HMASTER-1]
> 2024-12-05 18:11:48,304 INFO [PEWorker-2] procedure2.ProcedureExecutor - 
> Finished pid=6002575, ppid=6000775, state=SUCCESS; 
> SplitWALProcedureregionserver-53.regionserver.hbase.hbasexxa.xxxxxxx.is%2Cxxxxx%2C1730886070174.1733410178680,
>  
> worker=regionserver-15.regionserver.hbase.hbasexxa.xxxxxx,xxxxxx,1730878238028
>  in 3 hrs, 16 mins, 52.806 sec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to