[jira] [Commented] (HBASE-29786) The replication source totalBufferUsed fails to be released, causing replication blocking

Longping Jie (Jira) Thu, 18 Dec 2025 19:04:10 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-29786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046480#comment-18046480
 ]


Longping Jie commented on HBASE-29786:
--------------------------------------

[~liuxiaocs] Hi. In the clearWALEntryBatch method of ReplicationSourceShipper, 
you need to wait until both the shipper and entryReader threads are no longer 
alive before releasing the buffer quota. If the shipper or entryReader thread 
has not finished yet, could that cause other problems? For example, could the 
quota be released more than once?
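
To make the double-release concern concrete, here is a minimal sketch, not 
HBase code. The class GuardedQuotaRelease, the Batch type and the releaseOnce 
method are hypothetical names used only for illustration. It assumes each 
batch carries its own "released" flag, so a second cleanup attempt from 
another thread becomes a no-op instead of subtracting the same bytes twice.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of an idempotent quota release; not HBase code. */
public class GuardedQuotaRelease {
  /** Stand-in for the shared totalBufferUsed counter. */
  static final AtomicLong totalBufferUsed = new AtomicLong();

  /** A batch that remembers whether its quota was already returned. */
  static final class Batch {
    final long heapSize;
    private final AtomicBoolean released = new AtomicBoolean(false);

    Batch(long heapSize) {
      this.heapSize = heapSize;
      totalBufferUsed.addAndGet(heapSize); // quota is taken when the batch is built
    }

    /** Returns true only for the first caller; later calls change nothing. */
    boolean releaseOnce() {
      if (released.compareAndSet(false, true)) {
        totalBufferUsed.addAndGet(-heapSize);
        return true;
      }
      return false;
    }
  }

  public static void main(String[] args) {
    Batch batch = new Batch(1024);
    System.out.println(batch.releaseOnce());   // true: quota returned
    System.out.println(batch.releaseOnce());   // false: already returned
    System.out.println(totalBufferUsed.get()); // 0
  }
}
```

With a guard like this, the order in which the threads stop would matter less 
for the counter itself, although whether it fits the actual shipper code is a 
separate question.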

> The replication source totalBufferUsed fails to be released, causing 
> replication blocking
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-29786
>                 URL: https://issues.apache.org/jira/browse/HBASE-29786
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.6.2
>            Reporter: Longping Jie
>            Priority: Major
>
>  Cluster A replicates to cluster B. To control the replication rate, the 
> ReplicationSourceManager class keeps an atomic variable, totalBufferUsed, 
> and provides the acquireBufferQuota and releaseBufferQuota methods to add to 
> and subtract from it. In this case the amount added to totalBufferUsed is 
> never subtracted again, so totalBufferUsed stays above totalBufferLimit and 
> the WAL reader loops indefinitely. The stack information is as follows:
> "RS_REFRESH_PEER-regionserver/hbase-3:16020-0.replicationSource,hbaseOnline.replicationSource.shipperhbase-3%2C16020%2C1754317255615,hbaseOnline" #738204104 daemon prio=5 os_prio=0 tid=0x0000000049d84800 nid=0x14ce2 waiting on condition [0x00007f01feceb000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007f17f0679610> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>         at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.poll(ReplicationSourceWALReader.java:313)
>         at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.poll(SerialReplicationSourceWALReader.java:35)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:109)
> "RS_REFRESH_PEER-regionserver/hbase-3:16020-0.replicationSource,hbaseOnline.replicationSource.wal-reader.hbase-3%2C16020%2C1754317255615,hbaseOnline" #738204105 daemon prio=5 os_prio=0 tid=0x0000000049df0000 nid=0x14ce1 waiting on condition [0x00007f024f6f7000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:125)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.checkBufferQuota(ReplicationSourceWALReader.java:279)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:149)
>         at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35)
> Error log:
> 2025-12-18T15:43:21,817 WARN  [RS_REFRESH_PEER-regionserver/hbase-3:16020-0.replicationSource,hbaseOnline.replicationSource.wal-reader.hbase-3%2C16020%2C1754317255615,hbaseOnline] regionserver.ReplicationSourceManager: peer=hbaseOnline, can't read more edits from WAL as buffer usage 268445954B exceeds limit 268435456B
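
The accounting described in the quoted issue can be illustrated with a small 
sketch. This is a simplified model under stated assumptions, not the HBase 
implementation: the class BufferQuotaSketch and its acquire, release and 
canRead methods are hypothetical names, and the only figures taken from the 
report are the usage and limit values in the log line. It shows why a reader 
that checks the counter before each read stays blocked when the matching 
release never happens.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified, hypothetical model of a shared buffer quota; not HBase code. */
public class BufferQuotaSketch {
  private final AtomicLong totalBufferUsed = new AtomicLong();
  private final long totalBufferLimit;

  BufferQuotaSketch(long totalBufferLimit) {
    this.totalBufferLimit = totalBufferLimit;
  }

  /** Adds the batch size to the shared counter. */
  void acquire(long size) {
    totalBufferUsed.addAndGet(size);
  }

  /** Subtracts the batch size once the batch is shipped or discarded. */
  void release(long size) {
    totalBufferUsed.addAndGet(-size);
  }

  /** The reader may proceed only while usage does not exceed the limit. */
  boolean canRead() {
    return totalBufferUsed.get() <= totalBufferLimit;
  }

  long used() {
    return totalBufferUsed.get();
  }

  public static void main(String[] args) {
    // Limit and usage values taken from the log line in the report.
    BufferQuotaSketch quota = new BufferQuotaSketch(268435456L);
    quota.acquire(268445954L);

    // If release is never called, every later check fails and the reader
    // keeps sleeping and retrying.
    System.out.println(quota.canRead()); // false

    // Only the matching release lets the reader continue.
    quota.release(268445954L);
    System.out.println(quota.canRead()); // true
  }
}
```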



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
