[
https://issues.apache.org/jira/browse/HBASE-29786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18046480#comment-18046480
]
Longping Jie commented on HBASE-29786:
--------------------------------------
[~liuxiaocs] Hi. In the clearWALEntryBatch method of ReplicationSourceShipper,
the usage quota should only be cleaned up once both the shipper and entryReader
threads are no longer alive. If the shipper or entryReader thread has not
finished yet, could that cause other problems? For example, could the usage
quota be released more than once?
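
To make the double-release concern concrete, here is a minimal sketch of one
way to make the release idempotent. BufferQuota and TrackedBatch are
hypothetical names used only for illustration; they are not HBase classes, and
this is not necessarily how the patch handles it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the shared buffer accounting; not the HBase class. */
final class BufferQuota {
  private final AtomicLong totalBufferUsed = new AtomicLong();

  void acquire(long size) {
    totalBufferUsed.addAndGet(size);
  }

  void release(long size) {
    totalBufferUsed.addAndGet(-size);
  }

  long used() {
    return totalBufferUsed.get();
  }
}

/** Hypothetical batch handle that gives its quota back at most once. */
final class TrackedBatch {
  private final BufferQuota quota;
  private final long size;
  private final AtomicBoolean released = new AtomicBoolean(false);

  TrackedBatch(BufferQuota quota, long size) {
    this.quota = quota;
    this.size = size;
    quota.acquire(size);
  }

  /** Returns true only for the single call that actually released the quota. */
  boolean releaseOnce() {
    // compareAndSet lets exactly one caller win, even if the shipper and
    // a cleanup path race to release the same batch.
    if (released.compareAndSet(false, true)) {
      quota.release(size);
      return true;
    }
    return false;
  }
}
```

With a guard like this, a second cleanup of the same batch becomes a no-op
instead of pushing the counter below its true value.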
> The replication source totalBufferUsed fails to be released, causing
> replication blocking
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-29786
> URL: https://issues.apache.org/jira/browse/HBASE-29786
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.6.2
> Reporter: Longping Jie
> Priority: Major
>
> Cluster A has replication to cluster B enabled. To control the replication
> rate, the ReplicationSourceManager class keeps an atomic variable,
> totalBufferUsed, and provides the acquireBufferQuota and releaseBufferQuota
> methods to add to or subtract from it. In this case, an amount added to
> totalBufferUsed is never subtracted again, so totalBufferUsed stays above
> totalBufferLimit, resulting in an endless loop. The stack information is as
> follows:
> "RS_REFRESH_PEER-regionserver/hbase-3:16020-0.replicationSource,hbaseOnline.replicationSource.shipperhbase-3%2C16020%2C1754317255615,hbaseOnline" #738204104 daemon prio=5 os_prio=0 tid=0x0000000049d84800 nid=0x14ce2 waiting on condition [0x00007f01feceb000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00007f17f0679610> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.poll(ReplicationSourceWALReader.java:313)
> at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.poll(SerialReplicationSourceWALReader.java:35)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:109)
> "RS_REFRESH_PEER-regionserver/hbase-3:16020-0.replicationSource,hbaseOnline.replicationSource.wal-reader.hbase-3%2C16020%2C1754317255615,hbaseOnline" #738204105 daemon prio=5 os_prio=0 tid=0x0000000049df0000 nid=0x14ce1 waiting on condition [0x00007f024f6f7000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:125)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.checkBufferQuota(ReplicationSourceWALReader.java:279)
> at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:149)
> at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35)
> error log:
> 2025-12-18T15:43:21,817 WARN [RS_REFRESH_PEER-regionserver/hbase-3:16020-0.replicationSource,hbaseOnline.replicationSource.wal-reader.hbase-3%2C16020%2C1754317255615,hbaseOnline] regionserver.ReplicationSourceManager: peer=hbaseOnline, can't read more edits from WAL as buffer usage 268445954B exceeds limit 268435456B
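
The accounting described in the quoted report can be modelled roughly as
follows. This is a simplified sketch, not the HBase implementation: the class
BufferAccountingSketch and its method signatures are assumptions made for
illustration, and only the names totalBufferUsed, totalBufferLimit,
acquireBufferQuota, releaseBufferQuota and checkBufferQuota come from the
report and stack trace.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of the buffer accounting in the report; not the HBase code. */
final class BufferAccountingSketch {
  private final AtomicLong totalBufferUsed = new AtomicLong();
  private final long totalBufferLimit;

  BufferAccountingSketch(long totalBufferLimit) {
    this.totalBufferLimit = totalBufferLimit;
  }

  /** Called when a batch of WAL entries is buffered for shipping. */
  void acquireBufferQuota(long size) {
    totalBufferUsed.addAndGet(size);
  }

  /** Must be called exactly once for every acquired batch. */
  void releaseBufferQuota(long size) {
    totalBufferUsed.addAndGet(-size);
  }

  /** The reader may read more edits only while usage does not exceed the limit. */
  boolean checkBufferQuota() {
    return totalBufferUsed.get() <= totalBufferLimit;
  }

  long used() {
    return totalBufferUsed.get();
  }
}
```

In this model, if a release is skipped, checkBufferQuota keeps returning false
and a reader that sleeps and retries never makes progress. With the values from
the log line above (usage 268445954B, limit 268435456B), usage exceeds the
limit by 10498 bytes, so nothing more is read until that quota is released.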