[ https://issues.apache.org/jira/browse/HBASE-29320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008585#comment-18008585 ]
haosen chen commented on HBASE-29320:
-------------------------------------

According to the log, region e3fdadb00255826881021b3baf97e976 is currently trying to push the edit with sequence ID 1590, which is greater than the new barrier 1585. The RS must wait until the pushed seqId reaches 1585.

The likely cause is that region e3fdadb00255826881021b3baf97e976 came back online. When a region comes online, a new barrier is set based on the StoreFiles and WAL to separate the stages of serial replication. Before the region went offline, some edits had not yet been replicated to cluster A, so the last pushed seqId recorded in ZooKeeper should be less than 1585. You can confirm this by enabling TRACE logging for org.apache.hadoop.hbase.zookeeper.ZKUtil.

If you have set up bidirectional replication, it is most likely this problem: https://issues.apache.org/jira/browse/HBASE-29463

> Serial replication blocking in SerialReplicationChecker#waitUntilCanPush
> ------------------------------------------------------------------------
>
>                 Key: HBASE-29320
>                 URL: https://issues.apache.org/jira/browse/HBASE-29320
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.6.2
>         Environment: Cluster A version 2.2.7, Cluster B version 2.6.1
>                      B -> A replication is enabled
>            Reporter: Longping Jie
>            Priority: Major
>         Attachments: 9398.jstack, image-2025-05-16-09-50-32-605.png, image-2025-05-16-09-52-25-251.png
>
>
> We have two HBase clusters with replication enabled and serial replication set up.
> Cluster A version 2.2.7, Cluster B version 2.6.1
> Replication worked normally for a long time after it was enabled. Then, without any warning, the B -> A replication source queue became blocked, as shown:
> !image-2025-05-16-09-50-32-605.png|width=986,height=454!
> I randomly selected a node to check the replication delay, as shown:
> !image-2025-05-16-09-52-25-251.png|width=993,height=190!
> The blocking code location is shown in the following stack:
> {code:java}
> "regionserver/hbase-10:16020.replicationSource.shipperhbase-10%2C16020%2C1747312708324,peerId" #627 daemon prio=5 os_prio=0 tid=0x00007f9400b8a800 nid=0x2cf8 waiting on condition [0x00007f44470d9000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00007f89fff3e820> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>     at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.poll(ReplicationSourceWALReader.java:313)
>     at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.poll(SerialReplicationSourceWALReader.java:35)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:109)
>
> "regionserver/hbase-10:16020.replicationSource.wal-reader.hbase-10%2C16020%2C1747312708324,newHbase227" #628 daemon prio=5 os_prio=0 tid=0x00007f9400b8c800 nid=0x2cf6 waiting on condition [0x00007f44473da000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.waitUntilCanPush(SerialReplicationChecker.java:270)
>     at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:89)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177)
>     at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35)
>
> "AsyncFSWAL-0-hdfs://coreHBaseProdHa/hbase-prefix:hbase-10,16020,1747312708324" #626 daemon prio=5 os_prio=0 tid=0x00007f9400b87000 nid=0x2cf5 waiting on condition [0x00007f44474db000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x00007f89ffe20038> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>     at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
>
> With DEBUG logging enabled for the SerialReplicationChecker class, the logs are as follows:
> 2025-05-15T21:53:29,978 DEBUG [regionserver/hbase-10:16020.replicationSource.wal-reader.hbase-10%2C16020%2C1747312708324,peerId] regionserver.SerialReplicationChecker: Replication barrier for ad-instation/e3fdadb00255826881021b3baf97e976/1590=[#edits: 0 = <>]: ReplicationBarrierResult [barriers=[1585, 1589], state=OPEN, parentRegionNames=]
> 2025-05-15T21:53:29,979 DEBUG [regionserver/hbase-10:16020.replicationSource.wal-reader.hbase-10%2C16020%2C1747312708324,peerId] regionserver.SerialReplicationChecker: Previous range for ad-instation/e3fdadb00255826881021b3baf97e976/1590=[#edits: 0 = <>] has not been finished yet, give up
> 2025-05-15T21:53:29,979 DEBUG [regionserver/hbase-10:16020.replicationSource.wal-reader.hbase-10%2C16020%2C1747312708324,peerId] regionserver.SerialReplicationChecker: Can not push ad-instation/e3fdadb00255826881021b3baf97e976/1590=[#edits: 0 = <>], wait
>
> I don't know why replication keeps getting stuck inside SerialReplicationChecker#waitUntilCanPush:
> {code:java}
> public void waitUntilCanPush(Entry entry, Cell firstCellInEdit)
>     throws IOException, InterruptedException {
>   byte[] row = CellUtil.cloneRow(firstCellInEdit);
>   while (!canPush(entry, row)) {
>     LOG.debug("Can not push {}, wait", entry);
>     Thread.sleep(waitTimeMs);
>   }
> }
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)