[ 
https://issues.apache.org/jira/browse/HBASE-29499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-29499:
-----------------------------------
    Labels: pull-request-available  (was: )

> Serial replication stuck pushing entry with seqId equal to barrier
> ------------------------------------------------------------------
>
>                 Key: HBASE-29499
>                 URL: https://issues.apache.org/jira/browse/HBASE-29499
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.6.2
>            Reporter: Tomas
>            Priority: Major
>              Labels: pull-request-available
>
> HBase version: 2.6.2-hadoop3, 
> revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17 
> h1. Problem
> On several test HBase clusters with serial replication enabled, where 
> regionservers frequently crash or shut down non-gracefully, we found that the 
> WAL can contain entries whose seqId equals a barrier in the meta table, e.g. 
> barriers for region X = [2, 5, 6], an entry for region X with seqId = 6 (equal 
> to the barrier with value 6), and pushedSeqId = 4 (seqId - 2).
>  
> When {_}SerialReplicationChecker{_} checks whether those entries can be 
> pushed, _canPush_ returns false, causing replication to block indefinitely.
>  
> Example 1:
> {{2025-07-22T16:12:06,070 DEBUG 
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
>  regionserver.SerialReplicationChecker: Replication barrier for 
> test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/{*}39{*}=[#edits: 0 = <>]: 
> ReplicationBarrierResult [{*}barriers=[9, 17, 25, 28, 31, 34, 38, 39{*}], 
> state=OPEN, parentRegionNames=]}}
> {{2025-07-22T16:12:06,072 DEBUG 
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
>  regionserver.SerialReplicationChecker: *Previous range for 
> test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>] has not been 
> finished yet, give up*}}
> {{2025-07-22T16:12:06,072 DEBUG 
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
>  regionserver.SerialReplicationChecker: Can not push 
> test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>], wait}}
>  
>  * barriers=[9, 17, 25, 28, 31, 34, 38, 39]
>  * Entry is for HBASE::REGION_EVENT::REGION_OPEN with seqid=39, in a range 
> that is *not the last* one (the replication queue has been claimed).
>  * pushedSeqId=37
>  
> The barrier used for the previous range is 39 instead of 38, so the check 
> pushedSeqId >= barrier-1 (37 >= 39-1) is false.
>  
> See 
> [https://docs.google.com/document/d/1iB2xopSoC2IRHR8wmbGX5cmaS0RKsdFJiKeJ7EyLzeg]
>  for more supporting information (zookeeper state, WALs).
>  
> Example 2:
>  
> {{2025-08-05T07:43:53,198 DEBUG 
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
>  {}] regionserver.SerialReplicationChecker: Replication barrier for 
> aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/{*}650974464{*}=[#edits: 0 = <>]: 
> ReplicationBarrierResult [barriers=[649436971, {*}650974464{*}, 650990494, 
> 651037843, 651092522, 651096754, 651118516, 651147941, 651173589], 
> state=OPEN, parentRegionNames=]}}
> {{2025-08-05T07:43:53,199 DEBUG 
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
>  {}] regionserver.SerialReplicationChecker: *Previous range for 
> aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>] has not 
> been finished yet, give up*}}
> {{2025-08-05T07:43:53,199 DEBUG 
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
>  {}] regionserver.SerialReplicationChecker: Can not push 
> aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>], wait}}
>  
>  * barriers=[649436971, 650974464, 650990494, …]
>  * Entry has seqid=650974464, in a range that is *not the last* one (the 
> replication queue has been claimed).
>  * pushedSeqId=650974462
>  
> The barrier used for the previous range is 650974464 instead of 649436971, so 
> the check pushedSeqId >= barrier-1 (650974462 >= 650974464-1) is false.
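
The arithmetic in both examples can be reproduced with a small standalone model. This is only a sketch: it assumes the checker locates the entry's range with a binary search over the sorted barrier list and treats an exact hit as the start of a new range. The class and method names are hypothetical and are not taken from the HBase source.

```java
import java.util.Arrays;

public class SerialCheckSketch {

  /**
   * Returns the barrier that pushedSeqId is compared against for an entry
   * with the given seqId, or -1 if seqId lies before the first barrier.
   * Assumes barriers is sorted in ascending order.
   */
  static long checkedBarrier(long[] barriers, long seqId) {
    int index = Arrays.binarySearch(barriers, seqId);
    if (index == -1) {
      return -1; // before the first barrier (or no barriers at all)
    }
    // Miss: use the insertion point. Hit: use position + 1, i.e. the entry
    // is treated as belonging to the range that starts at that barrier.
    index = index < 0 ? -index - 1 : index + 1;
    return barriers[index - 1];
  }

  /** Models the comparison quoted in the examples: pushedSeqId >= barrier - 1. */
  static boolean previousRangeFinished(long[] barriers, long seqId, long pushedSeqId) {
    long barrier = checkedBarrier(barriers, seqId);
    return barrier == -1 || pushedSeqId >= barrier - 1;
  }

  public static void main(String[] args) {
    // Example 1 from the logs
    long[] ex1 = {9, 17, 25, 28, 31, 34, 38, 39};
    System.out.println(checkedBarrier(ex1, 39));            // 39
    System.out.println(previousRangeFinished(ex1, 39, 37)); // false

    // Example 2 from the logs
    long[] ex2 = {649436971L, 650974464L, 650990494L, 651037843L, 651092522L,
                  651096754L, 651118516L, 651147941L, 651173589L};
    System.out.println(checkedBarrier(ex2, 650974464L));                        // 650974464
    System.out.println(previousRangeFinished(ex2, 650974464L, 650974462L));     // false
  }
}
```

In both cases the seqId is an exact hit in the barrier list, so the comparison uses the entry's own seqId as the barrier and requires pushedSeqId >= seqId-1, which does not hold when pushedSeqId is seqId-2.
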
> h1. Impact
> Replication is blocked indefinitely for regions that contain the problematic 
> entry.
> Entries with a higher seqId than the problematic entry cannot be replicated 
> either, because the previous range(s) are never finished.
> The _sizeoflogqueue_ metric grows indefinitely as data is written to the 
> region(s) and WALs are rolled. 
> h1. Workarounds
> No clean workaround. 
> The options are to turn off serial mode and replicate non-serially, or to 
> remove and re-add the peer to restart replication (this leaves a gap in the 
> replicated data).


