[
https://issues.apache.org/jira/browse/HBASE-29499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HBASE-29499:
-----------------------------------
Labels: pull-request-available (was: )
> Serial replication stuck pushing entry with seqId equal to barrier
> ------------------------------------------------------------------
>
> Key: HBASE-29499
> URL: https://issues.apache.org/jira/browse/HBASE-29499
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.6.2
> Reporter: Tomas
> Priority: Major
> Labels: pull-request-available
>
> HBase version: 2.6.2-hadoop3,
> revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17
> h1. Problem
> On several test HBase clusters with serial replication enabled, where
> regionservers frequently crash or shut down non-gracefully, we found that the
> WAL can contain entries whose seqId equals a barrier in the meta table. For
> example: barriers for region X = [2, 5, 6], an entry for region X with
> seqId = 6 (equal to the barrier with value 6), and pushedSeqId = 4 (seqId - 2).
>
> When {_}SerialReplicationChecker{_} checks whether such an entry can be
> pushed, _canPush_ returns false, causing replication to block indefinitely.
>
> Example 1:
> {{2025-07-22T16:12:06,070 DEBUG
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
> regionserver.SerialReplicationChecker: Replication barrier for
> test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/{*}39{*}=[#edits: 0 = <>]:
> ReplicationBarrierResult [{*}barriers=[9, 17, 25, 28, 31, 34, 38, 39{*}],
> state=OPEN, parentRegionNames=]}}
> {{2025-07-22T16:12:06,072 DEBUG
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
> regionserver.SerialReplicationChecker: *Previous range for
> test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>] has not been
> finished yet, give up*}}
> {{2025-07-22T16:12:06,072 DEBUG
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
> regionserver.SerialReplicationChecker: Can not push
> test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>], wait}}
>
> * barriers=[9, 17, 25, 28, 31, 34, 38, 39]
> * The entry is an HBASE::REGION_EVENT::REGION_OPEN with seqId=39 and is
> *not in the last* range (the replication queue is claimed).
> * pushedSeqId=37
>
> The end barrier of the previous range is calculated as 39 instead of 38, so
> the check 37 >= 39-1 is false.
>
> See
> [https://docs.google.com/document/d/1iB2xopSoC2IRHR8wmbGX5cmaS0RKsdFJiKeJ7EyLzeg]
> for more supporting information (zookeeper state, WALs).
>
> Example 2:
>
> {{2025-08-05T07:43:53,198 DEBUG
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
> {}] regionserver.SerialReplicationChecker: Replication barrier for
> aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/{*}650974464{*}=[#edits: 0 = <>]:
> ReplicationBarrierResult [barriers=[649436971, {*}650974464{*}, 650990494,
> 651037843, 651092522, 651096754, 651118516, 651147941, 651173589],
> state=OPEN, parentRegionNames=]}}
> {{2025-08-05T07:43:53,199 DEBUG
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
> {}] regionserver.SerialReplicationChecker: *Previous range for
> aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>] has not
> been finished yet, give up*}}
> {{2025-08-05T07:43:53,199 DEBUG
> [RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
> {}] regionserver.SerialReplicationChecker: Can not push
> aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>], wait}}
>
> * barriers=[649436971, 650974464, 650990494, …]
> * The entry has seqId=650974464 and is *not in the last* range (the
> replication queue is claimed).
> * pushedSeqId=650974462
>
> The end barrier of the previous range is calculated as 650974464 instead of
> 649436971, so the check 650974462 >= 650974464-1 is false.
> h1. Impact
> Replication is blocked indefinitely for regions that contain the problematic
> entry.
> Entries with a higher seqId than the problematic entry cannot be replicated
> because the previous range(s) are never considered finished.
> The _sizeoflogqueue_ metric grows indefinitely as data is written to the
> region(s) and WALs are rolled.
> h1. Workarounds
> N/A.
> Either turn off serial mode and replicate non-serially, or remove and re-add
> the peer to restart replication (this leaves a gap in the replicated data).
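
The arithmetic in the two examples above can be reproduced with a small standalone sketch. This is a simplified model built only from the numbers in the report, not the actual SerialReplicationChecker code: the class and method names (BarrierCheckSketch, previousRangeEndBarrier, canPush) are hypothetical, and region state, parent regions, and any special handling of the first range are ignored. It assumes that a seqId exactly equal to a barrier is placed in the range that starts at that barrier, so the same barrier is also used as the end of the "previous" range.

```java
import java.util.Arrays;

public class BarrierCheckSketch {

    /**
     * Hypothetical helper: returns the barrier treated as the end of the
     * "previous range" for seqId, or -1 if seqId is before the first barrier.
     */
    static long previousRangeEndBarrier(long[] barriers, long seqId) {
        int index = Arrays.binarySearch(barriers, seqId);
        if (index == -1) {
            // Insertion point 0: seqId is smaller than every barrier.
            return -1L;
        }
        // Miss: convert to the insertion point. Hit: move one past the match,
        // which places seqId in the range that starts at the matched barrier.
        index = (index < 0) ? -index - 1 : index + 1;
        return barriers[index - 1];
    }

    /** Simplified check: the previous range counts as finished once endBarrier - 1 was pushed. */
    static boolean canPush(long[] barriers, long seqId, long pushedSeqId) {
        long endBarrier = previousRangeEndBarrier(barriers, seqId);
        if (endBarrier < 0) {
            return true; // assumption: nothing to wait for before the first barrier
        }
        return pushedSeqId >= endBarrier - 1;
    }

    public static void main(String[] args) {
        // Example 1 from the report: seqId equals the last barrier.
        long[] example1 = {9, 17, 25, 28, 31, 34, 38, 39};
        System.out.println(previousRangeEndBarrier(example1, 39)); // prints "39"
        System.out.println(canPush(example1, 39, 37));             // prints "false"
    }
}
```

With pushedSeqId=37 the model never returns true for seqId=39 unless something advances pushedSeqId to 38, which matches the "has not been finished yet, give up" messages in the logs.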