[
https://issues.apache.org/jira/browse/HBASE-29987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sid Khillon reassigned HBASE-29987:
-----------------------------------
Assignee: Sid Khillon
> Replication position corruption when WAL file switch detected in
> ReplicationSourceWALReader run loop
> ----------------------------------------------------------------------------------------------------
>
> Key: HBASE-29987
> URL: https://issues.apache.org/jira/browse/HBASE-29987
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal, Zookeeper
> Reporter: Sid Khillon
> Assignee: Sid Khillon
> Priority: Minor
>
> When {{ReplicationSourceWALReader.run()}} detects a WAL file switch via the
> {{switched()}} check at line 160, it enqueues an EOF batch but does not
> update {{currentPosition}}. If the outer loop subsequently restarts
> (e.g., due to {{WALEntryFilterRetryableException}}), the new
> {{WALEntryStream}} is created with the stale position from the old WAL file,
> which gets applied to the new WAL file. This causes the reader to enter an
> infinite retry loop attempting to seek to an invalid position, permanently
> stalling replication.
>
> The {{switched()}} path at line 160 fires when {{readWALEntries()}} returns a
> batch without seeing EOF, either because batch capacity was reached or
> because an error (e.g., NameNode timeout) caused {{hasNext()}} inside
> {{readWALEntries()}} to return RETRY, breaking the loop early. The next
> {{hasNext()}} at line 153 then detects EOF, dequeues the old file, and
> returns {{RETRY_IMMEDIATELY}}. The {{switched()}} check fires because
> {{currentPath}} (captured before {{hasNext()}}) was the old file, but
> the stream’s path is now null after the dequeue. {{currentPosition}} is not
> updated, so it still holds the offset into the old file.
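
The sequence described above can be sketched as a small, self-contained Java model. This is an illustration only, not HBase code: the class StalePositionSketch, the Wal type, the file names, and the lengths are all hypothetical. It also makes two assumptions the report does not state: that an "invalid position" means an offset past the end of the new file, and that a fix would reset the position to 0 when a switch is detected.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;

/** Toy model of the stale-position problem; none of these names are HBase APIs. */
public class StalePositionSketch {

    /** Hypothetical stand-in for a WAL file: a name plus its length in bytes. */
    static final class Wal {
        final String name;
        final long length;

        Wal(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /**
     * Models one pass of the reader loop in which the old WAL hits EOF and is dequeued.
     * Returns the position a restarted stream would use for the new head of the queue.
     */
    static long positionAfterSwitch(Deque<Wal> queue, long currentPosition, boolean resetOnSwitch) {
        Wal pathBefore = queue.peek();   // captured before the hasNext()-like step
        queue.poll();                    // EOF detected: the old file is dequeued
        Wal streamPath = null;           // the stream has no open file right after the dequeue
        boolean switched = !Objects.equals(pathBefore, streamPath);
        if (switched && resetOnSwitch) {
            currentPosition = 0L;        // assumed fix: the offset belonged to the old file only
        }
        return currentPosition;
    }

    /** Assumption: a seek is valid only if the offset lies within the file. */
    static boolean canSeek(Wal wal, long position) {
        return position >= 0 && position <= wal.length;
    }

    /** Runs the scenario and reports whether a restarted reader can seek in the new WAL. */
    static boolean restartSucceeds(boolean resetOnSwitch) {
        Deque<Wal> queue = new ArrayDeque<>();
        queue.add(new Wal("wal.1", 10_000L)); // old file, mostly consumed
        queue.add(new Wal("wal.2", 500L));    // new file, shorter than the stale offset
        long position = positionAfterSwitch(queue, 9_800L, resetOnSwitch);
        return canSeek(queue.peek(), position);
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + restartSucceeds(false));
        System.out.println("with reset:    " + restartSucceeds(true));
    }
}
```

In this model, leaving the position untouched carries offset 9800 into a 500-byte file, so every restart fails the same way; that is the retry loop the report describes. Whether the real fix resets the position or handles the switch differently is not specified in the issue text.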