sidkhillon opened a new pull request, #7909: URL: https://github.com/apache/hbase/pull/7909
When ReplicationSourceWALReader.run() detects a WAL file switch via the switched() check at line 160, it enqueues an EOF batch but does not update currentPosition. If the outer loop subsequently restarts (e.g., due to WALEntryFilterRetryableException), the new WALEntryStream is created with the stale position from the old WAL file, and that position is applied to the new WAL file. The reader then enters an infinite retry loop, attempting to seek to an invalid position, which permanently stalls replication.

The switched() path at line 160 fires when readWALEntries() returns a batch without seeing EOF, either because batch capacity was reached or because an error (e.g., a NameNode timeout) caused hasNext() inside readWALEntries() to return RETRY, breaking the loop early. The next hasNext() at line 153 then detects EOF, dequeues the old file, and returns RETRY_IMMEDIATELY. The switched() check fires because currentPath (captured before hasNext()) was the old file, but the stream's path is now null after the dequeue. At no point in this sequence is currentPosition updated.
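The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyWal, canSeek, and positionAfterSwitch are hypothetical names invented for illustration, not HBase classes or methods, and resetting the position to 0 on a file switch is an assumed remedy, not necessarily what this pull request's patch does.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy model of the stale-position problem; not HBase code.
final class WalSwitchSketch {

  /** A toy WAL file: just a name and a length in bytes. */
  static final class ToyWal {
    final String name;
    final long length;

    ToyWal(String name, long length) {
      this.name = name;
      this.length = length;
    }
  }

  /** A stream can only be opened at an offset that lies inside the file. */
  static boolean canSeek(ToyWal wal, long position) {
    return position >= 0 && position <= wal.length;
  }

  /**
   * Models the file switch: the head WAL has been read up to lastReadPosition
   * and is dequeued. Returns the position a restarted outer loop would use to
   * open the next WAL. With resetOnSwitch == false the old offset is carried
   * over (the behaviour described as the bug); with true it is reset to 0
   * (an assumed fix, chosen for illustration).
   */
  static long positionAfterSwitch(Deque<ToyWal> queue, long lastReadPosition,
      boolean resetOnSwitch) {
    long currentPosition = lastReadPosition;
    queue.poll(); // old WAL fully consumed, removed from the queue
    if (resetOnSwitch) {
      currentPosition = 0L; // start of the new file
    }
    return currentPosition;
  }

  /** Builds a queue holding a long old WAL followed by a short new WAL. */
  static Deque<ToyWal> sampleQueue() {
    Deque<ToyWal> queue = new ArrayDeque<>();
    queue.add(new ToyWal("wal.1", 1000L));
    queue.add(new ToyWal("wal.2", 200L));
    return queue;
  }

  public static void main(String[] args) {
    for (boolean reset : new boolean[] {false, true}) {
      Deque<ToyWal> queue = sampleQueue();
      long position = positionAfterSwitch(queue, 1000L, reset);
      ToyWal next = queue.peek();
      System.out.println("resetOnSwitch=" + reset + " position=" + position
          + " seekable=" + canSeek(next, position));
    }
  }
}
```

In this model, carrying offset 1000 from the old file into a 200-byte new file yields a position that can never be opened, so a retry loop keyed on that position would not make progress. Updating the position at the moment of the switch removes the mismatch.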
