Hi ZooKeeper community,

I'd like to raise a discussion about a potential data integrity gap in
ZooKeeper's startup recovery process — specifically, the lack of zxid
continuity validation when replaying transaction logs.

*Problem Description*
We recently encountered a production incident where an intermediate
transaction log file was accidentally deleted (e.g., by an overly
aggressive cleanup script or manual error). When the ZooKeeper server
restarted, it loaded the latest valid snapshot and replayed the remaining
transaction logs without detecting the missing file. This resulted in a
silent data hole — thousands of transactions were skipped, and the server
came up with an inconsistent state that was not immediately apparent.

*Root Cause Analysis*
After examining the source code (branch release-3.6.3), we identified that
no zxid continuity check exists in the startup recovery path:
- FileTxnSnapLog.fastForwardFromEdits() — the replay loop only checks for
zxid regression (hdr.getZxid() < highestZxid); it never verifies that each
zxid is exactly previousZxid + 1 (within the same epoch). A gap from, say,
0x9000379fe to 0x900046011 (58,898 missing transactions) goes completely
unnoticed.
- FileTxnIterator.next() — on reaching EOF of one log file, it simply calls
goToNextLog() to switch to the next file, without validating that the last
zxid of the previous file and the first zxid of the next file are
consecutive.
- FileTxnIterator.init() — files are selected and sorted purely by the zxid
in their filenames; no check ensures the file set forms a complete, gapless
sequence.
Interestingly, LogChopper (an offline tool) already contains gap detection
logic between transactions, which suggests the community is aware of this
concern — but this check is not present in the critical startup path.
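
To make the gap concrete, here is a minimal sketch of what a zxid continuity check could look like. This is illustrative code, not existing ZooKeeper source; the class and method names are hypothetical. It relies only on the documented zxid layout (epoch in the high 32 bits, counter in the low 32 bits):

```java
// Hypothetical sketch of a zxid continuity check (not ZooKeeper code).
// A zxid packs the leader epoch in the high 32 bits and a per-epoch
// counter in the low 32 bits.
public class ZxidContinuity {
    static long epoch(long zxid)   { return zxid >>> 32; }
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }

    // True if 'current' directly follows 'previous': either the next
    // counter within the same epoch, or the first transaction (counter 1)
    // of a later epoch.
    static boolean isConsecutive(long previous, long current) {
        if (epoch(current) == epoch(previous)) {
            return counter(current) == counter(previous) + 1;
        }
        return epoch(current) > epoch(previous) && counter(current) == 1;
    }

    public static void main(String[] args) {
        // The gap from the incident described above:
        System.out.println(isConsecutive(0x9000379feL, 0x900046011L)); // false
        System.out.println(isConsecutive(0x9000379feL, 0x9000379ffL)); // true
        // Missing transactions within the same epoch:
        System.out.println(counter(0x900046011L) - counter(0x9000379feL) - 1); // 58898
    }
}
```

Note that the epoch-transition branch is what keeps the check from firing on a legitimate leader change, where the counter resets rather than incrementing.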

*Impact*
When a transaction log file is missing:
- The server starts successfully with no error or warning.
- The in-memory data tree silently skips all transactions in the missing
range.
- Clients observe missing znodes, stale data, or inconsistent state.
- In a cluster, this can lead to divergent state between replicas that is
very difficult to diagnose.

*Proposal*
We would like to propose adding an optional zxid continuity check during
transaction log replay at startup. Specifically:
1. In fastForwardFromEdits(), track the previous zxid and validate that
each new zxid equals previousZxid + 1 (with an exception for epoch
transitions, where the counter resets to 1).
2. If a gap is detected, log a clear ERROR message indicating the gap range
and the number of missing transactions.
3. Introduce a configuration option (e.g.,
zookeeper.txnlog.integrityCheck.enabled, default false for backward
compatibility) that, when set to true, causes the server to refuse to start
if a gap is detected, rather than silently proceeding with incomplete data.
This would be a relatively small, low-risk change that could prevent a
class of silent data corruption issues.

*Questions for the Community*
1. Has this scenario been discussed before? I couldn't find an existing
JIRA issue covering this specific case.
2. Would the community be open to a contribution that adds this validation?
If so, should it be a hard failure or a warning by default?



Thank you for your time and feedback. Happy to submit a JIRA issue and a PR
if there is interest.
Best regards,
