Hi ZooKeeper community,

I'd like to raise a discussion about a potential data integrity gap in ZooKeeper's startup recovery process: specifically, the lack of zxid continuity validation when replaying transaction logs.

*Problem Description*

We recently encountered a production incident where an intermediate transaction log file was accidentally deleted (e.g., by an overly aggressive cleanup script or manual error). When the ZooKeeper server restarted, it loaded the latest valid snapshot and replayed the remaining transaction logs without detecting the missing file. This resulted in a silent data hole: thousands of transactions were skipped, and the server came up with an inconsistent state that was not immediately apparent.

*Root Cause Analysis*

After examining the source code (branch release-3.6.3), we identified that no zxid continuity check exists in the startup recovery path:

- FileTxnSnapLog.fastForwardFromEdits(): The replay loop only checks for zxid regression (hdr.getZxid() < highestZxid), but does not verify that each zxid is exactly previousZxid + 1 (within the same epoch). A gap from, say, 0x9000379fe to 0x900046011 (58,898 missing transactions) would go completely unnoticed.
- FileTxnIterator.next(): When reaching EOF of one log file, it simply calls goToNextLog() to switch to the next file, without validating that the last zxid of the previous file and the first zxid of the next file are consecutive.
- FileTxnIterator.init(): Files are selected and sorted purely by their filename zxid; no check is performed to ensure the file set forms a complete, gapless sequence.

Interestingly, LogChopper (an offline tool) already contains gap detection logic between transactions, which suggests the community is aware of this concern, but this check is not present in the critical startup path.

*Impact*

When a transaction log file is missing:

- The server starts successfully with no error or warning.
- The in-memory data tree silently skips all transactions in the missing range.
- Clients observe missing znodes, stale data, or inconsistent state.
- In a cluster, this can lead to divergent state between replicas that is very difficult to diagnose.

*Proposal*

We would like to propose adding an optional zxid continuity check during transaction log replay at startup. Specifically:

- In fastForwardFromEdits(), track the previous zxid and validate that each new zxid equals previousZxid + 1 (with an exception for epoch transitions where the counter resets to 1).
- If a gap is detected, log a clear ERROR message indicating the gap range and the number of missing transactions.
- Introduce a configuration option (e.g., zookeeper.txnlog.integrityCheck.enabled, default false for backward compatibility) that, when set to true, causes the server to refuse to start if a gap is detected, rather than silently proceeding with incomplete data.

This would be a relatively small, low-risk change that could prevent a class of silent data corruption issues.

*Questions for the Community*

1. Has this scenario been discussed before? I couldn't find an existing JIRA issue covering this specific case.
2. Would the community be open to a contribution that adds this validation? If so, should it be a hard failure or a warning by default?

Thank you for your time and feedback. Happy to submit a JIRA issue and a PR if there is interest.

Best regards
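
P.S. To make the proposal more concrete, here is a rough, standalone sketch of the kind of check we have in mind. It is not a patch against ZooKeeper: the class ZxidContinuityCheck and its methods are hypothetical names invented for this mail, and the failOnGap flag stands in for the proposed (not yet existing) zookeeper.txnlog.integrityCheck.enabled option. It relies on the zxid layout (epoch in the high 32 bits, counter in the low 32 bits) and on the assumption stated above that the first zxid of a new epoch has counter 1. A real change would hook into the replay loop in fastForwardFromEdits() and would need review for corner cases we may have missed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not an existing ZooKeeper class. */
final class ZxidContinuityCheck {

    /** The high 32 bits of a zxid hold the epoch. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** The low 32 bits of a zxid hold the per-epoch counter. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /**
     * Returns null when next is an acceptable successor of prev,
     * otherwise a human-readable description of the problem.
     */
    static String describeGap(long prev, long next) {
        if (next <= prev) {
            // Regression or duplicate; the existing replay loop already checks regression.
            return String.format("zxid 0x%x does not advance past 0x%x", next, prev);
        }
        if (next == prev + 1) {
            return null; // normal case: consecutive zxids
        }
        if (epochOf(next) == epochOf(prev)) {
            long missing = counterOf(next) - counterOf(prev) - 1;
            return String.format(
                "intra-epoch gap between 0x%x and 0x%x (%d missing transactions)",
                prev, next, missing);
        }
        if (counterOf(next) == 1) {
            // Assumption: a new epoch starts at counter 1. The tail of the
            // previous epoch cannot be verified from the zxids alone.
            return null;
        }
        return String.format("inter-epoch gap between 0x%x and 0x%x", prev, next);
    }

    /** Walks the replayed zxids in order, starting from the last zxid already applied. */
    static List<String> findGaps(long lastAppliedZxid, long[] replayedZxids) {
        List<String> gaps = new ArrayList<>();
        long prev = lastAppliedZxid;
        for (long zxid : replayedZxids) {
            String gap = describeGap(prev, zxid);
            if (gap != null) {
                gaps.add(gap);
            }
            prev = zxid;
        }
        return gaps;
    }

    /** Always reports gaps; refuses to continue only when failOnGap is set. */
    static void enforce(List<String> gaps, boolean failOnGap) {
        for (String gap : gaps) {
            System.err.println("ERROR: " + gap);
        }
        if (failOnGap && !gaps.isEmpty()) {
            throw new IllegalStateException(
                gaps.size() + " zxid gap(s) detected during transaction log replay");
        }
    }
}
```

With the zxids from the example above, describeGap(0x9000379feL, 0x900046011L) reports an intra-epoch gap with 58898 missing transactions, while a jump from 0x9000379fe to 0xa00000001 is accepted as an epoch transition. Keeping detection (findGaps) separate from the policy (enforce) is only one possible design; it would let the default behaviour stay a logged ERROR while the hard failure remains opt-in.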
