LiebingYu opened a new issue, #2941: URL: https://github.com/apache/fluss/issues/2941
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar.

### Fluss version

main (development)

### Please describe the bug 🐞

After an unclean shutdown (e.g., `SIGKILL`, OOM, power failure), `TabletServer` fails to restart due to a `FlussRuntimeException` wrapping an `EOFException` during log recovery. The server becomes completely unrecoverable without manual intervention.

On unclean shutdown, `recoverSegment()` is only called when `sanityCheck()` throws (i.e., the index file is corrupt). When the index file is intact but the `.log` file has a truncated tail (the most common unclean shutdown scenario), `sanityCheck()` passes, `recoverSegment()` is skipped, and a subsequent call to `readNextOffset()` → `maxTimestampSoFar()` → `readLargestTimestamp()` hits the partial batch and crashes.

**Steps to Reproduce:**

1. Start a `TabletServer` and produce data to a table (e.g., `t_geely_2x_freeze_frame_info`, bucket 33).
2. Force-kill the `TabletServer` process (`kill -9`) while writes are in progress, ensuring the active log segment has unflushed data.
3. Restart the `TabletServer`.
4. Observe that the server fails to start with the stack trace below.

**Stack Trace:**

```
org.apache.fluss.exception.FlussRuntimeException: Failed to recovery log
    at org.apache.fluss.server.log.LogManager.loadLogs(LogManager.java:207)
    at org.apache.fluss.server.log.LogManager.startup(LogManager.java:139)
    at org.apache.fluss.server.tablet.TabletServer.startServices(TabletServer.java:228)
    ...
Caused by: org.apache.fluss.exception.FlussRuntimeException: Failed to load record batch at position 495460312 from FileRecords(size=495460352, file=.../00000000000000000000.log, start=0, end=2147483647)
    at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadByteBufferWithSize(FileLogInputStream.java:222)
    at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadBatchHeader(FileLogInputStream.java:211)
    at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.commitTimestamp(FileLogInputStream.java:134)
    at org.apache.fluss.record.FileLogRecords.largestTimestampAfter(FileLogRecords.java:386)
    at org.apache.fluss.server.log.LogSegment.readLargestTimestamp(LogSegment.java:644)
    at org.apache.fluss.server.log.LogSegment.readMaxTimestampAndStartOffsetSoFar(LogSegment.java:214)
    at org.apache.fluss.server.log.LogSegment.maxTimestampSoFar(LogSegment.java:200)
    at org.apache.fluss.server.log.LogSegment.recover(LogSegment.java:319)
    at org.apache.fluss.server.log.LogLoader.recoverSegment(LogLoader.java:269)
    at org.apache.fluss.server.log.LogLoader.recoverLog(LogLoader.java:168)
    ...
Caused by: java.io.EOFException: Failed to read `record batch header` from file channel. Expected to read 48 bytes, but reached end of file after reading 40 bytes. Started read from position 495460312.
    at org.apache.fluss.utils.FileUtils.readFullyOrFail(FileUtils.java:110)
    at org.apache.fluss.utils.FileUtils.loadByteBufferFromFile(FileUtils.java:138)
    ...
```

### Solution

_No response_

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
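
### Illustrative sketch: truncating a partially written tail

One possible way to tolerate the truncated tail described in this report is to scan the segment from the start, stop at the first batch that is not fully present, and truncate the file there. The sketch below only illustrates that idea and is not Fluss code. It assumes a simplified, hypothetical framing (a 4-byte big-endian length prefix followed by the payload) instead of the real 48-byte batch header, and the names `TruncatedTailScan`, `validPrefixLength`, and `truncateToValidPrefix` are invented for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch; the framing used here is NOT Fluss's on-disk format. */
final class TruncatedTailScan {

    /** Assumed framing: 4-byte big-endian payload length, then the payload bytes. */
    static final int HEADER_SIZE = 4;

    /** Returns the length in bytes of the longest prefix made only of complete batches. */
    static long validPrefixLength(FileChannel channel) throws IOException {
        long fileSize = channel.size();
        long position = 0;
        ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE);

        // Stop as soon as fewer bytes remain than a full header needs
        // (the analogue of "expected 48 bytes, read 40" in the report).
        while (fileSize - position >= HEADER_SIZE) {
            header.clear();
            long readPosition = position;
            while (header.hasRemaining()) {
                int n = channel.read(header, readPosition);
                if (n < 0) {
                    return position; // file shrank underneath us; treat as a partial tail
                }
                readPosition += n;
            }
            header.flip();
            int payloadLength = header.getInt();

            long remainingAfterHeader = fileSize - position - HEADER_SIZE;
            if (payloadLength < 0 || payloadLength > remainingAfterHeader) {
                return position; // payload is incomplete or the length field is garbage
            }
            position += HEADER_SIZE + payloadLength;
        }
        return position;
    }

    /** Truncates the file to its valid prefix and returns the number of bytes dropped. */
    static long truncateToValidPrefix(Path logFile) throws IOException {
        try (FileChannel channel =
                FileChannel.open(logFile, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long valid = validPrefixLength(channel);
            long dropped = channel.size() - valid;
            if (dropped > 0) {
                channel.truncate(valid);
                channel.force(true); // persist the new length before continuing
            }
            return dropped;
        }
    }
}
```

Under these assumptions, running such a scan for every segment that was not cleanly closed, rather than only when the index sanity check fails, would remove the partial batch before any later read of the segment's largest timestamp. A real fix would also need to validate batch checksums and rebuild the index entries past the truncation point, which this sketch omits.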
