LiebingYu opened a new issue, #2941:
URL: https://github.com/apache/fluss/issues/2941

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Fluss version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   After an unclean shutdown (e.g., `SIGKILL`, OOM, power failure), 
`TabletServer` fails to restart due to a `FlussRuntimeException` wrapping an 
`EOFException` during log recovery. The server becomes completely unrecoverable 
without manual intervention.
   
   After an unclean shutdown, recovery calls `recoverSegment()` only when 
`sanityCheck()` throws (i.e., the index file is corrupt). When the index file 
is intact but the `.log` file has a truncated tail (the most common 
unclean-shutdown scenario), `sanityCheck()` passes, `recoverSegment()` is 
skipped, and a subsequent call to `readNextOffset()` → `maxTimestampSoFar()` → 
`readLargestTimestamp()` reads the partial batch and throws.
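
To illustrate the kind of tail handling that would avoid the crash, here is a minimal sketch that scans a segment file and truncates it to the end of the last complete record. It is not Fluss code: the `[4-byte length][payload]` framing, the class `TruncatedTailScan`, and its method names are all hypothetical and chosen only to keep the example self-contained. The real batch header is 48 bytes according to the stack trace.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Illustrative only: hypothetical [4-byte length][payload] framing, not Fluss's on-disk format. */
class TruncatedTailScan {

    /** Hypothetical header: a single big-endian int holding the payload length. */
    static final int HEADER_SIZE = 4;

    /** Returns the file position just past the last complete record. */
    static long validEnd(FileChannel ch) throws IOException {
        long size = ch.size();
        long pos = 0;
        ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE);
        // Stop as soon as a full header no longer fits (partial header at the tail).
        while (pos + HEADER_SIZE <= size) {
            header.clear();
            while (header.hasRemaining()) {
                int n = ch.read(header, pos + header.position());
                if (n < 0) {
                    return pos; // file shrank underneath us; treat as a partial tail
                }
            }
            header.flip();
            int len = header.getInt();
            // A negative length or a payload running past EOF marks a partial record.
            if (len < 0 || pos + HEADER_SIZE + len > size) {
                break;
            }
            pos += HEADER_SIZE + len;
        }
        return pos;
    }

    /** Truncates the file to the last complete record; returns the number of bytes dropped. */
    static long truncateToValidEnd(Path file) {
        try (FileChannel ch = FileChannel.open(
                file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long size = ch.size();
            long end = validEnd(ch);
            if (end < size) {
                ch.truncate(end);
            }
            return size - end;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one full record plus a partial one, repairs the file, returns {dropped, remaining}. */
    static long[] demo() {
        try {
            Path file = Files.createTempFile("segment", ".log");
            ByteBuffer buf = ByteBuffer.allocate(2 * (HEADER_SIZE + 10));
            buf.putInt(10).put(new byte[10]).putInt(10).put(new byte[10]);
            // Keep the first record (14 bytes) plus 6 bytes of the second one.
            Files.write(file, Arrays.copyOf(buf.array(), 20));
            long dropped = truncateToValidEnd(file);
            long remaining = Files.size(file);
            Files.delete(file);
            return new long[] {dropped, remaining};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println("dropped=" + r[0] + " remaining=" + r[1]);
    }
}
```

The point of the sketch is only that a truncated tail can be detected from the file size alone, without consulting the index, so the check does not depend on `sanityCheck()` failing first.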
   
   **Steps to Reproduce:**
   
   1. Start a `TabletServer` and produce data to a table (e.g., 
`t_geely_2x_freeze_frame_info`, bucket 33).
   2. Force-kill the `TabletServer` process (`kill -9`) while writes are in 
progress, ensuring the active log segment has unflushed data.
   3. Restart the `TabletServer`.
   4. Observe that the server fails to start with the stack trace below.
   
   **Stack Trace:**
   
   ```
   org.apache.fluss.exception.FlussRuntimeException: Failed to recovery log
       at org.apache.fluss.server.log.LogManager.loadLogs(LogManager.java:207)
       at org.apache.fluss.server.log.LogManager.startup(LogManager.java:139)
       at org.apache.fluss.server.tablet.TabletServer.startServices(TabletServer.java:228)
       ...
   Caused by: org.apache.fluss.exception.FlussRuntimeException: Failed to load record batch at position 495460312
       from FileRecords(size=495460352, file=.../00000000000000000000.log, start=0, end=2147483647)
       at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadByteBufferWithSize(FileLogInputStream.java:222)
       at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.loadBatchHeader(FileLogInputStream.java:211)
       at org.apache.fluss.record.FileLogInputStream$FileChannelLogRecordBatch.commitTimestamp(FileLogInputStream.java:134)
       at org.apache.fluss.record.FileLogRecords.largestTimestampAfter(FileLogRecords.java:386)
       at org.apache.fluss.server.log.LogSegment.readLargestTimestamp(LogSegment.java:644)
       at org.apache.fluss.server.log.LogSegment.readMaxTimestampAndStartOffsetSoFar(LogSegment.java:214)
       at org.apache.fluss.server.log.LogSegment.maxTimestampSoFar(LogSegment.java:200)
       at org.apache.fluss.server.log.LogSegment.recover(LogSegment.java:319)
       at org.apache.fluss.server.log.LogLoader.recoverSegment(LogLoader.java:269)
       at org.apache.fluss.server.log.LogLoader.recoverLog(LogLoader.java:168)
       ...
   Caused by: java.io.EOFException: Failed to read `record batch header` from file channel.
       Expected to read 48 bytes, but reached end of file after reading 40 bytes.
       Started read from position 495460312.
       at org.apache.fluss.utils.FileUtils.readFullyOrFail(FileUtils.java:110)
       at org.apache.fluss.utils.FileUtils.loadByteBufferFromFile(FileUtils.java:138)
       ...
   ```
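
The numbers in the trace are consistent with a partial batch header at the tail of the segment. The small check below reproduces that arithmetic. The file size, read position, and 48-byte header size are taken from the trace; the class `PartialHeaderCheck` and its method names are hypothetical and not part of Fluss.

```java
/** Illustrative arithmetic only; class and method names are hypothetical. */
class PartialHeaderCheck {

    /** Bytes left in the file starting at the given read position. */
    static long bytesAvailable(long fileSize, long position) {
        return fileSize - position;
    }

    /** True if a full header of headerSize bytes can be read at the position. */
    static boolean headerFits(long fileSize, long position, int headerSize) {
        return bytesAvailable(fileSize, position) >= headerSize;
    }

    public static void main(String[] args) {
        long fileSize = 495460352L; // size=... from the FileRecords description in the trace
        long position = 495460312L; // "Started read from position ..."
        int headerSize = 48;        // "Expected to read 48 bytes"
        System.out.println("available=" + bytesAvailable(fileSize, position));
        System.out.println("headerFits=" + headerFits(fileSize, position, headerSize));
    }
}
```

Only 40 bytes remain after the read position, which matches the "after reading 40 bytes" part of the `EOFException` message and is 8 bytes short of a full header.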
   
   ### Solution
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
