[PR] KAFKA-8202 Fix stack overflow when batch size is larger than cluster max.message.byte [kafka]

via GitHub Fri, 15 Aug 2025 15:31:40 -0700


knoxy5467 opened a new pull request, #20358:
URL: https://github.com/apache/kafka/pull/20358


   ### Summary
   This PR fixes two critical issues related to producer batch splitting that 
can cause infinite retry loops and stack overflow errors when batch sizes are 
significantly larger than broker-configured message size limits.
   
   ### Issues Addressed
   - **KAFKA-8350**: Producers endlessly retry batch splitting when 
`batch.size` is much larger than the topic-level `max.message.bytes`, so every 
retry fails again with `MESSAGE_TOO_LARGE`
   - **KAFKA-8202**: Stack overflow errors in `FutureRecordMetadata.chain()` 
due to excessive recursive splitting attempts
   
   ### Root Cause
   The existing batch splitting logic in 
`RecordAccumulator.splitAndReenqueue()` always used the configured `batchSize` 
parameter for splitting, regardless of whether the batch had already been split 
before. This caused:
   
   1. **Infinite loops**: When `batch.size` (e.g., 8MB) >> `message.max.bytes` 
(e.g., 1MB), every split produced batches that still exceeded the limit, so the 
broker rejected them and the producer split them again
   2. **Stack overflow**: Repeated splitting attempts created deep call chains 
in the metadata chaining logic
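
   To make the failure mode concrete, here is a minimal sketch of the old behaviour. It is a toy model, not Kafka code: `attemptsUntilAccepted`, `splitTarget`, and `brokerLimit` are hypothetical names, and a batch is reduced to a single byte count whose worst-case size after a split equals the split target.

   ```java
   final class FixedSplitTargetSketch {
       // Hypothetical model, not Kafka code: returns the attempt on which the
       // batch is accepted, or -1 if it is still rejected after maxAttempts.
       static int attemptsUntilAccepted(int batchBytes, int splitTarget,
                                        int brokerLimit, int maxAttempts) {
           int current = batchBytes;
           for (int attempt = 1; attempt <= maxAttempts; attempt++) {
               if (current <= brokerLimit) {
                   return attempt; // broker accepts the batch
               }
               // Old behaviour as described above: always re-split to the same target.
               current = Math.min(current, splitTarget);
           }
           return -1;
       }

       public static void main(String[] args) {
           int mb = 1024 * 1024;
           // Split target (8MB) above the limit (1MB): never accepted.
           System.out.println(attemptsUntilAccepted(8 * mb, 8 * mb, mb, 1000)); // prints -1
           // Split target (512KB) below the limit: accepted on the second attempt.
           System.out.println(attemptsUntilAccepted(8 * mb, mb / 2, mb, 1000)); // prints 2
       }
   }
   ```

   In this model the retry count is bounded only by `maxAttempts`, which mirrors why the real producer kept retrying when the split target never dropped below the broker limit.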
   
   ### Solution
   Implemented progressive batch splitting logic:
   
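   ```java
   // First split: keep the configured batch.size as the target.
   int maxBatchSize = this.batchSize;
   if (bigBatch.isSplitBatch()) {
       // Subsequent splits: halve the target, but never go below the largest record.
       maxBatchSize = Math.max(bigBatch.maxRecordSize,
               bigBatch.estimatedSizeInBytes() / 2);
   }
   ```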
   
   **Key improvements:**
   
   - **First split**: Uses the original `batchSize` (maintains backward 
compatibility)
   
   - **Subsequent splits**: Uses the larger of:
   
     - `maxRecordSize`: Ensures a batch can always be split down to individual 
records
     - `estimatedSizeInBytes() / 2`: Halves the target on each retry, so the 
batch size shrinks geometrically and converges quickly
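
   The convergence argument can be checked with a small simulation. This is a sketch under stated assumptions rather than the PR's implementation: `nextSplitTarget` mirrors the snippet above, while `splitsUntilAccepted`, `brokerLimit`, and the reduction of a batch to a single byte count (worst case: the resulting batch is exactly as large as the target) are hypothetical simplifications.

   ```java
   final class ProgressiveSplitSketch {
       // Mirrors the rule above: configured size on the first split,
       // max(largest record, half the current size) on later splits.
       static int nextSplitTarget(int configuredBatchSize, boolean alreadySplit,
                                  int maxRecordSize, int estimatedSizeInBytes) {
           if (!alreadySplit) {
               return configuredBatchSize;
           }
           return Math.max(maxRecordSize, estimatedSizeInBytes / 2);
       }

       // Hypothetical model: counts splits until the batch fits under brokerLimit,
       // or returns -1 when a single record is itself larger than the limit.
       static int splitsUntilAccepted(int batchBytes, int configuredBatchSize,
                                      int maxRecordSize, int brokerLimit) {
           int current = batchBytes;
           boolean alreadySplit = false;
           int splits = 0;
           while (current > brokerLimit) {
               int target = nextSplitTarget(configuredBatchSize, alreadySplit,
                                            maxRecordSize, current);
               int next = Math.min(current, target);
               if (alreadySplit && next == current) {
                   return -1; // cannot shrink further
               }
               current = next;
               alreadySplit = true;
               splits++;
           }
           return splits;
       }

       public static void main(String[] args) {
           int mb = 1024 * 1024;
           // 8MB batch, 1KB records, 1MB limit: 8MB -> 8MB -> 4MB -> 2MB -> 1MB.
           System.out.println(splitsUntilAccepted(8 * mb, 8 * mb, 1024, mb)); // prints 4
       }
   }
   ```

   In this model the first split makes no progress because it still uses the configured size, and each later split halves the batch, so the number of splits grows only logarithmically with the ratio between `batch.size` and the limit. The `-1` case shows why the `maxRecordSize` floor matters: a single record larger than the limit cannot be fixed by splitting.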
   
   ### Testing
   
   Added comprehensive test `testSplitAndReenqueuePreventInfiniteRecursion()` 
that:
   
   - Creates oversized batches with 100 records of 1KB each
   - Verifies splitting can reduce batches to single-record size
   - Ensures no infinite recursion (safety limit of 100 operations)
   - Validates no data loss or duplication during splitting
   - Confirms all original records are preserved with correct keys
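
   The invariants listed above can be illustrated with a self-contained sketch. It does not use `RecordAccumulator` or the actual test code: `SplitInvariantSketch`, `splitUntilFits`, and the halving-by-record-count strategy are assumptions made for illustration, with each record modelled as an integer key of 1KB.

   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Deque;
   import java.util.List;

   final class SplitInvariantSketch {
       static final int RECORD_BYTES = 1024; // each modelled record is 1KB

       // Hypothetical model: halves batches until each fits under limitBytes
       // (or holds one record), failing if more than maxSplits splits are needed.
       static List<List<Integer>> splitUntilFits(List<Integer> keys, int limitBytes,
                                                 int maxSplits) {
           Deque<List<Integer>> pending = new ArrayDeque<>();
           pending.add(keys);
           List<List<Integer>> accepted = new ArrayList<>();
           int splits = 0;
           while (!pending.isEmpty()) {
               List<Integer> batch = pending.poll();
               if (batch.size() * RECORD_BYTES <= limitBytes || batch.size() == 1) {
                   accepted.add(batch);
                   continue;
               }
               if (++splits > maxSplits) {
                   throw new IllegalStateException("split limit exceeded");
               }
               int mid = batch.size() / 2;
               pending.add(new ArrayList<>(batch.subList(0, mid)));
               pending.add(new ArrayList<>(batch.subList(mid, batch.size())));
           }
           return accepted;
       }

       public static void main(String[] args) {
           List<Integer> keys = new ArrayList<>();
           for (int i = 0; i < 100; i++) {
               keys.add(i);
           }
           // 100 records of 1KB, limit of one record, at most 100 splits.
           System.out.println(splitUntilFits(keys, RECORD_BYTES, 100).size()); // prints 100
       }
   }
   ```

   With 100 records and a one-record limit, this model needs 99 splits, so it stays within a limit of 100 operations, and every key appears in exactly one resulting batch.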
   
   ### Backward Compatibility
   
   - No breaking changes to public APIs
   - First split attempt still uses original `batchSize` configuration
   - Progressive splitting only engages for retry scenarios
   


