ShivramSriramulu opened a new pull request, #20513:
URL: https://github.com/apache/kafka/pull/20513

   ## Summary
   
   This PR enhances MirrorMaker 2 (MM2) with fault-tolerance capabilities to address critical data loss scenarios in cross-cluster replication setups.
   
   ## Problem Statement
   
   Vanilla MM2 has two critical gaps:
   1. **Silent Data Loss**: Retention policies may purge messages before replication completes, leaving gaps in the replicated data that go undetected
   2. **Service Disruption**: Deleting and recreating a topic can cause replication to fail or stall
   
   ## Solution
   
   Added fault-tolerance enhancements to `MirrorSourceTask`:
   
   ### Fail-Fast Truncation Detection
   - Catches `OffsetOutOfRangeException` during consumer polling
   - Logs detailed diagnostics with partition assignments and earliest offsets
   - Throws `ConnectException` to fail fast and alert operators immediately
   - Configurable via `mirrorsource.fail.on.truncation=true` (default)
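
The fail-fast path described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the nested `OffsetOutOfRangeException` and `ConnectException` are stand-ins for the Kafka classes of the same name so the example compiles without Kafka on the classpath, `pollOnce` and `TruncationGuardSketch` are hypothetical names, and the behavior when the flag is `false` (log and return an empty batch) is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Sketch only: the nested exception types are stand-ins for the Kafka classes of the same name. */
final class TruncationGuardSketch {

    /** Stand-in for org.apache.kafka.clients.consumer.OffsetOutOfRangeException. */
    static final class OffsetOutOfRangeException extends RuntimeException {
        final Map<String, Long> requestedOffsets; // "topic-partition" -> offset the consumer asked for

        OffsetOutOfRangeException(Map<String, Long> requestedOffsets) {
            super("Offsets out of range: " + requestedOffsets);
            this.requestedOffsets = requestedOffsets;
        }
    }

    /** Stand-in for org.apache.kafka.connect.errors.ConnectException. */
    static final class ConnectException extends RuntimeException {
        ConnectException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs one poll. With failOnTruncation enabled, an out-of-range offset aborts the task
     * with diagnostics attached. With it disabled, this sketch logs a warning and returns
     * an empty batch (an assumed fallback, not confirmed by the PR description).
     */
    static <T> List<T> pollOnce(Supplier<List<T>> poll,
                                Map<String, Long> earliestOffsets,
                                boolean failOnTruncation) {
        try {
            return poll.get();
        } catch (OffsetOutOfRangeException e) {
            String diagnostics = "requested=" + e.requestedOffsets
                    + ", earliestAvailable=" + earliestOffsets;
            if (failOnTruncation) {
                throw new ConnectException(
                        "Source log truncated before replication completed: " + diagnostics, e);
            }
            System.err.println("WARN truncation detected, continuing: " + diagnostics);
            return List.of();
        }
    }
}
```

Wrapping the original exception as the cause keeps the consumer's stack trace available to operators, while the message carries the requested and earliest offsets side by side.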
   
   ### Graceful Topic Reset Handling  
   - Uses `AdminClient` to track topic IDs and detect delete/recreate events
   - Automatically seeks to beginning offset for reset topics
   - Handles `UnknownTopicOrPartitionException` with retry logic
   - Configurable via `mirrorsource.auto.recover.on.reset=true` (default)
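
As a rough illustration of the topic-ID tracking idea, the sketch below remembers the last ID seen for each topic and reports a reset when the ID changes. It is not the PR's implementation: `TopicResetTrackerSketch` and `observe` are hypothetical names, and `java.util.UUID` stands in for Kafka's own topic ID type so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Sketch only: tracks the last topic ID seen per topic name to spot delete/recreate events. */
final class TopicResetTrackerSketch {

    // java.util.UUID is a stand-in for Kafka's topic ID type in this sketch.
    private final Map<String, UUID> lastSeenIds = new HashMap<>();

    /**
     * Records the current ID for a topic.
     *
     * @return true if the topic was seen before with a different ID, which is how this
     *         sketch infers that the topic was deleted and recreated
     */
    boolean observe(String topic, UUID currentId) {
        UUID previous = lastSeenIds.put(topic, currentId);
        return previous != null && !previous.equals(currentId);
    }
}
```

In a real task the IDs would presumably come from a periodic `AdminClient` topic description call, and a `true` result would trigger the seek to the beginning offset for that topic's partitions. The first observation of a topic is deliberately not treated as a reset.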
   
   ## Technical Details
   
   - **File Modified**: `connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java`
   - **Lines Added**: ~75 LOC (well under the 500 LOC requirement)
   - **Backward Compatibility**: Maintained - all changes are additive
   - **Configuration**: New properties with sensible defaults
   - **Logging**: Uses dedicated logger `mm2.fault.tolerance` for easy filtering
   
   ## Configuration Properties
   
   ```properties
   # Fail-fast on truncation (default: true)
   mirrorsource.fail.on.truncation=true
   
   # Auto-recover on topic reset (default: true)  
   mirrorsource.auto.recover.on.reset=true
   
   # Retry delay for topic reset (default: 5000ms)
   mirrorsource.topic.reset.retry.ms=5000
   ```
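
For illustration, the three properties and their documented defaults could be read from a task's property map along these lines. This is a simplified sketch under stated assumptions: `FaultToleranceSettingsSketch` is a hypothetical class, and the PR may well define these settings through Kafka's config definition machinery rather than parsing strings directly.

```java
import java.util.Map;

/** Sketch only: reads the three properties above, falling back to the documented defaults. */
final class FaultToleranceSettingsSketch {

    final boolean failOnTruncation;
    final boolean autoRecoverOnReset;
    final long topicResetRetryMs;

    FaultToleranceSettingsSketch(Map<String, String> props) {
        // Property names and defaults are taken from the configuration listing above.
        failOnTruncation = Boolean.parseBoolean(
                props.getOrDefault("mirrorsource.fail.on.truncation", "true"));
        autoRecoverOnReset = Boolean.parseBoolean(
                props.getOrDefault("mirrorsource.auto.recover.on.reset", "true"));
        topicResetRetryMs = Long.parseLong(
                props.getOrDefault("mirrorsource.topic.reset.retry.ms", "5000"));
    }
}
```

With an empty map all three fields take their defaults, so existing deployments that set none of the properties get fail-fast and auto-recovery enabled.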
   
   ## Testing
   
   - Test scenarios are in the companion repository
   - Docker-based demo with Primary/DR clusters
   - Validates both fail-fast and auto-recovery behaviors
   - Test repository: https://github.com/ShivramSriramulu/Tiger_MM2
   
   ## Impact
   
   - **RPO Improvement**: Makes data loss immediately visible instead of silent
   - **RTO Improvement**: Reduces manual intervention during maintenance
   - **Operational**: Clear error messages for troubleshooting
   - **Production Ready**: Minimal performance impact, configurable behavior
   
   ## Related
   
   - Companion demo repository: https://github.com/ShivramSriramulu/Tiger_MM2
   - Docker images: shivramsriramulu/enhanced-mm2:latest

