avshenuk opened a new issue, #16359: URL: https://github.com/apache/pinot/issues/16359
### What needs to be done:
Add upsert support to RealtimeToOfflineSegmentsTask to enable Pinot-managed offline flows for upsert-enabled realtime tables.

### Why the feature is needed:
Currently, RealtimeToOfflineSegmentsTask doesn't support upsert tables, creating a gap in Pinot's managed offline flows for hybrid upsert architectures. This prevents users from leveraging the efficient realtime-to-offline conversion pipeline when using upsert tables.

This feature is particularly valuable as it enables a clean separation between realtime tables with upsert support (handling mutable, evolving data) and offline tables containing the "final" immutable state of segments. This architecture allows users to benefit from upsert semantics during real-time ingestion while maintaining optimized, deduplicated offline segments for historical queries and analytics workloads.

### Proposed Implementation:
1. **Upsert Segment Validation**: Validate segments using validDocIds metadata from servers with CRC checks and server state validation. Fail task generation when metadata is missing since time windows cannot be reprocessed, preventing silent data loss.
2. **Invalid Record Filtering**: Integrate CompactedPinotSegmentRecordReader with the existing SegmentProcessorFramework to filter out invalid records during segment processing, naturally combining with time windowing, partitioning, merging, and sorting operations.

### Benefits:
- Fills the gap for missing upsert support in Pinot's managed offline flow
- Natural integration with existing segment processing pipeline (time filtering, sorting, etc.)
- Maintains data consistency through proper validation and prevents silent data loss
- Optimal timing for upsert compaction: Performs filtering during the realtime-to-offline conversion when data is already being processed, avoiding the need for separate compaction passes and maximizing efficiency
- Operates independently of other upsert maintenance tasks with clear separation of concerns:
  - other upsert tasks (UpsertCompactionTask, UpsertCompactMergeTask) optimize realtime tables, while this task creates optimized segments in offline tables. This allows maintaining efficiency within realtime tables (if needed) while continuously updating offline tables with deduplicated historical data

### Backward Compatibility:
This implementation maintains full backward compatibility for existing non-upsert tables. The upsert-specific logic (validDocIds validation and CompactedPinotSegmentRecordReader filtering) is only applied when the table has upsert configuration enabled. For non-upsert tables, the RealtimeToOfflineSegmentsTask continues to operate with the exact same behavior as before, ensuring no breaking changes to existing workflows.

### Related Issues:
- Related to #12261 discussion on hybrid table upsert configurations

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org
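
For illustration, the two steps under "Proposed Implementation" could be sketched roughly as follows. This is a minimal, self-contained sketch and not Pinot code: `UpsertR2OSketch`, `ValidDocIdsInfo`, `validateSegments`, and `filterValidRecords` are hypothetical names, and `java.util.BitSet` is an assumed stand-in for whatever bitmap type the servers actually return. The real change would go through CompactedPinotSegmentRecordReader and SegmentProcessorFramework, whose APIs are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every type and method here is hypothetical, not Pinot's API.
final class UpsertR2OSketch {

    /** Hypothetical view of the validDocIds metadata one server reports for a segment. */
    static final class ValidDocIdsInfo {
        final String crc;
        final boolean serverReady;
        final BitSet validDocIds;

        ValidDocIdsInfo(String crc, boolean serverReady, BitSet validDocIds) {
            this.crc = crc;
            this.serverReady = serverReady;
            this.validDocIds = validDocIds;
        }
    }

    /**
     * Step 1: refuse to generate a task unless every segment in the window has
     * validDocIds metadata from a ready server whose CRC matches the expected one.
     */
    static void validateSegments(Map<String, String> expectedCrcBySegment,
                                 Map<String, ValidDocIdsInfo> reportedBySegment) {
        for (Map.Entry<String, String> entry : expectedCrcBySegment.entrySet()) {
            String segment = entry.getKey();
            ValidDocIdsInfo info = reportedBySegment.get(segment);
            if (info == null) {
                // The time window cannot be reprocessed later, so fail loudly.
                throw new IllegalStateException("Missing validDocIds metadata for " + segment);
            }
            if (!info.serverReady) {
                throw new IllegalStateException("Server not ready for " + segment);
            }
            if (!entry.getValue().equals(info.crc)) {
                throw new IllegalStateException("CRC mismatch for " + segment);
            }
        }
    }

    /** Step 2: keep only the rows whose doc ID is set in the validDocIds bitmap. */
    static <T> List<T> filterValidRecords(List<T> rows, BitSet validDocIds) {
        List<T> kept = new ArrayList<>();
        for (int docId = validDocIds.nextSetBit(0);
             docId >= 0 && docId < rows.size();
             docId = validDocIds.nextSetBit(docId + 1)) {
            kept.add(rows.get(docId));
        }
        return kept;
    }

    public static void main(String[] args) {
        BitSet valid = new BitSet();
        valid.set(1);
        valid.set(2);

        Map<String, String> expected = new HashMap<>();
        expected.put("seg_0", "crc-a");
        Map<String, ValidDocIdsInfo> reported = new HashMap<>();
        reported.put("seg_0", new ValidDocIdsInfo("crc-a", true, valid));

        validateSegments(expected, reported);

        // Row 0 was superseded by a later upsert of the same key, so it is dropped.
        List<String> rows = Arrays.asList("k1:v1", "k1:v2", "k2:v1");
        System.out.println(filterValidRecords(rows, valid)); // prints [k1:v2, k2:v1]
    }
}
```

In this sketch validation throws instead of skipping a segment, mirroring the proposal's point that a missed time window cannot be reprocessed. Filtering is a pure function over rows, so it could in principle sit in front of the existing time windowing, partitioning, merging, and sorting steps.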