avshenuk opened a new issue, #16359:
URL: https://github.com/apache/pinot/issues/16359

   ### What needs to be done:
   Add upsert support to RealtimeToOfflineSegmentsTask to enable Pinot-managed 
offline flows for upsert-enabled realtime tables.
   
   ### Why the feature is needed:
   Currently, RealtimeToOfflineSegmentsTask doesn't support upsert tables, 
creating a gap in Pinot's managed offline flows for hybrid upsert 
architectures. 
   This prevents users from leveraging the efficient realtime-to-offline 
conversion pipeline when using upsert tables.
   
   This feature is particularly valuable as it enables a clean separation 
between realtime tables with upsert support (handling mutable, evolving data) 
and offline tables containing the "final" immutable state of segments. 
   This architecture allows users to benefit from upsert semantics during 
real-time ingestion while maintaining optimized, deduplicated offline segments 
for historical queries and analytics workloads.
   
   ### Proposed Implementation:
   1. **Upsert Segment Validation**: Validate segments using validDocIds 
metadata from servers with CRC checks and server state validation. Fail task 
generation when metadata is missing since time windows cannot be reprocessed, 
preventing silent data loss.
   2. **Invalid Record Filtering**: Integrate CompactedPinotSegmentRecordReader 
with the existing SegmentProcessorFramework to filter out invalid records 
during segment processing, naturally combining with time windowing, 
partitioning, merging, and sorting operations.
   
   ### Benefits:
   - Fills the gap for missing upsert support in Pinot's managed offline flow
   - Natural integration with existing segment processing pipeline (time 
filtering, sorting, etc.)
   - Maintains data consistency through proper validation and prevents silent 
data loss
   - Optimal timing for upsert compaction:
     Performs filtering during the realtime-to-offline conversion when data is 
already being processed, avoiding the need for separate compaction passes and 
maximizing efficiency
   - Operates independently of other upsert maintenance tasks with clear 
separation of concerns:
      - other upsert tasks (UpsertCompactionTask, UpsertCompactMergeTask) 
optimize realtime tables, while this task creates optimized segments in offline 
tables. This allows maintaining efficiency within realtime tables (if needed) 
while continuously updating offline tables with deduplicated historical data
   
   ### Backward Compatibility:
   This implementation maintains full backward compatibility for existing 
non-upsert tables. 
   The upsert-specific logic (validDocIds validation and 
CompactedPinotSegmentRecordReader filtering) is only applied when the table has 
upsert configuration enabled.
   For non-upsert tables, the RealtimeToOfflineSegmentsTask continues to 
operate with the exact same behavior as before, ensuring no breaking changes to 
existing workflows.
   
   ### Related Issues:
   - Related to #12261 discussion on hybrid table upsert configurations


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to