tarun11Mavani commented on issue #14588:
URL: https://github.com/apache/pinot/issues/14588#issuecomment-3145237878

   ### RFC: Commit-Time Compaction for Upsert Tables
   
   Today, users of upsert tables rely on the Compaction Task, executed on 
minions, to reduce disk footprint by compacting older segments. This 
proposal targets an earlier point in the segment lifecycle: it addresses the 
significant storage and processing overhead caused by invalid and 
soft-deleted records by removing them before a segment is committed as 
immutable.
   
   Currently, when a consuming segment commits, it retains all records, 
including obsolete and soft-deleted ones. This leads to:
   
   - Storage Inefficiency: Immutable segments hold substantially more physical 
records than logically valid ones.
   
   - Query Overhead: Query engines process and filter unnecessary data, 
impacting performance.
   
   - Compaction Dependency: Users rely on post-commit minion tasks for storage 
reclamation, adding operational complexity and delaying optimization.
   
   In high-velocity upsert scenarios, this can result in a physical-to-logical 
record ratio as high as 10:1, severely impacting storage costs and query 
performance.
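
To make the ratio concrete, here is a minimal arithmetic sketch. The function name `invalid_fraction` is hypothetical and is not part of Pinot; it only converts a physical and logical record count into the share of stored records that are obsolete.

```python
def invalid_fraction(physical_records: int, logical_records: int) -> float:
    """Share of stored records that are obsolete (hypothetical helper, not a Pinot API)."""
    if physical_records <= 0:
        raise ValueError("physical_records must be positive")
    if not 0 <= logical_records <= physical_records:
        raise ValueError("logical_records must be between 0 and physical_records")
    # Every physical record that is not logically valid is wasted storage.
    return (physical_records - logical_records) / physical_records


# A 10:1 physical-to-logical ratio: 10 stored records per valid record.
print(invalid_fraction(10, 1))
```

At a 10:1 ratio, nine out of every ten stored records are obsolete, which is the overhead this proposal aims to avoid writing in the first place.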
   
   To mitigate these challenges, we propose Commit-Time Compaction for Upsert 
Tables. In this model, invalid and obsolete records will be removed during the 
segment commit process, before the segment becomes immutable.
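
The idea can be sketched roughly as follows. This is an illustrative Python model, not Pinot code: the `Record` type, the `timestamp` comparison field, and both function names are assumptions made for this example, and dropping soft-deleted records at commit is likewise assumed here rather than taken from the RFC.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Record:
    # Hypothetical record shape, used only for this illustration.
    doc_id: int
    primary_key: str
    timestamp: int          # assumed comparison column
    deleted: bool = False   # assumed soft-delete marker


def latest_doc_ids(records: List[Record]) -> Set[int]:
    """Doc IDs of the newest record for each primary key."""
    latest: Dict[str, Record] = {}
    for record in records:
        current = latest.get(record.primary_key)
        # On equal timestamps the later-ingested record wins.
        if current is None or record.timestamp >= current.timestamp:
            latest[record.primary_key] = record
    return {record.doc_id for record in latest.values()}


def compact_on_commit(records: List[Record]) -> List[Record]:
    """Keep only valid, non-deleted records, preserving ingestion order."""
    valid = latest_doc_ids(records)
    return [r for r in records if r.doc_id in valid and not r.deleted]


consuming_segment = [
    Record(0, "a", 100),
    Record(1, "a", 200),                # supersedes doc 0
    Record(2, "b", 100),
    Record(3, "b", 300, deleted=True),  # soft delete of key "b"
    Record(4, "c", 100),
]
committed = compact_on_commit(consuming_segment)
print([r.doc_id for r in committed])  # [1, 4]
```

In this toy segment, five physical records shrink to two at commit time, so the immutable segment never stores the superseded or soft-deleted rows.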
   
   ## Key Benefits
   This approach aims to reduce the overall segment size and improve query 
efficiency by:
   
   - Immediate Storage Optimization: Up to 60-80% reduction in physical record 
count for high-update workloads.
   
   - Improved Query Performance
   
   - Reduced Memory and I/O Footprint 
   
   - Simplified Operations: Less reliance on post-commit compaction tasks, 
reducing operational complexity.
   
   
   The full RFC document is available here: 
https://docs.google.com/document/d/1kylGQScvvP7t2Yl2Tc-2iS8XGEKuI8J5D2lGcEaZhOM/edit?usp=sharing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

