tarun11Mavani commented on issue #14588: URL: https://github.com/apache/pinot/issues/14588#issuecomment-3145237878
### RFC: Commit-Time Compaction for Upsert Tables

Today, upsert tables rely on the Compaction Task, executed on minions, to reduce disk footprint by compacting older segments. This proposal targets a different optimization for upsert tables: the significant storage and processing overhead caused by invalid and soft-deleted records that are carried into immutable segments at commit time.

Currently, when a consuming segment commits, it retains all records, including obsolete and soft-deleted ones. This leads to:

- Storage Inefficiency: Immutable segments hold substantially more physical records than logically valid ones.
- Query Overhead: Query engines process and filter unnecessary data, impacting performance.
- Compaction Dependency: Users rely on post-commit minion tasks for storage reclamation, adding operational complexity and delaying optimization.

In high-velocity upsert scenarios, this can result in a physical-to-logical record ratio as high as 10:1, severely impacting storage costs and query performance.

To mitigate these challenges, we propose Commit-Time Compaction for Upsert Tables. In this model, invalid and obsolete records are removed during the segment commit process, before the segment becomes immutable.

## Key Benefits

This approach aims to reduce overall segment size and improve query efficiency through:

- Immediate Storage Optimization: Up to 60-80% reduction in physical record count for high-update workloads.
- Improved Query Performance
- Reduced Memory and I/O Footprint
- Simplified Operations: Less reliance on post-commit compaction tasks, reducing operational complexity.

Here is the RFC document: https://docs.google.com/document/d/1kylGQScvvP7t2Yl2Tc-2iS8XGEKuI8J5D2lGcEaZhOM/edit?usp=sharing

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
