Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Wed, 19 Nov 2025 01:29:00 -0800


pvary commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3551701479


   > Great point and example! You're right! I think there could be two solutions:
   > 
   > Solution 1: Group files by row ID continuity
   > 
   > Group files with continuous _row_id ranges, preserve IDs by setting merged file's firstRowId to match.
   
   This is typically not an option: it is very hard to find continuous stretches 
of row_ids.
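
To make the constraint in Solution 1 concrete, here is a minimal sketch of the grouping step. `FileRange` and `group_contiguous` are hypothetical names for illustration, not part of the PR or the Iceberg API. It assumes each data file covers the half-open range `[first_row_id, first_row_id + record_count)`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileRange:
    # Hypothetical stand-in for a data file's row-lineage metadata.
    path: str
    first_row_id: int
    record_count: int

    @property
    def end_row_id(self) -> int:
        # Exclusive upper bound of the row IDs covered by this file.
        return self.first_row_id + self.record_count


def group_contiguous(files: List[FileRange]) -> List[List[FileRange]]:
    """Group files whose row ID ranges follow each other without gaps."""
    groups: List[List[FileRange]] = []
    for f in sorted(files, key=lambda x: x.first_row_id):
        if groups and groups[-1][-1].end_row_id == f.first_row_id:
            # The file starts exactly where the previous one ends.
            groups[-1].append(f)
        else:
            # A gap (or the first file) starts a new group.
            groups.append([f])
    return groups


files = [
    FileRange("a.parquet", 0, 100),
    FileRange("c.parquet", 500, 10),
    FileRange("b.parquet", 100, 50),
]
groups = group_contiguous(files)
group_paths = [[f.path for f in g] for g in groups]
```

Only files in the same group could be merged while keeping their IDs, with the merged file taking the group's lowest `first_row_id`. Any gap in the ranges splits a group, which is the practical difficulty raised above.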
   
   > Solution 2: Write _row_id as physical column
   > 
   > Read virtual _row_id from each row and write as physical column.
   
   Can we do this without rewriting the row group? Do we still have gains 
compared to the "normal" read/write compaction?
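
For reference, a minimal sketch of the values Solution 2 would have to materialize. It assumes the row-lineage rule that a row without a stored `_row_id` inherits `first_row_id` plus its position in the file. The function names are hypothetical, and the sketch says nothing about whether the column can be added without rewriting the row group.

```python
from typing import Iterable, List, Tuple


def materialize_row_ids(first_row_id: int, record_count: int) -> List[int]:
    # Assumed rule: inherited _row_id = first_row_id + row position in the file.
    return [first_row_id + pos for pos in range(record_count)]


def merged_row_id_column(files: Iterable[Tuple[int, int]]) -> List[int]:
    """Build the physical _row_id column for a merged file.

    `files` holds (first_row_id, record_count) pairs in merge order.
    """
    column: List[int] = []
    for first_row_id, record_count in files:
        column.extend(materialize_row_ids(first_row_id, record_count))
    return column


# Two source files with non-contiguous ID ranges keep their original IDs.
column = merged_row_id_column([(0, 3), (1000, 2)])
```

Because the IDs are stored explicitly, the source ranges do not need to be contiguous. The cost is an extra column chunk in every merged row group.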
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


