Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Tue, 04 Nov 2025 07:45:42 -0800


shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3486667713


   Thanks @pvary for the detailed feedback! I've addressed your points:
   
     1. Decision at action level, not table properties:
     Done - Removed PARQUET_USE_FILE_MERGER from TableProperties entirely 
and stripped out the fallback logic. The code now checks only the action options.
   
     2. Early validation with proper fallback:
     Done - Flipped the logic around to validate up front with canUseMerger() 
before attempting the merge. Also beefed up the validation to check schema 
compatibility, not just file format. If anything fails, it logs and 
falls back to the standard rewrite.
   
     3. Planning creates expected sizes, runner doesn't split:
     Done - Nuked the whole groupFilesBySize() method. The runner now just 
merges whatever the planner gave it into a single file - no more re-grouping.
   
     4. Use Catalog/Table FileIO instead of HadoopFileIO:
     Done - Removed HadoopFileIO completely. Executors now just return path + 
size, and the driver reads metrics using table().io() which respects whatever 
FileIO the catalog configured.
   
     5. Static methods instead of object creation:
     Done - Converted ParquetFileMerger to a utility class with a private 
constructor and only static methods. No more new ParquetFileMerger() calls 
anywhere.
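   
   For illustration, points 2 and 5 could fit together roughly as below. This 
is a minimal sketch, not the code in the PR: MergeSketch, FileInfo, and 
rewrite() are hypothetical names, the schema check is reduced to string 
equality as a stand-in for a real compatibility check, and only 
canUseMerger() is a name taken from the comment above.

```java
import java.util.List;

// Hypothetical utility class: private constructor, static methods only.
final class MergeSketch {

  // Hypothetical stand-in for a data file; not an Iceberg type.
  static final class FileInfo {
    final String path;
    final String format;
    final String schema;

    FileInfo(String path, String format, String schema) {
      this.path = path;
      this.format = format;
      this.schema = schema;
    }
  }

  private MergeSketch() {
    // No instances: callers use the static methods directly.
  }

  // Validate up front: every file must be Parquet and share one schema.
  // Assumption: schema compatibility is modeled as plain string equality.
  static boolean canUseMerger(List<FileInfo> files) {
    if (files.isEmpty()) {
      return false;
    }
    String expectedSchema = files.get(0).schema;
    for (FileInfo file : files) {
      if (!"parquet".equalsIgnoreCase(file.format)) {
        return false;
      }
      if (!expectedSchema.equals(file.schema)) {
        return false;
      }
    }
    return true;
  }

  // Pick the strategy before doing any work; log and fall back otherwise.
  static String rewrite(List<FileInfo> files) {
    if (canUseMerger(files)) {
      return "row-group merge";
    }
    System.err.println("Merger not applicable, falling back to standard rewrite");
    return "standard rewrite";
  }
}
```

   In this sketch the check runs before any merge work starts, so a group 
with mixed formats or schemas goes to the standard rewrite without a 
half-finished merge to clean up.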
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

