shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3486667713
Thanks @pvary for the detailed feedback! I've addressed your points:
1. Decision at action level, not table properties:
Done - Removed PARQUET_USE_FILE_MERGER from TableProperties entirely
and stripped out the fallback logic. Now it only checks the action options.
2. Early validation with proper fallback:
Done - Flipped the logic around to validate upfront with canUseMerger()
before attempting the merge. Also beefed up the validation to actually check
schema compatibility, not just file format. If any check fails, it logs and
falls back to the standard rewrite.
3. Planning creates expected sizes, runner doesn't split:
Done - Nuked the whole groupFilesBySize() method. The runner now just
merges whatever the planner gave it into a single file - no more re-grouping.
4. Use Catalog/Table FileIO instead of HadoopFileIO:
Done - Removed HadoopFileIO completely. Executors now just return path +
size, and the driver reads metrics using table().io() which respects whatever
FileIO the catalog configured.
5. Static methods instead of object creation:
Done - Converted ParquetFileMerger to a utility class with a private
constructor and all static methods. No more new ParquetFileMerger() calls
anywhere.
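
For anyone skimming, here is a rough sketch of how points 1, 2 and 5 fit
together. It is not the code in the PR: MergerSketch, FileInfo, the option key
"use-file-merger" and the string-equality schema check are hypothetical
stand-ins I made up for illustration, and the real validation works on Parquet
schemas rather than strings.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a data file; only the fields the check needs.
final class FileInfo {
  final String path;
  final String format;
  final String schema;

  FileInfo(String path, String format, String schema) {
    this.path = path;
    this.format = format;
    this.schema = schema;
  }
}

// Utility class: private constructor, static methods only (point 5).
final class MergerSketch {
  // Hypothetical action option key; the real key is not shown in this comment.
  static final String USE_FILE_MERGER = "use-file-merger";

  private MergerSketch() {}

  // Up-front validation (point 2). Reads only the action options (point 1).
  static boolean canUseMerger(Map<String, String> options, List<FileInfo> group) {
    if (!Boolean.parseBoolean(options.getOrDefault(USE_FILE_MERGER, "false"))) {
      return false;
    }
    if (group.isEmpty()) {
      return false;
    }
    String expectedSchema = group.get(0).schema;
    for (FileInfo file : group) {
      // Assumption: "compatible" is simplified to "identical schema string".
      if (!"parquet".equalsIgnoreCase(file.format) || !expectedSchema.equals(file.schema)) {
        return false;
      }
    }
    return true;
  }

  // Validate first; log and fall back to the standard rewrite if validation fails.
  static String chooseStrategy(Map<String, String> options, List<FileInfo> group) {
    if (canUseMerger(options, group)) {
      return "merge";
    }
    System.err.println("File merger not applicable, falling back to standard rewrite");
    return "standard-rewrite";
  }
}
```

With the option enabled, a group of Parquet files sharing one schema takes the
"merge" path; a mixed-schema group, a non-Parquet file, or a missing option
takes "standard-rewrite".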
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]