shangxinli opened a new pull request, #14435:
URL: https://github.com/apache/iceberg/pull/14435

     ## Why this change?
   
     This implementation provides significant performance improvements for Parquet
     file merging operations by eliminating serialization/deserialization overhead.
     Benchmark results show **13x faster** file merging compared to traditional
     read-rewrite approaches.
   
     The change leverages an existing Parquet library capability (the
     `ParquetFileWriter.appendFile()` API) to perform zero-copy row-group merging,
     which makes it well suited to compaction and maintenance operations on large
     Iceberg tables.
   
     TODO: 1) Encrypted tables are not supported yet. 2) Schema evolution is not
     handled yet.
   
     ## What changed?
   
     - Added ParquetFileMerger class for row-group level file merging
       - Performs zero-copy merging using ParquetFileWriter.appendFile()
       - Validates schema compatibility across all input files
       - Supports merging multiple Parquet files into a single output file
     - Reuses existing Apache Parquet library functionality instead of a custom
       implementation
     - Strict schema validation ensures data integrity during merge operations
     - Added comprehensive error handling for schema mismatches
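
The strict schema check listed above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the PR's `ParquetFileMerger` code: the class `SchemaCheckSketch`, the method `requireIdenticalSchemas`, and the file names are hypothetical, and the generic type `S` stands in for whatever schema object is read from each input file's footer (for Parquet this would presumably be `MessageType`). The idea is to reject the merge before any row group is appended if a single input disagrees with the first file's schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of a strict pre-merge schema check; names are illustrative only.
final class SchemaCheckSketch {

  // Returns the schema shared by all inputs, or throws if any file's schema
  // differs from the first file's schema. S is assumed to implement equals().
  static <S> S requireIdenticalSchemas(Map<String, S> schemaByFile) {
    if (schemaByFile.isEmpty()) {
      throw new IllegalArgumentException("No input files to merge");
    }
    String firstFile = null;
    S expected = null;
    for (Map.Entry<String, S> entry : schemaByFile.entrySet()) {
      if (firstFile == null) {
        // The first file defines the schema every other file must match.
        firstFile = entry.getKey();
        expected = entry.getValue();
      } else if (!Objects.equals(expected, entry.getValue())) {
        throw new IllegalArgumentException(
            "Schema mismatch: " + entry.getKey() + " differs from " + firstFile);
      }
    }
    return expected;
  }

  public static void main(String[] args) {
    // Schemas are plain strings here purely for demonstration.
    Map<String, String> schemas = new LinkedHashMap<>();
    schemas.put("a.parquet", "message t { required int64 id; }");
    schemas.put("b.parquet", "message t { required int64 id; }");
    System.out.println(requireIdenticalSchemas(schemas));
  }
}
```

Failing on the first mismatch keeps the check cheap, since only file footers need to be read, and it matches the TODO above: files written under an evolved schema are rejected rather than reconciled.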
   
     ## Testing
   
     - Validated in staging test environment
     - Verified schema compatibility checks work correctly
     - Confirmed 13x performance improvement over traditional approach
     - Tested with various file sizes and row group configurations


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

