Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Thu, 27 Nov 2025 03:32:47 -0800


Guosmilesmile commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3585370871


   
   | Data Files | Records per Data File | Normal Merge (s) | Parquet Merge (s) |
   |-----------:|:---------------------:|-----------------:|------------------:|
   | 5          | 100*10000             | 4.7              | 0.9               |
   | 5          | 10*10000              | 1.5              | 1.0               |
   | 5          | 1*10000               | 0.8              | 0.9               |
   | 40         | 10*10000              | 6.7              | 4.4               |
   | 40         | 1*10000               | 4.1              | 3.6               |
   | 40         | 100                   | 3.4              | 3.4               |
   | 100        | 1000                  | 12.0             | 8.4               |
   
   I ran some tests comparing Parquet merge with normal merge. The Parquet merge 
version I used is the original one, without any lineage-related changes. In 
these results, when there are many small files, the performance advantage of 
Parquet merge is modest or absent; when files contain many rows, the advantage 
is significant (4.7 s vs. 0.9 s for 5 files of 100*10000 records each). I 
suspect that validating the schema and reading the footer for every file adds 
per-file overhead.
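
To make the comparison easier to read, here is a small sketch that turns the timings in the table above into speedup ratios (normal merge time divided by Parquet merge time). It assumes that "100*10000" means 1,000,000 records per file; the names `RESULTS`, `speedup`, and `summarize` are illustrative and not part of the PR.

```python
# Timings copied from the table above:
# (data files, records per file, normal merge seconds, Parquet merge seconds).
# Assumption: "100*10000" in the table means 1,000,000 records per file.
RESULTS = [
    (5, 1_000_000, 4.7, 0.9),
    (5, 100_000, 1.5, 1.0),
    (5, 10_000, 0.8, 0.9),
    (40, 100_000, 6.7, 4.4),
    (40, 10_000, 4.1, 3.6),
    (40, 100, 3.4, 3.4),
    (100, 1_000, 12.0, 8.4),
]


def speedup(normal_s, parquet_s):
    """Return normal / Parquet time; a ratio above 1 means Parquet merge was faster."""
    return normal_s / parquet_s


def summarize(results):
    """Return (files, records per file, total records, speedup rounded to 2 places)."""
    rows = []
    for files, records, normal_s, parquet_s in results:
        ratio = round(speedup(normal_s, parquet_s), 2)
        rows.append((files, records, files * records, ratio))
    return rows


if __name__ == "__main__":
    for files, records, total, ratio in summarize(RESULTS):
        print(f"{files:>4} files x {records:>9,} rows ({total:>10,} total): {ratio:.2f}x")
```

On these numbers the ratio ranges from about 0.89x (5 files of 10,000 records) to about 5.22x (5 files of 1,000,000 records), which matches the observation that the gain grows with rows per file.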
   
    I suggest that once the lineage part is ready, we add corresponding tests, 
because adding lineage will introduce more complexity.
   
   These were manual tests; results may vary and are for reference only.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

