[GitHub] [iceberg] maxdebayser opened a new pull request, #7831: Compute parquet stats

via GitHub Tue, 13 Jun 2023 11:10:07 -0700


maxdebayser opened a new pull request, #7831:
URL: https://github.com/apache/iceberg/pull/7831


   @Fokko 
   
   This commit partly addresses issue 
https://github.com/apache/iceberg/issues/7256. Unfortunately, the pyarrow 
library is not as flexible as we would like: when `write_statistics=True` is 
passed to `pyarrow.parquet.write_table`, the statistics are written out per row 
group rather than computed once for the whole file.
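
   To illustrate the gap between per-row-group and file-level statistics, 
here is a minimal sketch of folding one column's row-group statistics into a 
single file-level summary. It is not the code in this PR: `ColumnStats` and 
`merge_row_group_stats` are hypothetical names, and the sketch assumes the 
per-row-group min, max, and null count have already been read from the file.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class ColumnStats:
    """Hypothetical container for one column's statistics (not a pyarrow type)."""
    min: Optional[Any]
    max: Optional[Any]
    null_count: int


def merge_row_group_stats(groups: Iterable[ColumnStats]) -> ColumnStats:
    """Fold per-row-group statistics for one column into file-level statistics."""
    merged = ColumnStats(min=None, max=None, null_count=0)
    for g in groups:
        merged.null_count += g.null_count
        # A row group holding only nulls has no min/max to contribute.
        if g.min is not None:
            merged.min = g.min if merged.min is None else min(merged.min, g.min)
        if g.max is not None:
            merged.max = g.max if merged.max is None else max(merged.max, g.max)
    return merged


# Assumed example values for three row groups of the same column.
row_groups = [
    ColumnStats(min=3, max=9, null_count=0),
    ColumnStats(min=None, max=None, null_count=4),
    ColumnStats(min=-2, max=5, null_count=1),
]
file_stats = merge_row_group_stats(row_groups)
```

   Min, max, and null count combine exactly across row groups; statistics 
such as distinct counts do not, so they would need separate handling.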
   
   The issue mentions a "metadata_collector", which I assume is the 
parameter of the `pyarrow.parquet.write_metadata` function. The 
`pyarrow.parquet.write_table` function has no such parameter.
   
   The function in this PR intentionally works at the level of individual 
Parquet files rather than the whole dataset, to support scenarios such as 
writing from Ray, where each file of the dataset is written by a different task.


