syun64 commented on PR #506:
URL: https://github.com/apache/iceberg-python/pull/506#issuecomment-1994561440

   > We will replace file_path based partition inference with parquet metadata 
footer based partition inference. Currently we only support IdentityPartitions, 
and we can infer the partition values from the metadata footer's statistics 
(upper and lower bounds should be equal). This will also allow us to extend 
partition inference to numeric Transforms (YearTransform, etc) by 
applying the transforms on the lower and upper bounds.
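
   To make the quoted idea concrete, here is a minimal sketch of bounds-based inference. It is an illustration only: `ColumnBounds` and `infer_partition_value` are hypothetical names, not PyIceberg APIs, and the bounds are assumed to have already been read from the parquet footer. The only Iceberg-specific fact it relies on is that the year transform counts years since 1970.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable


@dataclass(frozen=True)
class ColumnBounds:
    """Hypothetical stand-in for one column's lower/upper footer statistics."""
    lower: Any
    upper: Any


def year_transform(value: date) -> int:
    # Iceberg's year transform is defined as years since 1970.
    return value.year - 1970


def infer_partition_value(
    bounds: ColumnBounds,
    transform: Callable[[Any], Any] = lambda v: v,  # identity by default
) -> Any:
    """Apply the transform to both bounds; they must agree for the file
    to belong to exactly one partition."""
    lower = transform(bounds.lower)
    upper = transform(bounds.upper)
    if lower != upper:
        raise ValueError(
            f"file spans more than one partition: {lower!r} != {upper!r}"
        )
    return lower
```

   With an identity partition the raw bounds have to be equal; with a year transform the raw bounds may differ as long as both fall in the same year.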
   
   I just realized that this approach won't work if we want to add files from 
Hive tables, because Hive-style partitioning produces parquet files that do 
not actually contain the partition columns; their values are inferred from 
the directory structure. I still think the suggested approach should be 
favored over file path inference where it is possible.
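
   For reference, Hive-style path inference amounts to reading `key=value` directory segments. A rough sketch, assuming a hypothetical helper name (`partition_values_from_path`) and an example path that is not taken from the PR; the values come back as strings and would still need casting to the partition column types:

```python
from typing import Dict
from urllib.parse import unquote


def partition_values_from_path(file_path: str) -> Dict[str, str]:
    """Collect key=value directory segments from a Hive-style file path."""
    values: Dict[str, str] = {}
    # Skip the last segment: it is the file name, not a partition directory.
    for segment in file_path.split("/")[:-1]:
        key, sep, raw = segment.partition("=")
        if sep and key:
            # Assumption: special characters in values are percent-encoded.
            values[key] = unquote(raw)
    return values


example = "s3://bucket/warehouse/events/year=2024/country=US/part-0000.parquet"
partitions = partition_values_from_path(example)
```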
   
   @Fokko , I'd love to get your opinion on the following:
   
   1. We will introduce two modes of add_files: Hive path partition inference, 
versus parquet metadata min/max based partition inference.
   2. To support Hive path partition inference mode, we will need to pass an 
exclusion list of partition columns that should be ignored in the DataFile 
stats in 
[compute_statistics_plan](https://github.com/syun64/iceberg-python/blob/70342ac83d2d1f121f3ab04c6d7317c8830fdad1/pyiceberg/io/pyarrow.py#L1500),
 so that the length of stats_columns matches 
parquet_metadata.num_columns.
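
   The exclusion list in point 2 could look roughly like the sketch below. This does not mirror the real `compute_statistics_plan` signature; `stats_column_names` and the column names are made up to show why the counts line up once the partition columns are dropped:

```python
from typing import Iterable, List, Set


def stats_column_names(
    schema_columns: Iterable[str],
    excluded_partition_columns: Set[str],
) -> List[str]:
    """Keep only the columns that are physically present in the data file."""
    return [
        name for name in schema_columns
        if name not in excluded_partition_columns
    ]


# The table schema includes the partition columns...
schema_columns = ["id", "payload", "year", "country"]
# ...but a Hive-written parquet file stores only the non-partition columns.
file_columns = ["id", "payload"]

stats_columns = stats_column_names(schema_columns, {"year", "country"})
```

   Without the exclusion, the stats plan would have four entries against a file reporting two columns.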
   
   These two modes cover some of the options that were discussed in the 
[initial discussion of the add_files migration 
procedure](https://github.com/apache/iceberg/issues/2068#issuecomment-773662070).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

