syun64 commented on PR #506: URL: https://github.com/apache/iceberg-python/pull/506#issuecomment-1994561440
> We will replace file_path based partition inference with parquet metadata footer based partition inference. Currently we only support IdentityPartitions, and we can infer the partition values from the metadata footer's statistics (upper and lower bounds should be equal). This will also allow us to extend partition inference to numeric Transforms (YearTransform, etc) by applying the transforms on the lower and upper bounds.

I just realized that this approach won't work if we want to add files from Hive tables, because Hive-style partitioning results in parquet files that do not actually have the partition data in them. The partition columns are inferred from the directory structure. But I think the suggested approach should be favored over file path inference wherever it is possible.

@Fokko, I'd love to get your opinion on the following:

1. We will introduce two modes of add_files: Hive path partition inference, and parquet metadata min/max based partition inference.
2. To support Hive path partition inference mode, we will need to pass an exclusion list of partition columns that should be ignored in the DataFile stats in [compute_statistics_plan](https://github.com/syun64/iceberg-python/blob/70342ac83d2d1f121f3ab04c6d7317c8830fdad1/pyiceberg/io/pyarrow.py#L1500), so that the length of stats_columns aligns with parquet_metadata.num_columns.

These two modes cover some of the options that were discussed in the [initial discussion of the add_files migration procedure](https://github.com/apache/iceberg/issues/2068#issuecomment-773662070).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
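The two inference modes proposed in the comment above could look roughly like the sketch below. It is an illustration only, not the PR's implementation: every function name here (`infer_partition_value`, `year_transform`, `parse_hive_partitions`, `stats_column_names`, `footer_bounds`) is hypothetical, and the percent-decoding of Hive path segments is an assumption about how the paths were written. Only `footer_bounds` needs pyarrow, and it imports it lazily.

```python
from datetime import date
from typing import Any, Callable, Dict, List, Tuple
from urllib.parse import unquote


def infer_partition_value(lower: Any, upper: Any,
                          transform: Callable[[Any], Any] = lambda v: v) -> Any:
    """Footer mode: apply the transform to both bounds; they must agree."""
    lo, hi = transform(lower), transform(upper)
    if lo != hi:
        raise ValueError(f"file spans multiple partition values: {lo!r} != {hi!r}")
    return lo


def year_transform(d: date) -> int:
    # Iceberg's year transform counts years from 1970.
    return d.year - 1970


def parse_hive_partitions(file_path: str) -> Dict[str, str]:
    """Hive mode: read key=value directory segments (values stay strings)."""
    partitions: Dict[str, str] = {}
    for segment in file_path.split("/")[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep and key:
            # Assumption: special characters in the path are percent-encoded.
            partitions[unquote(key)] = unquote(value)
    return partitions


def stats_column_names(schema_names: List[str],
                       partition_names: List[str]) -> List[str]:
    """Exclusion list: drop partition columns that are absent from the file."""
    excluded = set(partition_names)
    return [name for name in schema_names if name not in excluded]


def footer_bounds(parquet_path: str, column: str) -> Tuple[Any, Any]:
    """Collect min/max for one column across all row groups of a file."""
    import pyarrow.parquet as pq  # lazy import: only this helper needs pyarrow

    meta = pq.read_metadata(parquet_path)
    lower = upper = None
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for ci in range(row_group.num_columns):
            chunk = row_group.column(ci)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            if stats is None or not stats.has_min_max:
                raise ValueError(f"no min/max statistics for {column!r}")
            lower = stats.min if lower is None else min(lower, stats.min)
            upper = stats.max if upper is None else max(upper, stats.max)
    if lower is None:
        raise ValueError(f"column {column!r} not found in {parquet_path!r}")
    return lower, upper
```

In footer mode, the result of `footer_bounds(path, "some_column")` would be passed to `infer_partition_value`, optionally with a transform such as `year_transform`. In Hive mode, the keys returned by `parse_hive_partitions` would serve as the exclusion list handed to `stats_column_names`, so that the statistics columns line up with the columns physically present in the file.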
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org