syun64 opened a new pull request, #506: URL: https://github.com/apache/iceberg-python/pull/506
PyIceberg's version of Spark's `add_files` migration procedure. Some early ideas on its implementation:

- Instead of keeping the input interface of Spark's procedure, we could simply allow users to pass a list of full file paths.
- The current implementation infers the partition values from the path and doesn't validate that the files themselves contain those partition values. We could instead use statistics from the Parquet metadata to check the min and max values of the partition columns (min and max should be equal for a partition column) and use that value to derive the partition record value. Where the statistic is present, this would be more accurate than inferring the value through string matching on the partition path.
- Only identity transforms are currently supported. This is because, in order to construct the manifest entries for the data files from the partition path, we need to convert human-readable string values to their respective internal partition representations, which get encoded as the partition values in the Avro files. This is challenging for transformed partitions, since we would need a reverse transformation from the human-readable string to the partition representation for every supported IcebergType + Transform pair.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org