kevinjqliu commented on PR #1743: URL: https://github.com/apache/iceberg-python/pull/1743#issuecomment-2692351898

Hi @afiodorov, thanks for the PR! For adding hive-partitioned files to Iceberg, there's a specific way we can do so using column projections: https://iceberg.apache.org/spec/#column-projection. We've implemented the read side in #1443. We'd want to implement the write side as well, by overriding the `partition` field of the `DataFile` object in the manifest.

I think we need to define an API that does not involve regex. The example above is confusing for a user:

```python
import re

pattern = re.compile(r"([^/]+)=([^/]+)")

def deduct_partition(path: str) -> Record:
    return Record(**dict(pattern.findall(path)))

table.add_files(
    ['s3://bucket/table/year=2025/month=12/file.parquet'],
    check_schema=False,
    partition_deductor=deduct_partition,
)
```

IMO the `add_files` API should not infer the hive-style partition scheme. Perhaps we can create a different API to "migrate" a hive-style partition scheme to an Iceberg table. As long as we can create the proper `DataFile` object with the right `partition` field set, we can register these files to the Iceberg table.

pyarrow has a helper function, [pyarrow.dataset.HivePartitioning](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.HivePartitioning.html#), which can parse the hive-style partition scheme without relying on regex.

Please let me know what you think!
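To illustrate the "no regex" direction discussed above, here is a minimal pure-Python sketch of how hive-style `key=value` segments could be pulled out of a file path. The helper name `parse_hive_partition` is hypothetical, not part of the PyIceberg API; it only shows that simple string splitting suffices, before the values are coerced to the table's partition types:

```python
def parse_hive_partition(path: str) -> dict[str, str]:
    """Extract hive-style key=value path segments as raw strings.

    Hypothetical helper for illustration only; a real implementation
    would validate keys against the table's partition spec and cast
    values to the declared Iceberg types.
    """
    parts: dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

# parse_hive_partition("s3://bucket/table/year=2025/month=12/file.parquet")
# -> {"year": "2025", "month": "12"}
```

For typed parsing against a declared schema, `pyarrow.dataset.HivePartitioning` (constructed from a `pyarrow.Schema`, or via `pyarrow.dataset.partitioning(schema, flavor="hive")`) does the equivalent work while also casting values, which is likely the better building block for a migration API.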
Hi @afiodorov thanks for the PR! For adding hive partitioned files to Iceberg, there's a specific way we can do so using column projections, https://iceberg.apache.org/spec/#column-projection. We've implemented the read side in #1443. We'd want to implement the write side as well by overridding the `partition` field in data_file object in the manifest. I think we need to define an API that does not involve regex. The example above is confusing as a user ``` pattern = re.compile(r"([^/]+)=([^/]+)") def deduct_partition(path: str) -> Record: return Record(**dict(pattern.findall(path)) table.add_files(['s3://bucket/table/year=2025/month=12/file.parquet'], check_schema=False, partition_deductor=deduct_partition) ``` IMO the `add_files` API should not infer the hive-style partition scheme. Perhaps we can create a different API to "migrate" hive-style partition scheme to an Iceberg table. As long as we can create the proper `DataFile` object with the right `partition` field set, we can register these to the Iceberg Table. pyarrow has a helper function [pyarrow.dataset.HivePartitioning](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.HivePartitioning.html#) which can parse the hive-style partition scheme without relying in regex Please let me know what you think! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org