kevinjqliu commented on PR #1743:
URL: https://github.com/apache/iceberg-python/pull/1743#issuecomment-2692351898

   Hi @afiodorov thanks for the PR! 
   
   For adding hive partitioned files to Iceberg, there's a specific way we can 
do so using column projections, 
https://iceberg.apache.org/spec/#column-projection. We've implemented the read 
side in #1443. We'd want to implement the write side as well by overriding the 
`partition` field in the `DataFile` object in the manifest.
   
   I think we need to define an API that does not involve regex. The following 
example is confusing for a user:
   ```
   pattern = re.compile(r"([^/]+)=([^/]+)")
   
   def deduct_partition(path: str) -> Record:
       return Record(**dict(pattern.findall(path)))
   
   table.add_files(['s3://bucket/table/year=2025/month=12/file.parquet'], 
check_schema=False, partition_deductor=deduct_partition)
   ```
   
   IMO the `add_files` API should not infer the hive-style partition scheme. 
Perhaps we can create a separate API to "migrate" a hive-style partition scheme 
to an Iceberg table. As long as we can create the proper `DataFile` object with 
the right `partition` field set, we can register these files with the Iceberg table. 
   
   pyarrow has a helper class, 
[pyarrow.dataset.HivePartitioning](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.HivePartitioning.html#),
 which can parse the hive-style partition scheme without relying on regex
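   For example, a minimal sketch of that approach (the `year`/`month` schema here is just an assumption matching the path from the snippet above): `HivePartitioning.parse` turns a hive-style path into a typed filter expression, with types taken from the declared schema rather than guessed by a regex.
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # Declare the partition columns and their types up front.
   schema = pa.schema([("year", pa.int32()), ("month", pa.int32())])
   partitioning = ds.HivePartitioning(schema)

   # parse() reads the key=value segments and returns a typed expression,
   # e.g. one equivalent to (year == 2025) and (month == 12).
   expr = partitioning.parse("/year=2025/month=12")
   print(expr)
   ```
   From that parsed, typed result we could build the `partition` `Record` for the `DataFile` instead of asking users to supply a regex-based callback.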
   
   Please let me know what you think! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

