syun64 commented on issue #208: URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912495464
Right, as @jqin61 mentioned, if we only had to support **Transformed Partitions**, we could have employed a hack: add the partition column to the dataset, so that it gets consumed by the write_dataset API when we pass the column in pyarrow.dataset.partitioning. But we can't apply the same hack with **Identity Partitions**, where the HIVE partition scheme on the file path shares the same name as the partition column that needs to be persisted into the data file. Arrow does not allow two columns to share the same name, and this hack will lead to an exception on **write_dataset**.

So it sounds like we might be running out of options in using the existing APIs...

If we are in agreement that we need a new PyArrow API to optimally bucket sort the partitions and produce partitioned pyarrow tables or record batches to pass into [WriteTask](https://github.com/apache/iceberg-python/blob/cd7fb502900a717d6b902a398b267eb10e4faa9b/pyiceberg/table/__init__.py#L2234), do we see any value in introducing a simpler PyIceberg feature in the interim, where [write_file](https://github.com/apache/iceberg-python/blob/cd7fb502900a717d6b902a398b267eb10e4faa9b/pyiceberg/io/pyarrow.py#L1693) can support partitioned tables as long as the provided arrow_table only has a single partition of data?

I think introducing this first would have two upsides:

1. We decouple the work of supporting writes to partitioned tables (like handling partitions in file paths on write, and adding partition metadata to manifests) from the work of optimally sorting and bucketing an arrow table into target partitions.
2. If a user **really** needs to break down their in-memory pyarrow table into partitions, they can do so by using existing methods to filter on the partition column and produce a new pyarrow.Table.
**This isn't optimal**, especially if they have many partitions within the in-memory table, and is precisely the reason why @jqin61 is investigating the different options in bucket sorting by partition within Arrow/Arrow Datasets.