syun64 commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912495464

   Right, as @jqin61 mentioned, if we only had to support **Transformed 
Partitions**, we could have employed a hack of adding the partition column to the 
dataset, which gets consumed by the write_dataset API when we pass the column via 
pyarrow.dataset.partitioning.
   
   But we can't apply the same hack to **Identity Partitions**, where the 
partition field in the Hive-style file path shares its name with the partition 
column that also needs to be persisted into the data file. Arrow does not allow 
two columns to share the same name, so this hack leads to an exception in 
**write_dataset**.
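
   To make the hack concrete, here is a minimal sketch of what it looks like for a transformed partition. Column names, the output path, and the strftime-based "day" transform are all illustrative assumptions, not the actual PyIceberg code path:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Toy table; column names and the output directory are illustrative only.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "ts": pa.array(
        [1_700_000_000_000_000 + i * 86_400_000_000 for i in range(3)],
        type=pa.timestamp("us"),
    ),
})

# Transformed partition (think day(ts)): append a *differently named* derived
# column and let write_dataset consume it for the Hive-style path. The original
# `ts` column is still persisted inside the data files.
with_partition = table.append_column("ts_day", pc.strftime(table["ts"], format="%Y-%m-%d"))
ds.write_dataset(
    with_partition,
    base_dir="/tmp/transformed_demo",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("ts_day", pa.string())]), flavor="hive"),
)

# Identity partition on `id`: the path field and the column that must stay in the
# data file share the name `id`, so there is no second, differently named column
# to hand to write_dataset -- duplicating `id` is where the approach breaks down.
```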
   
   So it sounds like we might be running out of options with the existing APIs...
   
   If we agree that we need a new PyArrow API to optimally bucket-sort the 
partitions and produce partitioned pyarrow tables or record batches to pass into 
[WriteTask](https://github.com/apache/iceberg-python/blob/cd7fb502900a717d6b902a398b267eb10e4faa9b/pyiceberg/table/__init__.py#L2234),
 do we see any value in introducing a simpler PyIceberg feature in the interim, 
where 
[write_file](https://github.com/apache/iceberg-python/blob/cd7fb502900a717d6b902a398b267eb10e4faa9b/pyiceberg/io/pyarrow.py#L1693)
 can support partitioned tables as long as the provided arrow_table contains only 
a single partition of data?
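
   For that interim feature, the single-partition constraint could be enforced with a simple precondition check, something along these lines (a hypothetical helper for illustration, not the actual write_file signature):

```python
import pyarrow as pa
import pyarrow.compute as pc

def ensure_single_partition(arrow_table: pa.Table, partition_columns: list[str]) -> None:
    """Hypothetical guard: reject tables that span more than one partition."""
    for name in partition_columns:
        distinct = pc.unique(arrow_table[name])
        if len(distinct) > 1:
            raise ValueError(
                f"Expected a single partition of data, but partition column "
                f"'{name}' contains {len(distinct)} distinct values"
            )
```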
   
   I think introducing this first would have two upsides:
   1. We decouple the work of supporting writes to partitioned tables (handling 
partitions in file paths on write, adding partition metadata to manifests) from 
the work of optimally sorting and bucketing an arrow table into target partitions.
   2. If a user **really** needs to break down their in-memory pyarrow table 
into partitions, they can do so with existing methods: filter on the partition 
column and produce a new pyarrow.Table (a naive sketch of this follows below). 
**This isn't optimal**, especially if the in-memory table contains many 
partitions, which is precisely why @jqin61 is investigating the different options 
for bucket-sorting by partition within Arrow/Arrow Datasets.
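
   The "existing methods" route in upside 2 could look roughly like this: one filter pass per distinct partition value, which is exactly the inefficiency the eventual bucket-sorting API would avoid (the helper name and single-column assumption are mine):

```python
import pyarrow as pa
import pyarrow.compute as pc

def split_by_partition(table: pa.Table, partition_column: str) -> dict:
    """Naively split an in-memory table into one sub-table per partition value."""
    values = pc.unique(table[partition_column]).to_pylist()
    return {
        value: table.filter(pc.equal(table[partition_column], value))
        for value in values
    }

# Each resulting single-partition table could then be handed to a write_file
# that accepts partitioned tables holding exactly one partition of data.
```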

