jqin61 commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1889656103

   >How are we going to fan out the writing of the data. We have an Arrow 
table, what is an efficient way to compute the partitions and scale out the 
work. For example, are we going to sort the table on the partition column and 
do a full pass through it? Or are we going to compute all the affected 
partitions, and then scale out?
   
   It just occurred to me that when Spark writes to Iceberg, it requires the input 
dataframe to be sorted by the partition value; otherwise an error is raised 
during writing. Do we want to make the same assumption for PyIceberg?
   
   If not, and we have to use arrow.compute.filter() to extract each partition 
before serialization, then a global sort of the entire table before the filter() 
seems unnecessary, since filter() makes no assumption about how the array is 
organized. It would be helpful to have an API that uses a single scan to bucket 
an Arrow table by partition and returns the buckets (partitions) as a list of 
Arrow tables.
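   
   To illustrate the trade-off, here is a rough sketch of the two approaches (the 
function names are mine, not a pyiceberg API; it assumes a recent PyArrow with 
`Table.sort_by`, `Table.filter`, and `compute.unique`):
   
```python
import pyarrow as pa
import pyarrow.compute as pc

def split_by_filter(table: pa.Table, column: str) -> dict:
    # One filter() pass per distinct partition value: roughly O(P * N)
    # row comparisons for P partitions over N rows, but no sort is
    # needed and the input row order is preserved within each bucket.
    parts = {}
    for value in pc.unique(table[column]).to_pylist():
        parts[value] = table.filter(pc.equal(table[column], value))
    return parts

def split_by_sort(table: pa.Table, column: str) -> dict:
    # Single sort, then zero-copy slices at the run boundaries of the
    # partition column: O(N log N) once, independent of P.
    sorted_tbl = table.sort_by(column)
    keys = sorted_tbl[column].to_pylist()
    parts, start = {}, 0
    for i in range(1, len(keys) + 1):
        if i == len(keys) or keys[i] != keys[start]:
            parts[keys[start]] = sorted_tbl.slice(start, i - start)
            start = i
    return parts

# Hypothetical usage:
tbl = pa.table({"event_date": ["2024-01-01", "2024-01-02", "2024-01-01"],
                "value": [1, 2, 3]})
print({k: v.num_rows for k, v in split_by_sort(tbl, "event_date").items()})
# {'2024-01-01': 2, '2024-01-02': 1}
```
   
   Neither of these is the single-scan bucket split I am asking about; they are 
just the two workarounds available with the current Arrow API as far as I can tell.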


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

