jqin61 commented on issue #208: URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1889656103

> How are we going to fan out the writing of the data? We have an Arrow table; what is an efficient way to compute the partitions and scale out the work? For example, are we going to sort the table on the partition column and do a full pass through it? Or are we going to compute all the affected partitions and then scale out?

It occurred to me that when Spark writes to Iceberg, it requires the input dataframe to be sorted by the partition value; otherwise an error is raised during writing. Do we want to make the same assumption for pyiceberg? If not, and we have to use `pyarrow.compute.filter()` to extract each partition before serialization, then a global sort of the entire table before the `filter()` seems unnecessary, since `filter()` makes no assumption about how the array is organized. It would be helpful to have an API that uses a single scan to bucket-sort an Arrow table into partitions and returns the buckets (partitions) as a list of Arrow arrays.
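For illustration, here is a minimal sketch of the filter-per-partition approach described above. The helper name `split_by_partition` and its return shape are assumptions for this example, not an existing pyiceberg API; it uses only `pyarrow.compute.unique`, `pyarrow.compute.equal`, and `pyarrow.Table.filter`.

```python
import pyarrow as pa
import pyarrow.compute as pc


def split_by_partition(table: pa.Table, partition_col: str) -> dict:
    """Split an Arrow table into per-partition tables without a global sort.

    Hypothetical helper for illustration only; not a pyiceberg API.
    """
    # One pass over the column to find the distinct partition values.
    partition_values = pc.unique(table[partition_col])

    buckets = {}
    for value in partition_values:
        # filter() makes no assumption about row order, so no sort is needed,
        # but this does re-scan the table once per partition.
        mask = pc.equal(table[partition_col], value)
        buckets[value.as_py()] = table.filter(mask)
    return buckets
```

Note that this still scans the table once per distinct partition value, which is exactly the cost a single-scan bucket-sort API would avoid.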
>How are we going to fan out the writing of the data. We have an Arrow table, what is an efficient way to compute the partitions and scale out the work. For example, are we going to sort the table on the partition column and do a full pass through it? Or are we going to compute all the affected partitions, and then scale out? It just comes to me that when spark writes to iceberg, it requires the input dataframe to be sorted by the partition value otherwise an error will be raised during writing. Do we want to take the same assumption for pyiceberg? If not, if we have to use arrow.compute.filter() to extract each partition before serialization, it seems a global sort before the filter() on the entire table is unnecessary since the filter makes no assumption of the array organization? It would be helpful if we have an API which uses one scan to bucket sort an arrow array into partitions and return buckets (partitions) as a list of arrow arrays. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org