Fokko commented on issue #208: URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1889891973
I currently see two approaches:

- First get the unique partitions, and then filter the relevant data for each partition. It is nice that we know the partitions upfront, but the downside is that we need a pass through the dataframe per partition to get its rows.
- Sort the entire dataframe on the partition columns, which is expensive, but we only need to do it once. Once it is sorted, we can do a single pass through the dataframe. If we want to scale out, we can split the rows: `[0, n/2], [n/2, n]`.

I'm not sure which one is best. I think the first one works better if you have few partitions, and the latter is more efficient when you have many partitions. A rough sketch of both options is below.

> To extract the partitions by filter(), would it be helpful if we first build an API in pyarrow that does a full scan of the array, bucket-sorts it into partitions, and returns the buckets (partitions) as a list of arrow arrays? These arrays could then be passed as input to writing jobs executed in a multi-threaded way.

Starting with the API is always a great idea. My only concern is that we make sure we don't take copies of the data, since that might blow up the memory quite quickly.

Hope this helps!
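A minimal, untested sketch of what the two approaches could look like with plain PyArrow; the function names `split_by_filter` and `split_by_sort` are illustrative only, not an existing API:

```python
import pyarrow as pa
import pyarrow.compute as pc


def split_by_filter(table: pa.Table, partition_cols: list[str]) -> dict[tuple, pa.Table]:
    """Approach 1: find the unique partition tuples, then filter the table once per partition."""
    # group_by with no aggregations yields the distinct partition tuples.
    unique = table.select(partition_cols).group_by(partition_cols).aggregate([])
    partitions = {}
    for row in unique.to_pylist():
        # Build a combined boolean mask for this partition tuple.
        mask = None
        for col, value in row.items():
            col_mask = pc.equal(table[col], pa.scalar(value))
            mask = col_mask if mask is None else pc.and_(mask, col_mask)
        # Note: filter() materializes new arrays, so each partition is a copy.
        partitions[tuple(row.values())] = table.filter(mask)
    return partitions


def split_by_sort(table: pa.Table, partition_cols: list[str]) -> dict[tuple, pa.Table]:
    """Approach 2: sort once on the partition columns, then slice out contiguous runs."""
    sorted_table = table.sort_by([(col, "ascending") for col in partition_cols])
    keys = sorted_table.select(partition_cols).to_pylist()
    partitions = {}
    start = 0
    for i in range(1, len(keys) + 1):
        if i == len(keys) or keys[i] != keys[start]:
            # Table.slice is zero-copy: it only adjusts offsets into the same buffers.
            partitions[tuple(keys[start].values())] = sorted_table.slice(start, i - start)
            start = i
    return partitions
```

One thing worth noting regarding the memory concern: `Table.filter` materializes new buffers per partition, while `Table.slice` on the sorted table is zero-copy, so the second approach may be friendlier on memory even though the sort itself is not free.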