Fokko commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1889891973

   I currently see two approaches:
   
   - First get the unique partitions, then filter the relevant data for each partition. It is nice that we know the partitions upfront, but the downside is that we need a pass through the dataframe for every partition to collect its rows (see the first sketch below).
   - Sort the entire dataframe on the partition columns, which is expensive, but we only need to do it once. Once it is sorted, we can do a single pass through the dataframe. If we want to scale out, we can split the rows: `[0, n/2], [n/2, n]` (see the second sketch below).
   
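   A minimal sketch of the first approach in PyArrow (the table and the `day` partition column are just illustrative):

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   # Hypothetical input with a single partition column `day`.
   table = pa.table({
       "day": ["2024-01-01", "2024-01-02", "2024-01-01"],
       "value": [1, 2, 3],
   })

   # Find the unique partition values, then filter once per partition.
   for day in pc.unique(table["day"]).to_pylist():
       part = table.filter(pc.equal(table["day"], day))  # one pass per value
       # write `part` out as the data file(s) for this partition ...
   ```
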
   I'm not sure which is best. I think the first one works better if you have few partitions, and the latter one is more efficient when you have many partitions.
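
   And a sketch of the second approach: sort once on the partition column(s), then cut the sorted table into contiguous, zero-copy slices at the partition boundaries (same illustrative `day` column as above):

   ```python
   import pyarrow as pa

   table = pa.table({
       "day": ["2024-01-02", "2024-01-01", "2024-01-01"],
       "value": [2, 1, 3],
   })

   # One (expensive) sort; afterwards rows of the same partition are contiguous.
   sorted_table = table.sort_by([("day", "ascending")])

   # Single pass over the sorted key column to find the partition boundaries.
   days = sorted_table["day"]
   start = 0
   for end in range(1, len(sorted_table) + 1):
       if end == len(sorted_table) or days[end].as_py() != days[start].as_py():
           part = sorted_table.slice(start, end - start)  # zero-copy view
           # write `part` out as the data file(s) for this partition ...
           start = end
   ```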
   
   > To extract the partitions by filter(), would it be helpful if we first build an API in PyArrow that does a full scan of the array, bucket-sorts it into partitions, and returns the buckets (partitions) as a list of Arrow arrays? These arrays could then be passed as input to writing jobs executed in a multi-threaded way.
   
   Starting with the API is always a great idea. My only concern is that we should make sure we don't take copies of the data, since that could blow up memory quite quickly.
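
   For reference, a rough illustration of that concern, assuming the buckets would otherwise be produced with `filter()`/`take()`:

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   table = pa.table({"day": ["a", "a", "b"], "value": [1, 2, 3]})

   # slice() hands back a zero-copy view onto the parent table's buffers.
   view = table.slice(0, 2)

   # filter()/take() materialise new buffers, so producing one filtered table
   # per partition keeps a second copy of the data alive while it is written.
   copy = table.filter(pc.equal(table["day"], "a"))
   ```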
   
   Hope this helps!

