syun64 commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912560771

   Another approach we could take, if we want to use existing PyArrow 
functions, is:
   1. Call table.sort_by on all partition columns.
   2. Take another pass through the sorted table to find the starting row 
index of each distinct combination of partition values.
   3. Call table.slice(offset, length) with the indices generated above and 
write out the slices using List[WriteTask] in write_file.
   
   If there were an existing PyArrow API that gave us the outcome of (1) + (2) 
in one pass, that would be optimal, but there doesn't seem to be one... so I 
think taking just one more pass to find the indices is maybe not the worst 
idea. 
   
   We could also argue that (1) should be a requirement that we check on the 
provided PyArrow table, rather than running the sort within the PyIceberg API.
   
   Please let me know your thoughts!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

