syun64 commented on issue #208: URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912560771
Maybe another approach we could take, if we want to use existing PyArrow functions, is:

1. table.sort_by (all partitions)
2. Figure out the row index for each permutation of partition groups by taking another pass through the table
3. Use table.slice(index, length) with the indexes generated above to write out the tables using List[WriteTask] in write_file

If there were an existing PyArrow API that gave us the outcome of (1) + (2) in one pass, that would have been optimal, but it seems like there isn't one... so I think taking just one more pass to find the indices is maybe not the worst idea.

We could also argue that (1) should be a requirement that we check on the provided PyArrow table, rather than running the sort within the PyIceberg API.

Please let me know your thoughts!

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org