syun64 commented on issue #208: URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912742015
@jqin61 and I discussed this a great deal offline, and we just wanted to follow up on step (2). If we want to use existing PyArrow functions, I think we could use a two-pass algorithm to figure out the row range of each partition group in a partition-sorted PyArrow table:

```python
import pyarrow as pa
import pyarrow.compute

# PyArrow table, which is already sorted by the partition columns 'year' and 'animals'
pylist = [
    {'year': 2021, 'animals': 'Bear'},
    {'year': 2021, 'animals': 'Bear'},
    {'year': 2021, 'animals': 'Centipede'},
    {'year': 2022, 'animals': 'Cat'},
    {'year': 2022, 'animals': 'Flamingo'},
    {'year': 2023, 'animals': 'Dog'},
]
table = pa.Table.from_pylist(pylist)

# First pass: assert that the table is sorted by checking that sort_indices is in order
pa.compute.sort_indices(table, sort_keys=[('year', 'ascending'), ('animals', 'ascending')])
# <pyarrow.lib.UInt64Array object at 0x7fe9b0f4c340>
# [0, 1, 2, 3, 4, 5]

# Second pass: sort by the same keys in the opposite order and inspect the
# indices to find the offset and length of each partition group. Each maximal
# run of consecutive ascending indices is one partition group: the smallest
# index in the run is the group's starting offset, and the run length is the
# group's length.
pa.compute.sort_indices(table, sort_keys=[('year', 'descending'), ('animals', 'descending')])
# <pyarrow.lib.UInt64Array object at 0x7fe9b0f3ff40>
# [5, 4, 3, 2, 0, 1]

# From the above, we get the following partition groups as (offset, length) pairs:
partition_slices = [(0, 2), (2, 1), (3, 1), (4, 1), (5, 1)]
for offset, length in partition_slices:
    write_file(iceberg_table, iter([WriteTask(write_uuid, next(counter), table.slice(offset, length))]))
```

Then, how do we handle transformed partitions? Going back to the previous idea, I think we could create intermediate helper columns that hold the transformed partition values and use them for sorting.
We can keep track of these helper columns and make sure they are dropped after we use the algorithm above to split the table into partition slices. Regardless of whether we choose to support sorting within the PyIceberg write API or make it a requirement on the input, maybe we can create a helper function that takes the PartitionSpec of the Iceberg table and the PyArrow table and ensures that the table is **sortedByPartition** using the method above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org