syun64 commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912742015
@jqin61 and I discussed this a great deal offline, and we just wanted to
follow up on step (2). If we wanted to use existing PyArrow functions, I think
we could use a two-pass algorithm to figure out the row offset and length of each
partition group in a partition-sorted PyArrow table:
```
import pyarrow as pa
import pyarrow.compute as pc

# A PyArrow table that is already sorted by the partitions 'year' and 'animals'
pylist = [
    {'year': 2021, 'animals': 'Bear'},
    {'year': 2021, 'animals': 'Bear'},
    {'year': 2021, 'animals': 'Centipede'},
    {'year': 2022, 'animals': 'Cat'},
    {'year': 2022, 'animals': 'Flamingo'},
    {'year': 2023, 'animals': 'Dog'},
]
table = pa.Table.from_pylist(pylist)

# Pass 1: assert that the table is sorted by checking that sort_indices
# is the identity permutation
pc.sort_indices(table, sort_keys=[('year', 'ascending'), ('animals', 'ascending')])
# -> [0, 1, 2, 3, 4, 5]

# Pass 2: sort by the same keys in the opposite order and inspect the indices
# to figure out the offset and length of each partition group. Because the sort
# is stable, each ascending run of indices is one partition group: the run's
# first index is the group's starting offset, and the run's length is the
# group's length. A new run starts wherever a smaller index follows a larger one.
pc.sort_indices(table, sort_keys=[('year', 'descending'), ('animals', 'descending')])
# -> [5, 4, 3, 2, 0, 1]

# From the above, we get the following partition groups as (offset, length):
partition_slices = [(0, 2), (2, 1), (3, 1), (4, 1), (5, 1)]
for offset, length in partition_slices:
    write_file(iceberg_table, iter([WriteTask(write_uuid, next(counter), table.slice(offset, length))]))
```
Then, how do we handle transformed partitions? Going back to the
previous idea, we could create intermediate helper columns that hold the
transformed partition values and sort on those. We would keep
track of these helper columns and make sure to drop them after using the
above algorithm to split up the table by partition slices.
Regardless of whether we choose to support sorting within the PyIceberg
write API or make it a requirement on the caller, maybe we can create a helper
function that takes the PartitionSpec of the Iceberg table and the PyArrow table
and verifies that the table is **sortedByPartition** by using the above method.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]