syun64 commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1912742015

   @jqin61 and I discussed this at length offline, and we wanted to follow up 
on step (2). If we want to use existing PyArrow functions, I think we could use 
a two-pass algorithm to find the row offset and length of each partition group 
in a partition-sorted pyarrow table:
   
   ```
   import pyarrow as pa
   import pyarrow.dataset
   import pyarrow.compute
   
   # pyarrow table, already sorted by the partition columns 'year' and 'animals'
   pylist = [
       {'year': 2021, 'animals': 'Bear'},
       {'year': 2021, 'animals': 'Bear'},
       {'year': 2021, 'animals': 'Centipede'},
       {'year': 2022, 'animals': 'Cat'},
       {'year': 2022, 'animals': 'Flamingo'},
       {'year': 2023, 'animals': 'Dog'},
   ]
   table = pa.Table.from_pylist(pylist)
   
   # First pass: check that the table is sorted by verifying that the ascending
   # sort indices come back as the identity permutation (0, 1, ..., n-1)
   pa.compute.sort_indices(table, sort_keys=[('year', 'ascending'), ('animals', 'ascending')])
   <pyarrow.lib.UInt64Array object at 0x7fe9b0f4c340>
   [
     0,
     1,
     2,
     3,
     4,
     5
   ]
   
   # Second pass: sort by the same partition keys in the *opposite* (descending)
   # order and inspect the resulting indices. Because the sort is stable, rows
   # of the same partition group keep their original ascending positions, so
   # each maximal run of consecutive ascending indices is one partition group:
   # the first index of the run is the group's starting offset, and the run
   # length is the number of rows in the group.
   pa.compute.sort_indices(table, sort_keys=[('year', 'descending'), ('animals', 'descending')])
   <pyarrow.lib.UInt64Array object at 0x7fe9b0f3ff40>
   [
     5,
     4,
     3,
     2,
     0,
     1
   ]
   
   # from the descending indices [5, 4, 3, 2, 0, 1] we get the ascending runs
   # [5], [4], [3], [2], [0, 1], i.e. the following (offset, length) slices:
   partition_slices = [(0, 2), (2, 1), (3, 1), (4, 1), (5, 1)]
   
   for offset, length in partition_slices:
       write_file(iceberg_table, iter([WriteTask(write_uuid, next(counter), table.slice(offset, length))]))
   ```
   
   Then, how do we handle transformed partitions? Going back to the previous 
idea, we could create intermediate helper columns holding the transformed 
partition values and sort on those. We would keep track of these helper columns 
and make sure to drop them after using the algorithm above to split the table 
into partition slices.
   
   Regardless of whether we choose to support sorting within the PyIceberg 
write API or make it a requirement on the caller, maybe we can create a helper 
function that takes the iceberg table's PartitionSpec and the pyarrow table and 
verifies that the table is **sortedByPartition** using the method above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

