[PR] Arrow: Avoid buffer-overflow by avoid doing a sort [iceberg-python]

via GitHub Mon, 20 Jan 2025 03:59:57 -0800


Fokko opened a new pull request, #1539:
URL: https://github.com/apache/iceberg-python/pull/1539


   This was already discussed earlier here: 
https://github.com/apache/iceberg-python/issues/208#issuecomment-1889891973
   
   This PR replaces the sort followed by a single pass over the table with an 
approach that first determines the unique partition tuples and then filters 
the table on each tuple individually.
   
   Fixes https://github.com/apache/iceberg-python/issues/1491
   
   The sort caused buffers to be joined, and the joined buffer could overflow 
in Arrow. I think this is an issue on the Arrow side: it should automatically 
break the data up into smaller buffers. The `combine_chunks` method does this 
correctly.
   
   Now:
   
   ```
   Run 0 took: 0.42877754200890195
   Run 1 took: 0.2507691659993725
   Run 2 took: 0.24833179199777078
   Run 3 took: 0.24401691700040828
   Run 4 took: 0.2419595829996979
   Average runtime of 0.28 seconds
   ```
   
   Before:
   
   ```
   Run 0 took: 1.0768639159941813
   Run 1 took: 0.8784021250030492
   Run 2 took: 0.8486490420036716
   Run 3 took: 0.8614017910003895
   Run 4 took: 0.8497851670108503
   Average runtime of 0.9 seconds
   ```
   
   So it comes with a nice speedup as well :)
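
Timings in this shape could be collected with a small harness like the one below. This is a hypothetical sketch: the PR does not show the benchmark script, so the `benchmark` helper and the stand-in workload are assumptions that only mirror the output format above.

```python
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], runs: int = 5) -> List[float]:
    """Hypothetical harness: time `fn` several times and report the mean."""
    timings = []
    for i in range(runs):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        print(f"Run {i} took: {elapsed}")
        timings.append(elapsed)
    print(f"Average runtime of {round(sum(timings) / runs, 2)} seconds")
    return timings


# Stand-in workload; in practice this would be the partitioned write under test.
timings = benchmark(lambda: sorted(range(100_000), reverse=True))
```

The first run is often slower than the rest because of warm-up effects, so it can be worth looking at the individual runs as well as the average.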


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


