psavalle commented on PR #1995: URL: https://github.com/apache/iceberg-python/pull/1995#issuecomment-3288293547
@koenvo @Fokko from what I understand, with these changes `ArrowScan.to_record_batches` now eagerly uses 32 threads to pre-fetch all of the data. This defeats the purpose of returning an `Iterator[pa.RecordBatch]` and differs from the previous behavior, which only read data files as needed while the caller iterated over the result. I have observed significantly higher CPU and memory usage since upgrading to `pyiceberg 0.10`, and I suspect this is the reason.

The new behavior seems fine for `ArrowScan.to_table`, which needs all the data anyway, but not for `ArrowScan.to_record_batches`. Can the previous behavior be restored, or is there a different public API that could be used instead of `to_record_batches` to read batches incrementally?
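To illustrate the behavioral difference, here is a minimal sketch. `read_file`, `lazy_batches`, and `eager_batches` are hypothetical stand-ins, not pyiceberg's actual internals; the 32-worker default just mirrors the thread count mentioned above:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def read_file(path: str) -> List[str]:
    # Stand-in for reading one data file into record batches.
    print(f"reading {path}")
    return [f"batch from {path}"]


def lazy_batches(paths: List[str]) -> Iterator[str]:
    # Previous behavior: a file is read only when the consumer
    # advances the iterator, so I/O and memory stay proportional
    # to how far the caller actually iterates.
    for path in paths:
        yield from read_file(path)


def eager_batches(paths: List[str], workers: int = 32) -> Iterator[str]:
    # New behavior: every file is submitted to the thread pool up
    # front (Executor.map schedules all tasks immediately), so all
    # files are fetched even if the caller stops after one batch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batches in pool.map(read_file, paths):
            yield from batches


paths = [f"file_{i}.parquet" for i in range(5)]
next(lazy_batches(paths))   # reads only file_0.parquet
next(eager_batches(paths))  # schedules reads for all five files
```

With the eager variant, taking even a single batch triggers I/O for every file, which would match the CPU and memory growth described above.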
