psavalle commented on PR #1995: URL: https://github.com/apache/iceberg-python/pull/1995#issuecomment-3288293547
@koenvo @Fokko from what I understand, with these changes `ArrowScan.to_record_batches` now eagerly uses 32 threads to pre-fetch all of the data. This defeats the purpose of returning an `Iterator[pa.RecordBatch]` and differs from the previous behavior, which only read data files as needed while the caller iterated over the result. I have observed significantly higher CPU and memory usage since upgrading to `pyiceberg 0.10`, and I suspect this is the reason.

The new behavior seems fine for `ArrowScan.to_table`, which needs all the data anyway, but not for `ArrowScan.to_record_batches`. Can the previous behavior be restored, or is there a different public API that could be used instead of `to_record_batches` to read batches incrementally?
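To illustrate the behavioral difference, here is a minimal sketch. `read_file`, `lazy_batches`, and `eager_batches` are hypothetical stand-ins, not pyiceberg's actual internals; the 32-worker default just mirrors the thread count mentioned above:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def read_file(path: str) -> List[str]:
    # Stand-in for reading one data file into record batches.
    print(f"reading {path}")
    return [f"batch from {path}"]


def lazy_batches(paths: List[str]) -> Iterator[str]:
    # Previous behavior: a file is read only when the consumer
    # advances the iterator, so I/O and memory stay proportional
    # to how far the caller actually iterates.
    for path in paths:
        yield from read_file(path)


def eager_batches(paths: List[str], workers: int = 32) -> Iterator[str]:
    # New behavior: every file is submitted to the thread pool up
    # front (Executor.map schedules all tasks immediately), so all
    # files are fetched even if the caller stops after one batch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batches in pool.map(read_file, paths):
            yield from batches


paths = [f"file_{i}.parquet" for i in range(5)]
next(lazy_batches(paths))   # reads only file_0.parquet
next(eager_batches(paths))  # schedules reads for all five files
```

With the eager variant, taking even a single batch triggers I/O for every file, which would match the CPU and memory growth described above.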
