robreeves commented on code in PR #3046:
URL: https://github.com/apache/iceberg-python/pull/3046#discussion_r2823390913


##########
pyiceberg/io/pyarrow.py:
##########
@@ -1710,6 +1685,76 @@ def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> dict[st
     return deletes_per_file
 
 
+_QUEUE_SENTINEL = object()
+
+
+def _bounded_concurrent_batches(
+    tasks: list[FileScanTask],
+    batch_fn: Callable[[FileScanTask], Iterator[pa.RecordBatch]],
+    concurrent_files: int,
+    max_buffered_batches: int = 16,

Review Comment:
   I don't think this should be hardcoded. As written, the only controls a user has over memory are lowering the batch size or producing files with fewer than 16 batches, plus setting `concurrent_files`. It feels like they have to get too far into the weeds to tune this.
   
   Instead, WDYT of consolidating `max_buffered_batches` and `concurrent_files` into a single `parallelism` config? It would be used for both the max worker count and the number of buffered batches, so in the end the caller still gets the same number of batches loaded in parallel.
   
   One downside is that it could leave a large number of files open if the user wants high parallelism. A simpler alternative could be to just expose `max_buffered_batches` so a user can set it. A rough sketch of the consolidated version is below.
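   
   To make that concrete, here is a rough sketch (not the PR's code) of what the consolidated knob could look like. Only `_bounded_concurrent_batches`, `batch_fn`, and `_QUEUE_SENTINEL` come from the diff above; the `parallelism` parameter, the producer/consumer wiring, and the error/shutdown handling are my assumptions for illustration:
   
   ```python
   import queue
   import threading
   from concurrent.futures import ThreadPoolExecutor
   from typing import Callable, Iterator, List
   
   import pyarrow as pa
   
   _QUEUE_SENTINEL = object()
   
   
   def _bounded_concurrent_batches(
       tasks: List[object],  # list[FileScanTask] in the real code
       batch_fn: Callable[[object], Iterator[pa.RecordBatch]],
       parallelism: int,
   ) -> Iterator[pa.RecordBatch]:
       """Read up to `parallelism` files concurrently while buffering at most
       `parallelism` decoded batches, so a single knob bounds both."""
       buffer: queue.Queue = queue.Queue(maxsize=parallelism)
       errors: List[BaseException] = []
       stop = threading.Event()
   
       def _produce(task: object) -> None:
           # Block on `put` once the buffer is full; this is what caps memory at
           # roughly `parallelism * batch size`. Poll so producers can bail out
           # if the consumer abandons the generator early.
           for batch in batch_fn(task):
               while not stop.is_set():
                   try:
                       buffer.put(batch, timeout=0.1)
                       break
                   except queue.Full:
                       continue
               if stop.is_set():
                   return
   
       executor = ThreadPoolExecutor(max_workers=parallelism)
       futures = [executor.submit(_produce, task) for task in tasks]
   
       def _finish() -> None:
           # Collect any producer failure, then always unblock the consumer.
           for future in futures:
               try:
                   future.result()
               except BaseException as exc:  # noqa: BLE001
                   errors.append(exc)
           while not stop.is_set():
               try:
                   buffer.put(_QUEUE_SENTINEL, timeout=0.1)
                   break
               except queue.Full:
                   continue
   
       threading.Thread(target=_finish, daemon=True).start()
   
       try:
           while (item := buffer.get()) is not _QUEUE_SENTINEL:
               yield item
           if errors:
               raise errors[0]
       finally:
           stop.set()
           executor.shutdown(wait=False, cancel_futures=True)
   ```
   
   With this shape, memory stays roughly bounded by `parallelism * batch size`, at the cost of possibly having up to `parallelism` files open at once, which is the downside mentioned above.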



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

