tom-at-rewbi commented on code in PR #2676:
URL: https://github.com/apache/iceberg-python/pull/2676#discussion_r2487391312
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1726,16 +1726,10 @@ def to_record_batches(self, tasks: Iterable[FileScanTask]) -> Iterator[pa.RecordBatch]:
         deletes_per_file = _read_all_delete_files(self._io, tasks)
         total_row_count = 0
-        executor = ExecutorFactory.get_or_create()
-
-        def batches_for_task(task: FileScanTask) -> List[pa.RecordBatch]:
-            # Materialize the iterator here to ensure execution happens within the executor.
-            # Otherwise, the iterator would be lazily consumed later (in the main thread),
-            # defeating the purpose of using executor.map.
-            return list(self._record_batches_from_scan_tasks_and_deletes([task], deletes_per_file))
Review Comment:
If I'm not mistaken, reading is still parallel after this change, albeit in a different way. Every time a scan task is processed, `pyarrow.dataset.Scanner.from_fragment` is called without overriding `use_threads`, and its default (`use_threads=True`) scans with maximum parallelism on PyArrow's CPU thread pool. [(link to source)](https://github.com/tom-at-rewbi/iceberg-python/blob/c8b76875bad3b81dc8e9b4ee9ec9c459b3babd8a/pyiceberg/io/pyarrow.py#L1557)
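To make those defaults concrete, here is a minimal, self-contained sketch (not pyiceberg code; the toy Parquet file and variable names are invented for illustration) of scanning a single fragment with the parallelism-related keywords spelled out:

```python
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Build a toy single-file Parquet dataset to scan.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.parquet")
pq.write_table(pa.table({"id": list(range(1_000))}), path)

dataset = ds.dataset(path, format="parquet")
fragment = next(iter(dataset.get_fragments()))

# The defaults the current code inherits, written out explicitly.
scanner = ds.Scanner.from_fragment(
    fragment,
    schema=dataset.schema,
    use_threads=True,    # default: decoding fans out on PyArrow's CPU thread pool
    batch_readahead=16,  # default: batches are prefetched ahead of the consumer
)
for batch in scanner.to_batches():
    print(batch.num_rows)
```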
`pyarrow.dataset.Scanner` also has configurable readahead options (e.g. `batch_readahead`, likewise enabled by default), which I _assume_ are parallelized as well, but I am not sure.

Are there any situations where we would expect stacking these two forms of parallelism (an outer executor on top of PyArrow's internal threading) to be beneficial?
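For contrast, a sketch of the opposite arrangement, keeping all the parallelism in one outer executor and pinning each scan to a single thread (this continues the toy `dataset` from the snippet above and mirrors the `batches_for_task` pattern this diff removes; it is not a proposal, just the other end of the trade-off):

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.dataset as ds

def batches_for_fragment(fragment):
    # Materialize eagerly so the work happens inside the executor,
    # just like the removed batches_for_task helper did.
    scanner = ds.Scanner.from_fragment(
        fragment,
        schema=dataset.schema,
        use_threads=False,  # scan stays on the calling thread; no inner pool
    )
    return list(scanner.to_batches())

with ThreadPoolExecutor() as executor:
    for batches in executor.map(batches_for_fragment, dataset.get_fragments()):
        for batch in batches:
            print(batch.num_rows)
```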
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]