psavalle commented on PR #1995:
URL: https://github.com/apache/iceberg-python/pull/1995#issuecomment-3289304023

   It doesn't look like this would solve the problem: even with a single 
thread, the new implementation still seems to pre-fetch all of the data into 
memory, irrespective of whether the iterator of record batches is being 
consumed. If the scan covers more data files than there is memory available, 
it would still run out of memory.
   
   I think the point of returning an `Iterator[pa.RecordBatch]` is that we 
should only fetch the next batch when the consumer asks for the next item from 
the iterator. For performance, it might still be useful to allow pre-fetching 
the next batch (or batches) in the background, but ideally with an explicitly 
configurable parameter that bounds how much is buffered.
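
   As a rough illustration of what I mean (not a proposed implementation, and 
the `prefetched` helper name is hypothetical), the two behaviors could be 
combined in a generic wrapper: with `prefetch=0` the iterator is fully lazy, 
and with `prefetch=N` a background thread buffers at most N items, so memory 
stays bounded regardless of how many data files the scan covers:

```python
import queue
import threading
from typing import Iterator, TypeVar

T = TypeVar("T")

def prefetched(source: Iterator[T], prefetch: int = 0) -> Iterator[T]:
    """Yield items from `source` lazily.

    prefetch=0: pull the next item only when the consumer requests it.
    prefetch=N: a background thread reads ahead, buffering at most N items.
    """
    if prefetch <= 0:
        # Fully lazy: nothing is fetched until the consumer iterates.
        yield from source
        return

    # Bounded queue caps memory at `prefetch` buffered items.
    buf: queue.Queue = queue.Queue(maxsize=prefetch)
    _SENTINEL = object()

    def producer() -> None:
        for item in source:
            buf.put(item)  # blocks once `prefetch` items are buffered
        buf.put(_SENTINEL)  # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not _SENTINEL:
        yield item
```

   In the scan code, the record-batch generator would be passed through such a 
wrapper, with the prefetch depth exposed as a table property or scan option.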


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
