shanielh commented on PR #11895:
URL: https://github.com/apache/iceberg/pull/11895#issuecomment-2568987952

   > I wonder if this is as important if we switch ParallelIterable to use the 
implementation suggested here https://github.com/apache/iceberg/issues/11768 
which limits the queue depth significantly and changes the yielding behavior.
   > 
   > 
   > 
   > I think it's a good perf change here but I do worry about disconnecting 
the poll/push operations from actually changing the size tracker for the queue. 
We probably aren't actually going to have any issues here though, since we are 
already checking the size at basically random times without regard to ongoing 
concurrent operations. 
   
   Since we only poll the size and the queue is a concurrent data structure, a 
momentarily stale size doesn't really matter; the tracker catches up as soon as 
the in-flight push/poll operations complete. 
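
To illustrate the point about the decoupled size tracker, here is a minimal sketch, not the code from this PR. It assumes the queue is a `ConcurrentLinkedQueue` with the size kept in a separate `AtomicInteger`; the class name `ApproxSizedQueue` and its methods are hypothetical.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of a queue whose size is tracked separately. */
final class ApproxSizedQueue<T> {
  private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
  private final AtomicInteger size = new AtomicInteger();

  void push(T item) {
    queue.add(item);
    // The counter is bumped after the add, so a concurrent reader may
    // briefly see a count that lags the real queue contents.
    size.incrementAndGet();
  }

  T poll() {
    T item = queue.poll();
    if (item != null) {
      // Likewise decoupled: the element is already gone before the counter moves.
      size.decrementAndGet();
    }
    return item;
  }

  /** O(1) read; exact only once no push/poll is in flight. */
  int approxSize() {
    return size.get();
  }
}
```

While operations are in flight the counter can be off by the number of concurrent callers (it can even dip below zero for a moment), but once they finish it matches the real queue size, which is enough for a periodic "is the queue too full?" check.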
   
   As for #11768, we use a different S3FileIO with a different InputStream 
mechanism: instead of keeping a connection open against S3, we download 
chunks of data on demand and hold them in memory. This way we can use 
ParallelIterable without having to think about the number of open connections 
against S3. The trade-off is cost, since a file may be downloaded with 
multiple GET calls instead of one, but it allows long-lived InputStream(s). 
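
A rough sketch of the chunked approach described above, under stated assumptions: this is not our actual S3FileIO code. `RangeFetcher` and `ChunkedInputStream` are hypothetical names, and `fetch` stands in for whatever issues a ranged GET against S3.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical range-fetch abstraction; for S3 this would be a ranged GET. */
interface RangeFetcher {
  byte[] fetch(long offset, int length) throws IOException;
}

/** Reads a remote object chunk by chunk; no connection stays open between reads. */
final class ChunkedInputStream extends InputStream {
  private final RangeFetcher fetcher;
  private final long totalLength;
  private final int chunkSize;
  private byte[] chunk = new byte[0]; // current in-memory chunk
  private int posInChunk = 0;
  private long nextOffset = 0;        // offset of the next chunk to fetch

  ChunkedInputStream(RangeFetcher fetcher, long totalLength, int chunkSize) {
    this.fetcher = fetcher;
    this.totalLength = totalLength;
    this.chunkSize = chunkSize;
  }

  @Override
  public int read() throws IOException {
    if (posInChunk == chunk.length && !fill()) {
      return -1;
    }
    return chunk[posInChunk++] & 0xFF;
  }

  @Override
  public int read(byte[] dst, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    if (posInChunk == chunk.length && !fill()) {
      return -1;
    }
    int n = Math.min(len, chunk.length - posInChunk);
    System.arraycopy(chunk, posInChunk, dst, off, n);
    posInChunk += n;
    return n;
  }

  /** Fetches the next chunk on demand; returns false at end of object. */
  private boolean fill() throws IOException {
    if (nextOffset >= totalLength) {
      return false;
    }
    int len = (int) Math.min(chunkSize, totalLength - nextOffset);
    chunk = fetcher.fetch(nextOffset, len);
    nextOffset += chunk.length;
    posInChunk = 0;
    return chunk.length > 0;
  }
}
```

Each `fill()` is one short-lived request, so a stream can sit idle inside a ParallelIterable queue without pinning an S3 connection; the price is one GET per chunk rather than one per file.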


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

