shanielh commented on PR #11895: URL: https://github.com/apache/iceberg/pull/11895#issuecomment-2568987952
> I wonder if this is as important if we switch ParallelIterable to use the implementation suggested here https://github.com/apache/iceberg/issues/11768 which limits the queue depth significantly and changes the yielding behavior.
>
> I think it's a good perf change here but I do worry about disconnecting the poll/push operations from actually changing the size tracker for the queue. We probably aren't actually going to have any issues here though, since we already check the size at basically random times without regard to ongoing concurrent operations.

Since we only poll the size, and the queue is a concurrent data structure, it doesn't really matter whether the size is exact at any given moment, as long as it is eventually accurate.

As for #11768, we use a different S3FileIO with a different mechanism for InputStream: instead of keeping a connection open against S3, we download chunks of data on demand and hold them in memory. This way we can use ParallelIterable without having to think about the number of connections against S3. It does increase the cost, since a file may be downloaded with multiple GET calls instead of one, but it allows long-lived InputStream(s).
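To make the chunked approach concrete, here is a minimal sketch of the idea, not our actual S3FileIO code. `RangeFetcher`, `ChunkedInputStream`, and the chunk size are hypothetical names chosen for illustration; in practice `fetch` would be a single ranged GET against S3, and no connection is held between calls.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Objects;

public class ChunkedStreamSketch {

  /** Hypothetical abstraction over one ranged read (e.g. one GET with a Range header). */
  public interface RangeFetcher {
    byte[] fetch(long offset, int length) throws IOException;
  }

  /** Serves reads from an in-memory chunk and fetches the next chunk only when needed. */
  public static class ChunkedInputStream extends InputStream {
    private final RangeFetcher fetcher;
    private final long objectLength;
    private final int chunkSize;
    private long nextOffset = 0;        // offset of the next chunk to fetch
    private byte[] chunk = new byte[0]; // chunk currently held in memory
    private int posInChunk = 0;

    public ChunkedInputStream(RangeFetcher fetcher, long objectLength, int chunkSize) {
      this.fetcher = Objects.requireNonNull(fetcher);
      this.objectLength = objectLength;
      this.chunkSize = chunkSize;
    }

    @Override
    public int read() throws IOException {
      if (posInChunk == chunk.length && !fill()) {
        return -1;
      }
      return chunk[posInChunk++] & 0xFF;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      Objects.checkFromIndexSize(off, len, b.length);
      if (len == 0) {
        return 0;
      }
      if (posInChunk == chunk.length && !fill()) {
        return -1;
      }
      // Copy at most what is left in the current chunk; callers loop for more.
      int n = Math.min(len, chunk.length - posInChunk);
      System.arraycopy(chunk, posInChunk, b, off, n);
      posInChunk += n;
      return n;
    }

    /** Fetches the next chunk; returns false once the whole object has been read. */
    private boolean fill() throws IOException {
      if (nextOffset >= objectLength) {
        return false;
      }
      int length = (int) Math.min(chunkSize, objectLength - nextOffset);
      chunk = fetcher.fetch(nextOffset, length);
      nextOffset += chunk.length;
      posInChunk = 0;
      return chunk.length > 0;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello chunked world".getBytes(StandardCharsets.UTF_8); // 19 bytes
    int[] calls = {0};
    // In-memory stand-in for S3: each call models one independent GET.
    RangeFetcher fetcher = (offset, length) -> {
      calls[0]++;
      return Arrays.copyOfRange(data, (int) offset, (int) offset + length);
    };
    try (InputStream in = new ChunkedInputStream(fetcher, data.length, 8)) {
      System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
    System.out.println(calls[0]); // 3 fetches: 8 + 8 + 3 bytes
  }
}
```

The trade-off is visible in the fetch counter: a 19-byte object read in 8-byte chunks costs three requests instead of one, but the stream can stay open for as long as the consumer likes without pinning a connection.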