kinolaev commented on PR #15792: URL: https://github.com/apache/iceberg/pull/15792#issuecomment-4155661890
> I think this stems from the fact that row groups are read sequentially, i had a pr a while back to prefetch the row groups and this would have potentially saved us in this scenario.
> PR : https://github.com/apache/iceberg/pull/7279/changes

If I understand it right, prefetching a row group wouldn't help. This problem is caused by an unbounded HTTP range request (`bytes=pos-`) combined with long row-group processing time (the time between `advance()` calls). I think the reader should either continue reading the file while the first row group is being processed, or make a bounded range request (via `S3InputStream.readFully`) for each row group. Unfortunately, I didn't find a way to implement either.

> do you know the case like your, reads ever success ? because if a fresh connection is established and we reprocess this task again from the begining we will still get into same situation ?

I encountered this issue only while executing the `rewrite_data_files` procedure, and this PR resolves it: the procedure no longer fails. We don't reprocess the task from the beginning; we just open a new connection in place of the already-closed one to read the next row group.

PS: @danielcweeks, this time I've double-checked that the problem actually happens in my production environment before opening the PR)
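To illustrate the difference between the two request shapes mentioned above: an unbounded range (`bytes=pos-`) holds one connection open across every row-group processing gap, while a bounded range covers exactly one row group and completes quickly. The helper below is a hypothetical sketch (not Iceberg or AWS SDK code) that just builds the two `Range` header values:

```java
// Hypothetical sketch: the two HTTP Range header shapes discussed above.
// An unbounded range ("bytes=pos-") reads from pos to EOF on one long-lived
// connection; a bounded range ("bytes=start-end") covers a single row group,
// so each request finishes before the next processing gap.
public class RangeHeaders {
  // Unbounded range: read from `pos` to the end of the object.
  static String unbounded(long pos) {
    return String.format("bytes=%d-", pos);
  }

  // Bounded range for one row group of `length` bytes starting at `start`.
  // HTTP byte ranges are inclusive on both ends, hence the -1.
  static String boundedRowGroup(long start, long length) {
    return String.format("bytes=%d-%d", start, start + length - 1);
  }

  public static void main(String[] args) {
    System.out.println(unbounded(4));             // bytes=4-
    System.out.println(boundedRowGroup(4, 1024)); // bytes=4-1027
  }
}
```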
