kinolaev commented on PR #15792:
URL: https://github.com/apache/iceberg/pull/15792#issuecomment-4155661890

   > I think this stems from the fact that row groups are read sequentially. I 
had a PR a while back to prefetch the row groups, and that would have 
potentially saved us in this scenario.
   > PR: https://github.com/apache/iceberg/pull/7279/changes
   
   If I understand it right, prefetching a row group wouldn't help. This 
problem is caused by an unbounded HTTP range request (`bytes=pos-`) combined 
with long row group processing time (the time between `advance()` calls). I 
think the reader should either keep reading the file while the first row group 
is being processed, or make bounded range requests (with 
`S3InputStream.readFully`) for each row group. Unfortunately, I didn't find a 
way to implement that.
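   To make the distinction concrete, here is a minimal sketch (names are illustrative, not Iceberg's actual API) of the two `Range` header shapes involved. The unbounded form keeps the connection tied up for the remainder of the file, so a long pause between row groups lets the server reset it; the bounded form covers exactly one row group, so the response completes promptly:

```java
public class RangeHeaders {
    // Unbounded: "bytes=pos-" asks for everything from pos to end-of-file,
    // which is what causes the idle-connection reset described above.
    static String unboundedRange(long pos) {
        return "bytes=" + pos + "-";
    }

    // Bounded: "bytes=first-last" (last is inclusive, per RFC 7233) covers
    // exactly one row group of the given length.
    static String boundedRange(long start, long length) {
        return "bytes=" + start + "-" + (start + length - 1);
    }

    public static void main(String[] args) {
        System.out.println(unboundedRange(4096));     // bytes=4096-
        System.out.println(boundedRange(4096, 1024)); // bytes=4096-5119
    }
}
```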
   
   > Do you know whether reads like yours ever succeed? Because if a fresh 
connection is established and we reprocess this task again from the beginning, 
we will still get into the same situation?
   
   I encountered this issue only while executing the rewrite_data_files 
procedure. This PR resolves it: the procedure no longer fails. We don't 
reprocess the task from the beginning; we just open a new connection in place 
of the already-closed one to read the next row group.
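   A hedged sketch of that recovery path (all names here are hypothetical, not Iceberg's actual code): when the server has reset the idle connection during row group processing, the read fails once, and we reconnect at the current position instead of restarting the task:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReopenDemo {
    // Hypothetical seam for opening a fresh connection at a file position.
    interface Opener { InputStream openAt(long pos) throws IOException; }

    static int readWithReopen(Opener opener, InputStream current,
                              long pos, byte[] buf) throws IOException {
        try {
            return current.read(buf);
        } catch (IOException staleConnection) {
            // Connection was reset while the previous row group was being
            // processed: reconnect at pos, do not redo earlier row groups.
            try (InputStream fresh = opener.openAt(pos)) {
                return fresh.read(buf);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] file = "row-group-2".getBytes();
        // Simulate a connection the server has already reset.
        InputStream stale = new InputStream() {
            @Override public int read() throws IOException {
                throw new IOException("Connection reset");
            }
        };
        byte[] buf = new byte[file.length];
        int n = readWithReopen(
            pos -> new ByteArrayInputStream(file, (int) pos, file.length - (int) pos),
            stale, 0, buf);
        System.out.println(new String(buf, 0, n)); // prints row-group-2
    }
}
```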
   
   PS: @danielcweeks, this time I've double-checked that the problem actually 
happens in my production environment before opening the PR)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
