kinolaev opened a new pull request, #15792:
URL: https://github.com/apache/iceberg/pull/15792

   During vectorized Parquet reads, `S3InputStream` opens an unbounded HTTP 
range request (`bytes=pos-`) and reads one row group eagerly into memory. While 
Spark processes that in-memory row group (which can take several minutes for 
large batches), the client stops reading from S3. The TCP receive buffer fills 
up, and S3 eventually tears down the stalled connection.
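   To make the first step concrete, the open-ended range header looks like this (a plain-Java sketch; `unboundedRange` is a hypothetical helper for illustration, not Iceberg's actual code):

   ```java
   // Sketch of the open-ended HTTP Range header used by an unbounded ranged GET.
   // "bytes=<pos>-" asks S3 for everything from byte offset pos to the end of
   // the object, so the connection stays open until the client drains it.
   final class RangeHeaderSketch {
     static String unboundedRange(long pos) {
       return "bytes=" + pos + "-";
     }
   }
   ```

   For example, a read positioned at 128 MiB would send `Range: bytes=134217728-`.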
   
   When the next row group read begins, the connection is already dead, and the 
Apache HTTP client throws `ConnectionClosedException: Premature end of 
Content-Length delimited message body (expected: x; received: y)` (raised by 
[ContentLengthInputStream](https://github.com/apache/httpcomponents-core/blob/rel/v5.4.2/httpcore5/src/main/java/org/apache/hc/core5/http/impl/io/ContentLengthInputStream.java#L176-L178)).
 This only affects files with multiple row groups (typically >128 MB).
   
   The existing retry policy handles `SSLException`, `SocketTimeoutException`, 
and `SocketException`, but not this case. This PR extends the retry predicate 
to reopen the stream at the saved position when this specific exception is 
encountered, while leaving all other `ConnectionClosedException` variants (e.g. 
from `abort()`) unaffected.
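   A minimal sketch of the kind of message-based predicate this describes (class and method names are illustrative, not Iceberg's actual code, and a local stand-in replaces the httpcore5 exception so the sketch compiles without that dependency):

   ```java
   import java.io.IOException;

   // Stand-in for org.apache.hc.core5.http.ConnectionClosedException (which
   // extends IOException) so this sketch needs no httpcore5 dependency.
   class ConnectionClosedException extends IOException {
     ConnectionClosedException(String message) {
       super(message);
     }
   }

   // Illustrative retry predicate: treat only the "premature end of body"
   // variant as retryable, leaving other ConnectionClosedExceptions
   // (e.g. from abort()) non-retryable.
   final class RetryPredicateSketch {
     private static final String PREMATURE_EOF =
         "Premature end of Content-Length delimited message body";

     static boolean shouldRetry(Throwable failure) {
       // Walk the cause chain: the SDK often wraps the HTTP client's exception.
       for (Throwable t = failure; t != null; t = t.getCause()) {
         if (t instanceof ConnectionClosedException
             && t.getMessage() != null
             && t.getMessage().startsWith(PREMATURE_EOF)) {
           return true; // stalled connection was torn down; reopen at saved position
         }
       }
       return false;
     }
   }
   ```

   Matching on the exception message is brittle but deliberate here: it is the only way to distinguish the stalled-connection case from an intentional `abort()`, which raises the same exception type.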
   
   Fixes https://github.com/apache/iceberg/issues/9674 and 
https://github.com/apache/iceberg/issues/9679.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

