kinolaev opened a new pull request, #15792: URL: https://github.com/apache/iceberg/pull/15792
During vectorized Parquet reads, `S3InputStream` opens an unbounded HTTP range request (`bytes=pos-`) and reads one row group eagerly into memory. While Spark processes that in-memory row group (which can take several minutes for large batches), the client stops reading from S3. The TCP receive buffer fills up, and S3 eventually tears down the stalled connection. When the next row group read begins, the connection is already dead and the Apache HTTP client throws `ConnectionClosedException: Premature end of Content-Length delimited message body (expected: x; received: y)` (when using the [Apache HTTP client](https://github.com/apache/httpcomponents-core/blob/rel/v5.4.2/httpcore5/src/main/java/org/apache/hc/core5/http/impl/io/ContentLengthInputStream.java#L176-L178)).

This only affects files with multiple row groups (typically >128 MB). The existing retry policy handles `SSLException`, `SocketTimeoutException`, and `SocketException`, but not this case.

This PR extends the retry predicate to reopen the stream at the saved position when this specific exception is encountered, while leaving all other `ConnectionClosedException` variants (e.g. from `abort()`) unaffected.

Fixes https://github.com/apache/iceberg/issues/9674 and https://github.com/apache/iceberg/issues/9679.
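The narrowed retry condition described above can be sketched as follows. This is an illustrative predicate, not Iceberg's actual code: the class name, the stand-in exception type (which mirrors `org.apache.hc.core5.http.ConnectionClosedException` so the sketch is self-contained), and the message-matching approach are all assumptions for demonstration.

```java
import java.io.IOException;
import java.util.function.Predicate;

/**
 * Sketch of a retry predicate that treats only the truncated-body variant of
 * ConnectionClosedException as retryable, so the stream can be reopened at the
 * saved position, while other variants (e.g. from abort()) still fail fast.
 */
public class RetryPredicateSketch {

  // Stand-in for the Apache HTTP client's ConnectionClosedException,
  // included here only so the sketch compiles on its own.
  static class ConnectionClosedException extends IOException {
    ConnectionClosedException(String message) {
      super(message);
    }
  }

  static final Predicate<Throwable> SHOULD_RETRY = t -> {
    // Walk the cause chain: the exception may arrive wrapped in another IOException.
    for (Throwable cause = t; cause != null; cause = cause.getCause()) {
      if (cause instanceof ConnectionClosedException
          && cause.getMessage() != null
          && cause.getMessage()
              .startsWith("Premature end of Content-Length delimited message body")) {
        return true;
      }
    }
    return false;
  };
}
```

Matching on the exception message is fragile in general, but here it is the only way to distinguish the stalled-connection failure from a deliberate `abort()`, since the HTTP client raises the same exception type for both.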
