wjzhou opened a new issue, #41604: URL: https://github.com/apache/arrow/issues/41604
### Describe the bug, including details regarding any error messages, version, and platform.

I'm using `pyarrow.csv.open_csv` to stream-read a 15 GB gzipped CSV file from S3, and the speed is unusably slow.

```python
file_src = pyarrow_s3fs.open_input_stream(path_src)
read_options = pyarrow.csv.ReadOptions(block_size=5_000_000, encoding="latin1")
csv = pyarrow.csv.open_csv(
    file_src,
    read_options=read_options,
    parse_options=parse_options,
    convert_options=convert_options,
)
for batch in csv:
    ...
```

Even though I set `block_size=5_000_000`, the reader issues 65K ranged reads over S3. This is bad for two reasons:

1. S3 charges for each GET request ($0.0004/req).
2. Computing the authentication headers etc. for every request has a computational cost.

After digging into the code, I found that this `kChunkSize` is hard-coded in `CompressedInputStream`:

https://github.com/apache/arrow/blob/f6127a6d18af12ce18a0b8b1eac02346721cc399/cpp/src/arrow/io/compressed.cc#L432

My current workaround is to use a buffered stream: `file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)`. But this is not obvious from the docs. Could we set this value higher, or at least add some documentation to clarify the usage?

### Component(s)

C++, Python
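For reference, here is a minimal, self-contained sketch of the buffered-stream workaround. The bucket, region, path, and 10 MB buffer size are placeholder values, and `pyarrow.fs.S3FileSystem` stands in for whatever filesystem object `pyarrow_s3fs` is in the snippet above:

```python
import pyarrow.csv as pacsv
import pyarrow.fs as fs

# Placeholder region/path; substitute your own.
s3 = fs.S3FileSystem(region="us-east-1")
path_src = "my-bucket/data/large_file.csv.gz"

# buffer_size wraps the raw S3 stream in a BufferedInputStream, so the
# decompressor's small chunked reads are served from memory instead of
# each becoming a separate S3 GET request.
file_src = s3.open_input_stream(path_src, buffer_size=10_000_000)

read_options = pacsv.ReadOptions(block_size=5_000_000, encoding="latin1")
reader = pacsv.open_csv(file_src, read_options=read_options)

for batch in reader:
    ...  # process each RecordBatch
```

As far as I can tell, this helps because the buffering is applied beneath the compression wrapper (`open_input_stream` detects the `.gz` suffix and adds a `CompressedInputStream` on top), so only the buffered layer actually talks to S3.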