wjzhou opened a new issue, #41604:
URL: https://github.com/apache/arrow/issues/41604
### Describe the bug, including details regarding any error messages, version, and platform.
I'm using pyarrow.csv.open_csv to stream-read a 15 GB gzipped CSV file from S3.
The speed is unusably slow.
```python
import pyarrow.csv

# pyarrow_s3fs is a pyarrow.fs.S3FileSystem; path_src, parse_options and
# convert_options are defined elsewhere.
file_src = pyarrow_s3fs.open_input_stream(path_src)
read_options = pyarrow.csv.ReadOptions(block_size=5_000_000, encoding="latin1")
csv = pyarrow.csv.open_csv(
    file_src,
    read_options=read_options,
    parse_options=parse_options,
    convert_options=convert_options,
)
for batch in csv:
    ...
```
Here, even though I set block_size=5_000_000, the reader issues 65 KB ranged
reads over S3. This is bad for two reasons:
1. S3 charges for each GET request ($0.0004/req).
2. Computing the authentication headers etc. for every request adds CPU overhead.
After digging into the code, I found that `kChunkSize` is hard-coded in
CompressedInputStream:
https://github.com/apache/arrow/blob/f6127a6d18af12ce18a0b8b1eac02346721cc399/cpp/src/arrow/io/compressed.cc#L432
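For scale, here is a rough back-of-the-envelope count of ranged GET requests, assuming `kChunkSize` is 64 KiB (which matches the ~65 KB reads I observed; the exact request pattern may differ slightly):

```python
# Rough estimate only: how many ranged GETs it takes to pull a ~15 GB object
# in 64 KiB chunks versus in the 5 MB blocks requested via block_size.
file_size = 15 * 1024**3   # the ~15 GB gz file on S3
chunk_size = 64 * 1024     # kChunkSize hard-coded in CompressedInputStream
block_size = 5_000_000     # block_size passed to ReadOptions

print(file_size // chunk_size)  # ~245,760 GET requests at 64 KiB per read
print(file_size // block_size)  # ~3,221 GET requests at 5 MB per read
```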
Currently, my workaround is to use a buffered stream (see the sketch below):
`file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)`
But this is not obvious from the documentation. Could we set this value higher,
or at least add some documentation to clarify the usage?
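A minimal sketch of how the workaround slots into the original snippet (assuming the same `pyarrow_s3fs`, `path_src`, `parse_options` and `convert_options` as above; `buffer_size=10_000_000` is just the value I picked, not a recommended default):

```python
# Wrap the S3 stream in a ~10 MB read buffer so each ranged GET fetches a
# large contiguous chunk instead of 64 KiB.
file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)
csv = pyarrow.csv.open_csv(
    file_src,
    read_options=pyarrow.csv.ReadOptions(block_size=5_000_000, encoding="latin1"),
    parse_options=parse_options,
    convert_options=convert_options,
)
for batch in csv:
    ...  # each batch is a pyarrow.RecordBatch
```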
### Component(s)
C++, Python