wjzhou opened a new issue, #41604:
URL: https://github.com/apache/arrow/issues/41604
### Describe the bug, including details regarding any error messages, version, and platform.
I'm using pyarrow.csv.open_csv to stream-read a 15 GB gzipped CSV file from S3.
The speed is unusably slow.
```python
import pyarrow.csv

# pyarrow_s3fs is a pyarrow.fs.S3FileSystem; path_src, parse_options and
# convert_options are defined elsewhere.
file_src = pyarrow_s3fs.open_input_stream(path_src)
read_options = pyarrow.csv.ReadOptions(block_size=5_000_000, encoding="latin1")
csv = pyarrow.csv.open_csv(
    file_src,
    read_options=read_options,
    parse_options=parse_options,
    convert_options=convert_options,
)
for batch in csv:
    ...
```
Here, even though I set block_size=5_000_000, the reader issues 65 KB ranged
reads over S3. This is bad for two reasons:
1. S3 charges for each GET request ($0.0004/req).
2. Computing the authentication headers etc. for every request adds CPU overhead.
After digging into the code, I found that `kChunkSize` is hard-coded in
CompressedInputStream:
https://github.com/apache/arrow/blob/f6127a6d18af12ce18a0b8b1eac02346721cc399/cpp/src/arrow/io/compressed.cc#L432
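For scale, here is a rough back-of-the-envelope count of ranged GET requests, assuming `kChunkSize` is 64 KiB (which matches the ~65 KB reads I observed; the exact request pattern may differ slightly):

```python
# Rough estimate only: how many ranged GETs it takes to pull a ~15 GB object
# in 64 KiB chunks versus in the 5 MB blocks requested via block_size.
file_size = 15 * 1024**3   # the ~15 GB gz file on S3
chunk_size = 64 * 1024     # kChunkSize hard-coded in CompressedInputStream
block_size = 5_000_000     # block_size passed to ReadOptions

print(file_size // chunk_size)  # ~245,760 GET requests at 64 KiB per read
print(file_size // block_size)  # ~3,221 GET requests at 5 MB per read
```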
Currently, my workaround is to use a buffered stream (see the sketch below):
`file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)`
But this is not obvious from the documentation. Could we set this value higher,
or at least add some documentation to clarify the usage?
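A minimal sketch of how the workaround slots into the original snippet (assuming the same `pyarrow_s3fs`, `path_src`, `parse_options` and `convert_options` as above; `buffer_size=10_000_000` is just the value I picked, not a recommended default):

```python
# Wrap the S3 stream in a ~10 MB read buffer so each ranged GET fetches a
# large contiguous chunk instead of 64 KiB.
file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)
csv = pyarrow.csv.open_csv(
    file_src,
    read_options=pyarrow.csv.ReadOptions(block_size=5_000_000, encoding="latin1"),
    parse_options=parse_options,
    convert_options=convert_options,
)
for batch in csv:
    ...  # each batch is a pyarrow.RecordBatch
```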
### Component(s)
C++, Python