wjzhou opened a new issue, #41604:
URL: https://github.com/apache/arrow/issues/41604

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I'm using pyarrow.csv.open_csv to stream-read a 15 GB gzipped CSV file from S3. The speed is unusably slow.
   
   ```python
   import pyarrow.csv
   
   # pyarrow_s3fs is a pyarrow.fs.S3FileSystem; path_src points at a ~15 GB gzipped CSV
   file_src = pyarrow_s3fs.open_input_stream(path_src)
   
   read_options = pyarrow.csv.ReadOptions(block_size=5_000_000, encoding="latin1")
   csv_reader = pyarrow.csv.open_csv(
       file_src,
       read_options=read_options,
       parse_options=parse_options,
       convert_options=convert_options,
   )
   for batch in csv_reader:
       ...
   ```
   
   Here, even with block_size=5_000_000, the reader is still issuing ~65 KB (64 KiB) ranged reads over S3.
   
   This is bad for two reasons:
   1. S3 charges for each GET request ($0.0004/req); see the rough request-count estimate below.
   2. Computing the authentication headers etc. for every request has a computational cost.
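   
   A rough back-of-the-envelope estimate, assuming the 15 GB compressed file is pulled in 64 KiB chunks with one ranged GET per chunk (which is what the hard-coded chunk size below implies):
   
   ```python
   # Back-of-the-envelope: number of S3 ranged GETs when the compressed
   # stream is read in 64 KiB chunks (kChunkSize, see below).
   file_size = 15 * 1024**3        # ~15 GB gzip file on S3
   chunk_size = 64 * 1024          # 64 KiB per raw read
   print(file_size // chunk_size)  # 245760 GET requests
   ```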
   
   After digging into the code, I found that this `kChunkSize` is hard-coded in `CompressedInputStream`:
   https://github.com/apache/arrow/blob/f6127a6d18af12ce18a0b8b1eac02346721cc399/cpp/src/arrow/io/compressed.cc#L432
   
   Currently, my workaround is to use a buffered stream:
   `file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)`
   
   But this is not obvious from the docs. Could we set this value higher, or at least add some documentation to clarify the usage? A sketch of the workaround is below.
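   
   For reference, a minimal end-to-end sketch of the workaround (the `S3FileSystem()` setup and the bucket/key are placeholders, and parse/convert options are omitted):
   
   ```python
   import pyarrow.csv
   from pyarrow.fs import S3FileSystem
   
   pyarrow_s3fs = S3FileSystem()
   path_src = "my-bucket/path/to/big_file.csv.gz"  # placeholder path
   
   # The 10 MB buffer sits between S3 and the gzip decompressor, so each
   # 64 KiB read issued by CompressedInputStream is served from the buffer
   # instead of turning into its own ranged GET.
   file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)
   
   read_options = pyarrow.csv.ReadOptions(block_size=5_000_000, encoding="latin1")
   reader = pyarrow.csv.open_csv(file_src, read_options=read_options)
   for batch in reader:
       ...  # process each RecordBatch
   ```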
   
   ### Component(s)
   
   C++, Python

