mjs opened a new issue, #47239: URL: https://github.com/apache/arrow/issues/47239
### Describe the enhancement requested

I've been looking at why Arrow's reads of Parquet files from an S3 store are slower than Polars and ClickHouse. A packet capture made the problem clear. For a single Parquet file read, the following S3 requests are made:

- HEAD
- HEAD
- HEAD
- Oversized ranged GET of the tail of the object to read the metadata block
- HEAD
- Ranged GETs to read the object data

If there's any significant latency between the Arrow client and the S3 server (which is likely), all of these requests translate into a performance bottleneck. I'm using MinIO, and there's a very noticeable difference in overall read performance between a client that's 1 ms away from the server and one that's 30 ms away. That's 150 ms before any data is transferred when it could be 60 ms. The impact gets worse the further apart the client and server are, with AWS S3 and GCS likely being the worst cases.

Compare that to what Polars does to read the same Parquet file:

- HEAD
- 8-byte read at the end to get the metadata size
- Precise tail read to get the metadata
- Ranged GETs from the start to read the table metadata

Arguably it could be even smarter and just read the last 64 KB speculatively, saving a request instead of doing an exact read of the metadata (a rough sketch of that idea is at the end of this issue).

ClickHouse is smart when it comes to smaller objects: it does a HEAD and just grabs the whole object in one go if the size is below some threshold. For larger objects it does what Arrow does, with too many HEAD requests (one fewer than Arrow).

I've tried `allow_delayed_open`, but this seems to make no difference to S3 read requests despite the documentation hinting that it might (see the first sketch below for how I'm setting it). `allow_delayed_open` does help with the efficiency of writing smaller objects, though.

Are there any plans to improve the efficiency of Arrow's S3 reads?

### Component(s)

C++
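
For reference, here's roughly how I'm exercising the read path. This is a minimal sketch: the endpoint, bucket, and object key are placeholders, and it assumes `allow_delayed_open` is available as a field on `arrow::fs::S3Options` (as in recent Arrow releases).

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/filesystem/s3fs.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

arrow::Status ReadOneObject() {
  // MinIO endpoint, bucket and key below are placeholders.
  arrow::fs::S3Options options = arrow::fs::S3Options::Defaults();
  options.endpoint_override = "minio.example.com:9000";
  options.scheme = "http";
  options.region = "us-east-1";
  // Assumed field name; this helps my small writes, but the read-side
  // request pattern described above is unchanged.
  options.allow_delayed_open = true;

  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options));
  ARROW_ASSIGN_OR_RAISE(auto input, fs->OpenInputFile("bucket/path/to/file.parquet"));

  // Pre-buffering coalesces the column-chunk GETs, but the HEADs and the
  // footer read still happen before any data is transferred.
  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(true);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input));
  builder.properties(arrow_props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  std::cout << "rows: " << table->num_rows() << std::endl;
  return arrow::Status::OK();
}

int main() {
  if (!arrow::fs::EnsureS3Initialized().ok()) return 1;
  arrow::Status st = ReadOneObject();
  (void)arrow::fs::FinalizeS3();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```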

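And here's a rough sketch of the "speculative 64 KB tail read" idea, in case it's useful. The helper below is hypothetical (not an existing Arrow API): after a single ranged read of the tail it parses the footer length from the last 8 bytes and deserializes the metadata in place, so the reader never has to fetch the footer itself. It assumes the serialized footer fits in the speculative read (the fallback is left out) and a little-endian host.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/metadata.h>

// Hypothetical helper: fetch the Parquet footer with a single speculative
// tail read instead of an exact-size footer read preceded by extra requests.
arrow::Result<std::shared_ptr<parquet::FileMetaData>> ReadFooterSpeculatively(
    const std::shared_ptr<arrow::io::RandomAccessFile>& input,
    int64_t speculative_bytes = 64 * 1024) {
  ARROW_ASSIGN_OR_RAISE(int64_t file_size, input->GetSize());
  const int64_t read_size = std::min(speculative_bytes, file_size);
  ARROW_ASSIGN_OR_RAISE(auto tail, input->ReadAt(file_size - read_size, read_size));

  // Last 8 bytes of a Parquet file: 4-byte footer length + "PAR1" magic.
  // (memcpy assumes a little-endian host for this sketch.)
  const uint8_t* end = tail->data() + tail->size();
  uint32_t footer_len;
  std::memcpy(&footer_len, end - 8, sizeof(footer_len));

  if (static_cast<int64_t>(footer_len) + 8 > read_size) {
    // Footer larger than the speculative read; a real implementation would
    // issue one more ranged GET here. Omitted for brevity.
    return arrow::Status::NotImplemented("footer larger than speculative read");
  }

  uint32_t metadata_len = footer_len;
  return parquet::FileMetaData::Make(end - 8 - footer_len, &metadata_len);
}
```

The resulting `FileMetaData` could then be passed as the metadata argument to `parquet::arrow::FileReaderBuilder::Open()`, so opening the file costs one size lookup plus one ranged GET before the data reads start.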