mjs opened a new issue, #47239: URL: https://github.com/apache/arrow/issues/47239
### Describe the enhancement requested

I've been looking at why Arrow's reads of Parquet files from an S3 store are slower than Polars and ClickHouse. A packet capture made the problem clear. For a single Parquet file read, the following S3 requests are made:

- HEAD
- HEAD
- HEAD
- Oversized ranged GET of the tail of the object to read the metadata block
- HEAD
- Ranged GETs to read the object data

If there's any significant latency between the Arrow client and the S3 server (which is likely), all of these requests translate into a performance bottleneck. I'm using MinIO, and there's a very noticeable difference in overall read performance between a client that's 1 ms away from the server and one that's 30 ms away. That's 150 ms before any data is transferred when it could be 60 ms. The impact gets worse the further apart the client and server are, with AWS S3 and GCS likely being the worst cases.

Compare that to what Polars does to read the same Parquet file:

- HEAD
- 8-byte read at the end to get the metadata size
- Precise tail read to get the metadata
- Ranged GETs from the start to read the table metadata

Arguably it could be even smarter and just read the last 64 KB speculatively, saving a request instead of doing an exact read of the metadata (a rough sketch of that idea is at the end of this issue).

ClickHouse is smart when it comes to smaller objects: it does a HEAD and just grabs the whole object in one go if the size is below some threshold. For larger objects it does what Arrow does, with too many HEAD requests (one fewer than Arrow).

I've tried `allow_delayed_open`, but this seems to make no difference to S3 read requests despite the documentation hinting that it might (see the first sketch below for how I'm setting it). `allow_delayed_open` does help with the efficiency of writing smaller objects, though.

Are there any plans to improve the efficiency of Arrow's S3 reads?

### Component(s)

C++
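
For reference, here's roughly how I'm exercising the read path. This is a minimal sketch: the endpoint, bucket, and object key are placeholders, and it assumes `allow_delayed_open` is available as a field on `arrow::fs::S3Options` (as in recent Arrow releases).

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/filesystem/s3fs.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

arrow::Status ReadOneObject() {
  // MinIO endpoint, bucket and key below are placeholders.
  arrow::fs::S3Options options = arrow::fs::S3Options::Defaults();
  options.endpoint_override = "minio.example.com:9000";
  options.scheme = "http";
  options.region = "us-east-1";
  // Assumed field name; this helps my small writes, but the read-side
  // request pattern described above is unchanged.
  options.allow_delayed_open = true;

  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options));
  ARROW_ASSIGN_OR_RAISE(auto input, fs->OpenInputFile("bucket/path/to/file.parquet"));

  // Pre-buffering coalesces the column-chunk GETs, but the HEADs and the
  // footer read still happen before any data is transferred.
  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_pre_buffer(true);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input));
  builder.properties(arrow_props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  std::cout << "rows: " << table->num_rows() << std::endl;
  return arrow::Status::OK();
}

int main() {
  if (!arrow::fs::EnsureS3Initialized().ok()) return 1;
  arrow::Status st = ReadOneObject();
  (void)arrow::fs::FinalizeS3();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```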

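And here's a rough sketch of the "speculative 64 KB tail read" idea, in case it's useful. The helper below is hypothetical (not an existing Arrow API): after a single ranged read of the tail it parses the footer length from the last 8 bytes and deserializes the metadata in place, so the reader never has to fetch the footer itself. It assumes the serialized footer fits in the speculative read (the fallback is left out) and a little-endian host.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/metadata.h>

// Hypothetical helper: fetch the Parquet footer with a single speculative
// tail read instead of an exact-size footer read preceded by extra requests.
arrow::Result<std::shared_ptr<parquet::FileMetaData>> ReadFooterSpeculatively(
    const std::shared_ptr<arrow::io::RandomAccessFile>& input,
    int64_t speculative_bytes = 64 * 1024) {
  ARROW_ASSIGN_OR_RAISE(int64_t file_size, input->GetSize());
  const int64_t read_size = std::min(speculative_bytes, file_size);
  ARROW_ASSIGN_OR_RAISE(auto tail, input->ReadAt(file_size - read_size, read_size));

  // Last 8 bytes of a Parquet file: 4-byte footer length + "PAR1" magic.
  // (memcpy assumes a little-endian host for this sketch.)
  const uint8_t* end = tail->data() + tail->size();
  uint32_t footer_len;
  std::memcpy(&footer_len, end - 8, sizeof(footer_len));

  if (static_cast<int64_t>(footer_len) + 8 > read_size) {
    // Footer larger than the speculative read; a real implementation would
    // issue one more ranged GET here. Omitted for brevity.
    return arrow::Status::NotImplemented("footer larger than speculative read");
  }

  uint32_t metadata_len = footer_len;
  return parquet::FileMetaData::Make(end - 8 - footer_len, &metadata_len);
}
```

The resulting `FileMetaData` could then be passed as the metadata argument to `parquet::arrow::FileReaderBuilder::Open()`, so opening the file costs one size lookup plus one ranged GET before the data reads start.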