eeroel opened a new issue, #37917: URL: https://github.com/apache/arrow/issues/37917
### Describe the enhancement requested

As discussed [here](https://github.com/apache/arrow/pull/37868), the Parquet reader uses a synchronous `FileSource::Open` when it starts reading a file:

https://github.com/apache/arrow/blob/e038498c70207df1ac64b1aa276a5fd5e3cd306b/cpp/src/arrow/dataset/file_parquet.cc#L482

When reading from S3, this `Open` call makes a HEAD request, which delays the start of processing for all other files in the dataset.

Here are two illustrative images of the performance difference, reading a FileSystemDataset of 16 files with ~200K rows in total, in PyArrow, with maximum `fragment_readahead` and `pre_buffer=True`. Files are on the y-axis and time is on the x-axis; each thread has its own color. Each point represents one request (HEAD or GET) to S3.

Here's the current behavior, where the first request for each file is processed on the blue thread:

<img width="1145" alt="Screenshot 2023-09-27 at 23 05 06" src="https://github.com/apache/arrow/assets/10564706/d48c0761-4be0-4dd5-8691-94822de6f4bd">

And here's the behavior with a WIP implementation of `OpenAsync` (note the different x-axis scaling):

<img width="1148" alt="Screenshot 2023-09-27 at 23 04 49" src="https://github.com/apache/arrow/assets/10564706/b6a86960-b3e7-4e5e-ab4a-13d935456679">

In both images, the first two (blue) points are from a separate request for one file to get the schema.

It's still a bit of a mystery to me why, in the async case, the concurrent requests start only after the fourth request. It seems like there could be some performance to be gained there as well.

### Component(s)

Parquet
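
To illustrate why a blocking open per file matters, here is a minimal, self-contained sketch in Python. It is not Arrow code: `open_file` is a hypothetical stand-in for an open call that waits on one HEAD request, the 20 ms latency is an assumed value, and the thread pool only approximates what an asynchronous open could allow. It compares opening 16 files one after another with opening them concurrently.

```python
import time
from concurrent.futures import ThreadPoolExecutor

HEAD_LATENCY_S = 0.02  # assumed latency of one HEAD request (illustrative value)
NUM_FILES = 16         # same file count as the dataset described in the issue


def open_file(path):
    """Hypothetical stand-in for an open call that blocks on one HEAD request."""
    time.sleep(HEAD_LATENCY_S)
    return path


def open_serially(paths):
    """Open every file on the calling thread, one after another."""
    start = time.perf_counter()
    opened = [open_file(p) for p in paths]
    return opened, time.perf_counter() - start


def open_concurrently(paths):
    """Issue all opens at once so their latencies overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        opened = list(pool.map(open_file, paths))
    return opened, time.perf_counter() - start


paths = [f"file-{i}.parquet" for i in range(NUM_FILES)]
serial_files, serial_s = open_serially(paths)
concurrent_files, concurrent_s = open_concurrently(paths)
print(f"serial: {serial_s:.3f}s, concurrent: {concurrent_s:.3f}s")
```

In the serial case the total open time grows with the number of files, since each open waits for the previous one. In the concurrent case the waits overlap. Real S3 latencies and Arrow's scheduling will differ from this toy model.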
