[GitHub] [arrow] eeroel opened a new issue, #37917: [Parquet] Implement OpenAsync for FileSource

via GitHub Wed, 27 Sep 2023 13:28:21 -0700


eeroel opened a new issue, #37917:
URL: https://github.com/apache/arrow/issues/37917


   ### Describe the enhancement requested
   
   As discussed [here](https://github.com/apache/arrow/pull/37868), the Parquet 
reader calls the synchronous `FileSource::Open` when it starts reading a file: 
https://github.com/apache/arrow/blob/e038498c70207df1ac64b1aa276a5fd5e3cd306b/cpp/src/arrow/dataset/file_parquet.cc#L482
   
   In the case of reading from S3, this Open method makes a HEAD request, which 
delays the start of processing of all other files when reading from a dataset.
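
To illustrate why this matters, here is a minimal, self-contained simulation (not Arrow code). It assumes each open costs one fixed-latency HEAD request, modeled with `time.sleep`. The names `open_file`, `open_serially`, `open_concurrently` and the 50 ms latency are hypothetical and chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

HEAD_LATENCY_S = 0.05  # assumed round-trip time of one HEAD request
NUM_FILES = 16         # same number of files as in the measurements


def open_file(index: int) -> int:
    """Stand-in for opening one file; the sleep models the HEAD request."""
    time.sleep(HEAD_LATENCY_S)
    return index


def open_serially(num_files: int) -> float:
    """Open files one after another, as blocking opens on one thread would."""
    start = time.perf_counter()
    for i in range(num_files):
        open_file(i)
    return time.perf_counter() - start


def open_concurrently(num_files: int) -> float:
    """Issue all opens at once so that the HEAD latencies overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_files) as pool:
        list(pool.map(open_file, range(num_files)))
    return time.perf_counter() - start


serial_s = open_serially(NUM_FILES)
concurrent_s = open_concurrently(NUM_FILES)
print(f"serial: {serial_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

In this toy model the serial time grows linearly with the number of files, while the concurrent time stays close to a single request's latency. Real S3 behavior will of course differ.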
   
   Here are two illustrative images of the performance difference when reading a 
FileSystemDataset of 16 files with ~200K rows in total in PyArrow, with 
maximum fragment_readahead and pre_buffer=True. Files are on the y-axis, time 
is on the x-axis, and each thread has its own color. Each point represents one 
request (HEAD or GET) to S3.
   
   Here's the current behavior, where the first request for each file is 
processed on the blue thread:
   <img width="1145" alt="Screenshot 2023-09-27 at 23 05 06" 
src="https://github.com/apache/arrow/assets/10564706/d48c0761-4be0-4dd5-8691-94822de6f4bd">
   
   And here's the behavior with a WIP implementation of OpenAsync (note the 
different x-axis scaling):
   <img width="1148" alt="Screenshot 2023-09-27 at 23 04 49" 
src="https://github.com/apache/arrow/assets/10564706/b6a86960-b3e7-4e5e-ab4a-13d935456679">
   
   In both images, the first two (blue) points come from a separate request for 
one file to get the schema. It's still a bit of a mystery to me why, in the 
async case, the concurrent requests start only after the fourth request. It 
seems like there could be some performance to be gained there as well.
   
   
   
   
   
   ### Component(s)
   
   Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
