[GitHub] [arrow] westonpace opened a new issue, #34213: [C++] Performance issue listing files over S3

via GitHub Wed, 15 Feb 2023 15:49:23 -0800


westonpace opened a new issue, #34213:
URL: https://github.com/apache/arrow/issues/34213


   ### Describe the enhancement requested
   
   The current GetFileInfo implementation (ignoring paging/continuation) in S3 
is roughly:
   
   ```
   def list_dir(path, results=[]):
     rsp = s3_list_objects(prefix=path, delimiter="/")
     # every common prefix ("sub-directory") triggers its own request
     parallel for common_prefix in rsp.common_prefixes:
       list_dir(common_prefix, results)
     for file in rsp.files:
       results.append(file)
   ```
   
   The `prefix` and `delimiter` parameters are S3 constructs described in more 
detail [in the S3 
docs](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html).
   
   Since we use `/` as the delimiter, this yields an experience that is very 
similar to "directory walking".  We issue one HTTP request for every single 
directory.
   
   Alternatively, one could simply do:
   
   ```
   def list_dir(path, results=[]):
     # no delimiter: every key under the prefix comes back in one listing
     rsp = s3_list_objects(prefix=path)
     for file in rsp.files:
       results.append(file)
   ```
   
   This would guarantee one HTTP request (again ignoring paging/continuation) 
regardless of how many directories there are.
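To make the difference in request counts concrete, here is a minimal sketch that runs both strategies against an in-memory stand-in for a bucket. `FakeBucket`, `list_recursive`, `list_flat` and the key layout are invented for illustration; this is not Arrow's implementation, it ignores paging, and it walks prefixes sequentially rather than in parallel.

```python
class FakeBucket:
    """In-memory stand-in for an S3 bucket that counts list requests (hypothetical)."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.requests = 0

    def list_objects(self, prefix="", delimiter=None):
        """Roughly mimic ListObjectsV2 without paging: return (files, common_prefixes)."""
        self.requests += 1
        files, common = [], set()
        for key in self.keys:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                # Keys below a "sub-directory" are rolled up into one common prefix.
                common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                files.append(key)
        return files, sorted(common)


def list_recursive(bucket, path, results):
    """Delimiter-based walk: one request per directory."""
    files, common_prefixes = bucket.list_objects(prefix=path, delimiter="/")
    for common_prefix in common_prefixes:  # sequential here, parallel in the issue
        list_recursive(bucket, common_prefix, results)
    results.extend(files)
    return results


def list_flat(bucket, path):
    """Prefix-only listing: one request for the whole tree."""
    files, _ = bucket.list_objects(prefix=path)
    return files


# Assumed layout: 2 years x 12 months, one file per partition folder.
keys = [f"data/year={y}/month={m}/part-0.parquet"
        for y in (2021, 2022) for m in range(1, 13)]

recursive_bucket = FakeBucket(keys)
recursive_files = list_recursive(recursive_bucket, "data/", [])
flat_bucket = FakeBucket(keys)
flat_files = list_flat(flat_bucket, "data/")

print(recursive_bucket.requests, flat_bucket.requests)  # prints: 27 1
```

Both strategies return the same 24 files. The recursive walk needs 1 + 2 + 24 requests for this layout, while the flat listing needs one.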
   
   On the face of it, the delimiter feature looks like it is always a bad 
idea (why would you want more HTTP requests?).  However, from reading some 
documentation, it seems that the point of the delimiter feature is to allow 
concurrent list objects calls.  That only pays off when there are very many 
files (examples seem to consider millions) and they are more or less evenly 
distributed across prefixes (e.g. hundreds of thousands of files per prefix).  
Furthermore, this appears to be geared toward the "container in the same 
datacenter as the S3 bucket" situation, where the per-request latency is very 
small.
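A back-of-the-envelope model of that trade-off, as a sketch only: it assumes a single level of prefixes, that a listing's pages must be fetched one after another (each page needs the previous page's continuation token), and that a ListObjectsV2 response holds at most 1,000 keys. The function names, the file counts and the concurrency of 10 are all made up for illustration, and "rounds" counts sequential request round trips, not measured time.

```python
import math

PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per response


def pages(num_files):
    """Requests needed to page through num_files keys (at least one)."""
    return max(1, math.ceil(num_files / PAGE_SIZE))


def flat_rounds(files_per_prefix):
    """One un-delimited listing: its pages are chained, so they run sequentially."""
    return pages(sum(files_per_prefix))


def delimited_rounds(files_per_prefix, concurrency):
    """One top-level request, then per-prefix listings spread over workers (estimate)."""
    per_prefix = [pages(n) for n in files_per_prefix]
    busiest = max(per_prefix)                           # longest single chain of pages
    shared = math.ceil(sum(per_prefix) / concurrency)   # work split across workers
    return 1 + max(busiest, shared)


big = [100_000] * 10   # a million files, evenly spread over 10 prefixes
small = [5] * 100      # 100 partition folders with 5 files each

print(flat_rounds(big), delimited_rounds(big, 10))      # prints: 1000 101
print(flat_rounds(small), delimited_rounds(small, 10))  # prints: 1 11
```

Under these assumptions the delimiter wins by roughly 10x for the large, evenly spread dataset and loses by roughly 10x for the small one. Multiplying the rounds by the per-request latency shows why the loss hurts most outside the datacenter.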
   
   Some users are using non-S3 technologies (e.g. MinIO), are downloading 
data from outside EC2, or simply don't have very many Parquet files per 
partition folder.
   
   This leads to very slow (20-25x slower in #34145) performance when 
discovering datasets in S3.
   
   I believe we should make the delimiter a property of the S3 filesystem and 
the default should be "no delimiter".  This ought to speed up what is probably 
the normal case and still make it possible to optimize for the case where a 
user has structured their dataset to benefit from delimiters.
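One way such a property could look, as a rough sketch: `ListingOptions` and `build_list_request` are hypothetical names, not an existing Arrow API. Only `Bucket`, `Prefix` and `Delimiter` are taken from the ListObjectsV2 request parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ListingOptions:
    """Hypothetical filesystem-level listing settings (not an Arrow API)."""
    delimiter: Optional[str] = None  # proposed default: no delimiter


def build_list_request(bucket, prefix, options=ListingOptions()):
    """Build ListObjectsV2-style request parameters for one listing call."""
    request = {"Bucket": bucket, "Prefix": prefix}
    if options.delimiter is not None:
        # Opt-in: users with large, evenly spread prefixes can walk directories.
        request["Delimiter"] = options.delimiter
    return request


default_request = build_list_request("my-bucket", "data/")
walking_request = build_list_request("my-bucket", "data/", ListingOptions(delimiter="/"))
print(default_request)
print(walking_request)
```

With the default, the request carries no delimiter and a single listing covers the whole tree. Setting `delimiter="/"` restores the current directory-walking behaviour.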
   
   ### Component(s)
   
   C++


