westonpace opened a new issue, #34213:
URL: https://github.com/apache/arrow/issues/34213
### Describe the enhancement requested
The current GetFileInfo implementation (ignoring paging/continuation) in S3
is roughly:
```
def list_dir(path, results=[]):
    # One request per "directory"; the response holds files and common prefixes
    rsp = s3_list_objects(prefix=path, delimiter="/")
    parallel for common_prefix in rsp:
        list_dir(common_prefix, results)
    for file in rsp:
        results.append(file)
```
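To make the request count concrete, here is a minimal runnable sketch of the same walk against an in-memory stand-in for a bucket. `KEYS`, `fake_list_objects`, and `request_count` are hypothetical names invented for this illustration; they are not part of Arrow or of any S3 client, and the recursion is serial where the real implementation is parallel.

```python
from typing import List, Tuple

# Hypothetical in-memory stand-in for a bucket: just a sorted list of keys.
KEYS = sorted([
    "data/year=2021/part-0.parquet",
    "data/year=2021/part-1.parquet",
    "data/year=2022/part-0.parquet",
    "data/year=2023/month=01/part-0.parquet",
])

request_count = 0  # number of simulated HTTP requests


def fake_list_objects(prefix: str, delimiter: str = "") -> Tuple[List[str], List[str]]:
    """Mimic the prefix/delimiter grouping: return (files, common_prefixes)."""
    global request_count
    request_count += 1
    files: List[str] = []
    common: List[str] = []
    for key in KEYS:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Keys that share the text up to the first delimiter are rolled up
            # into a single common prefix instead of being returned as files.
            cp = prefix + rest.split(delimiter, 1)[0] + delimiter
            if cp not in common:
                common.append(cp)
        else:
            files.append(key)
    return files, common


def list_dir(path: str, results: List[str]) -> List[str]:
    files, common_prefixes = fake_list_objects(path, delimiter="/")
    for cp in common_prefixes:  # serial here; parallel in the real code
        list_dir(cp, results)
    results.extend(files)
    return results


found = list_dir("data/", [])
print(len(found), "files in", request_count, "requests")  # 4 files in 5 requests
```

The five requests correspond to the five "directories" visited: `data/`, the three `year=` prefixes, and `month=01/`.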
The `prefix` and `delimiter` parameters are S3 constructs described in more
detail [in the S3
docs](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html).
Since we use `/` as the delimiter, this yields an experience that is very
similar to "directory walking": we issue one HTTP request for every single
directory.
Alternatively, one could simply do:
```
def list_dir(path, results=[]):
    rsp = s3_list_objects(prefix=path)
    for file in rsp:
        results.append(file)
```
Ignoring paging, this would need only one HTTP request regardless of how many
directories there are.
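Once paging is taken into account, the flat listing costs one request per page rather than exactly one. The sketch below simulates that with a hypothetical in-memory bucket; `KEYS`, `PAGE_SIZE`, `fake_list_objects`, and the integer continuation token are assumptions made for illustration (real continuation tokens are opaque strings). The page size of 1,000 reflects the ListObjectsV2 limit on keys returned per response.

```python
from typing import List, Optional, Tuple

# Hypothetical bucket: 2,500 keys spread evenly over 50 partition "directories".
KEYS = sorted(
    f"data/part={p:02d}/file-{i:02d}.parquet"
    for p in range(50)
    for i in range(50)
)
PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per response

request_count = 0  # number of simulated HTTP requests


def fake_list_objects(prefix: str, token: Optional[int] = None) -> Tuple[List[str], Optional[int]]:
    """Return one page of keys under prefix plus a token for the next page."""
    global request_count
    request_count += 1
    matching = [k for k in KEYS if k.startswith(prefix)]
    start = token or 0
    page = matching[start:start + PAGE_SIZE]
    more = start + PAGE_SIZE < len(matching)
    return page, (start + PAGE_SIZE if more else None)


def list_dir(path: str) -> List[str]:
    results: List[str] = []
    token: Optional[int] = None
    while True:
        page, token = fake_list_objects(path, token)
        results.extend(page)
        if token is None:  # no continuation token means this was the last page
            return results


found = list_dir("data/")
print(len(found), "files in", request_count, "requests")  # 2500 files in 3 requests
```

For the same layout, the delimiter-based walk would issue 51 requests: one for `data/` and one for each of the 50 partition prefixes.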
On the face of it, the delimiter feature seems like it is always a bad idea
(why would you want more HTTP requests?). From reading some documentation,
though, it seems that the point of the delimiter feature is to allow
concurrent list-objects calls. That assumes a situation where there are a
great many files (examples seem to consider millions) and they are more or
less evenly distributed across prefixes (e.g. hundreds of thousands of files
per prefix). Furthermore, this appears to be geared toward the "container in
the same datacenter as the S3 bucket" situation, where the per-request latency
is very small.
Some users are either using non-S3 technologies (e.g. MinIO), or they are
downloading data from outside EC2, or they simply don't have very many Parquet
files per partition folder.
For these users, issuing one request per directory leads to very slow
performance (20-25x slower in #34145) when discovering datasets in S3.
I believe we should make the delimiter a property of the S3 filesystem, and
the default should be "no delimiter". This ought to speed up what is probably
the normal case and still make it possible to optimize for the case where a
user has structured their dataset to benefit from delimiters.
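As a rough illustration of the proposal, the sketch below makes the delimiter an option that defaults to "no delimiter". Everything here is hypothetical: `ListingOptions`, `list_all`, and `fake_list_objects` are names made up for this example and do not reflect the actual Arrow `S3Options` or any existing API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Signature of a (hypothetical) listing call: (prefix, delimiter) -> (files, common_prefixes)
ListFn = Callable[[str, Optional[str]], Tuple[List[str], List[str]]]


@dataclass
class ListingOptions:
    # None means "no delimiter": one flat listing of everything under the prefix.
    delimiter: Optional[str] = None


def list_all(list_objects: ListFn, prefix: str, options: ListingOptions) -> List[str]:
    files, common_prefixes = list_objects(prefix, options.delimiter)
    results = list(files)
    for cp in common_prefixes:  # empty when no delimiter is set
        results.extend(list_all(list_objects, cp, options))
    return results


# Tiny in-memory stand-in for a bucket, used only to exercise both modes.
KEYS = ["a/x/1", "a/x/2", "a/y/3"]
calls: List[str] = []


def fake_list_objects(prefix: str, delimiter: Optional[str]) -> Tuple[List[str], List[str]]:
    calls.append(prefix)  # record one simulated request
    files: List[str] = []
    common: List[str] = []
    for key in KEYS:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            cp = prefix + rest.split(delimiter, 1)[0] + delimiter
            if cp not in common:
                common.append(cp)
        else:
            files.append(key)
    return files, common


flat = list_all(fake_list_objects, "a/", ListingOptions())  # default: no delimiter
flat_calls = len(calls)
calls.clear()
walked = list_all(fake_list_objects, "a/", ListingOptions(delimiter="/"))
walk_calls = len(calls)
print(flat_calls, walk_calls)  # 1 3
```

Both modes return the same files; only the number of requests differs, which is why a per-filesystem setting would let users pick whichever suits their layout and latency.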
### Component(s)
C++