orf opened a new issue, #40589:
URL: https://github.com/apache/arrow/issues/40589

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I am using the S3 Filesystem abstractions to process a large Parquet 
dataset. The job reads data in batches and incrementally produces a new set of 
Parquet files. During execution it checkpoints its work.
   
   The checkpointing code is fairly rudimentary: we incrementally write a 
Parquet file to the `checkpoint/` prefix, and periodically copy it to the 
`output/` directory using `fs.move()`. We then write a `.json` file to the 
checkpoint directory detailing the current progress. Not perfect, but works 
well enough for our needs.
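
   For concreteness, here is a minimal sketch of that checkpointing pattern 
using the pyarrow filesystem API; the bucket, key names and progress payload 
are hypothetical placeholders, not the actual job code.
   
   ```python
   import json
   import uuid
   
   from pyarrow import fs
   
   sfs = fs.S3FileSystem()
   name = f"{uuid.uuid4()}.parquet"
   
   # Incrementally write the Parquet file under checkpoint/ (placeholder bytes here)...
   with sfs.open_output_stream(f"my-bucket/checkpoint/{name}") as f:
       f.write(b"...parquet bytes...")
   
   # ...periodically publish it to output/...
   sfs.move(f"my-bucket/checkpoint/{name}", f"my-bucket/output/{name}")
   
   # ...then record progress as a tiny JSON object under checkpoint/.
   with sfs.open_output_stream("my-bucket/checkpoint/progress.json") as f:
       f.write(json.dumps({"last_batch": 42}).encode())
   ```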
   
   After implementing this I analyzed the requests to the bucket and found that 
a surprising number of requests were being made. For one job, `520,752` 
requests were made to the `checkpointing/` prefix in S3, whereas only `74,331` 
requests were made to read + write the dataset. This is a significant 
discrepancy - nearly 7x the number of requests!
   
   The following is a table of requests to the checkpoint prefix broken down by 
type:
   
   | operation | total |
   | :--- | :--- |
   | REST.GET.OBJECT | 91871 |
   | REST.POST.UPLOADS | 84285 |
   | REST.PUT.PART | 84262 |
   | REST.POST.UPLOAD | 84261 |
   | REST.COPY.OBJECT | 42131 |
   | REST.COPY.OBJECT\_GET | 42131 |
   | REST.PUT.OBJECT | 42129 |
   | REST.DELETE.OBJECT | 42129 |
   | REST.HEAD.OBJECT | 7553 |
   
   I've dug into this and here are the reasons why:
   
   `126,388` of the `REST.PUT.PART`, `REST.POST.UPLOADS` and `REST.POST.UPLOAD` 
requests are made while writing the small `.json` checkpoint files. Due to 
[this issue](https://github.com/apache/arrow/issues/40557), each creation of a 
`.json` checkpoint file requires 3 requests to S3, despite the files being less 
than 1 kB in size.
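
   In raw S3 terms (a boto3 approximation, with placeholder bucket/key names), 
each of those tiny writes currently looks like the three-request sequence below 
rather than a single `PutObject`:
   
   ```python
   import boto3
   
   s3 = boto3.client("s3")
   body = b'{"last_batch": 42}'  # < 1 kB
   
   # What each tiny checkpoint write effectively costs today: three requests.
   upload = s3.create_multipart_upload(Bucket="my-bucket", Key="checkpoint/progress.json")
   part = s3.upload_part(Bucket="my-bucket", Key="checkpoint/progress.json",
                         UploadId=upload["UploadId"], PartNumber=1, Body=body)
   s3.complete_multipart_upload(
       Bucket="my-bucket", Key="checkpoint/progress.json", UploadId=upload["UploadId"],
       MultipartUpload={"Parts": [{"ETag": part["ETag"], "PartNumber": 1}]},
   )
   
   # What it could be: one request.
   s3.put_object(Bucket="my-bucket", Key="checkpoint/progress.json", Body=body)
   ```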
   
   A further `126,420` requests are also multipart upload requests, made while 
creating the Parquet files. Looking at the statistics, ~50% of these requests 
could have been avoided by buffering an initial 30 megabytes before initiating 
a multipart upload.
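
   For comparison, boto3's transfer manager already implements this kind of 
policy: it only initiates a multipart upload once an object exceeds 
`multipart_threshold`, and issues a single `PutObject` otherwise. A hedged 
sketch (placeholder file/bucket names):
   
   ```python
   import boto3
   from boto3.s3.transfer import TransferConfig
   
   s3 = boto3.client("s3")
   
   # Below the threshold: a single PutObject. Above it: a multipart upload.
   config = TransferConfig(multipart_threshold=30 * 1024 * 1024)  # ~30 MB, per the estimate above
   s3.upload_file("part-00000.parquet", "my-bucket", "output/part-00000.parquet", Config=config)
   ```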
   
   The `42,129` requests with the type `REST.PUT.OBJECT` are due to the 
implementation of `move()` and `delete()`: it currently attempts to [re-create 
the parent 
directory](https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/filesystem/s3fs.cc#L2849)
 after a copy or delete - this is because if there is only a single file in the 
prefix and we move/delete it, then the prefix will no longer exist. The 
workaround as implemented is to create a 0-sized object with the name of the 
prefix, ensuring that it still "exists".
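
   Expressed as raw S3 requests (a boto3 approximation, placeholder names), a 
single `move()` of the last file in a prefix currently amounts to:
   
   ```python
   import boto3
   
   s3 = boto3.client("s3")
   
   # Copy + delete implement the move itself...
   s3.copy_object(Bucket="my-bucket", Key="output/file.parquet",
                  CopySource={"Bucket": "my-bucket", "Key": "checkpoint/file.parquet"})
   s3.delete_object(Bucket="my-bucket", Key="checkpoint/file.parquet")
   
   # ...plus a 0-byte "directory marker" so the now-empty parent prefix still "exists".
   s3.put_object(Bucket="my-bucket", Key="checkpoint/", Body=b"")
   ```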
   
   The `7,553` requests with the type `REST.HEAD.OBJECT` come in part from the 
implementation of `DeleteObject`, where [we make a HeadObject request before 
deleting a 
key](https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/filesystem/s3fs.cc#L2805).
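
   Again as a boto3 approximation (placeholder names), deleting a single key 
currently costs two requests:
   
   ```python
   import boto3
   
   s3 = boto3.client("s3")
   
   # HeadObject first, then the actual delete.
   s3.head_object(Bucket="my-bucket", Key="checkpoint/progress.json")
   s3.delete_object(Bucket="my-bucket", Key="checkpoint/progress.json")
   ```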
   
   # Performance with versioned buckets
   
   While it's noble to attempt to create a proper "filesystem" facade over S3, 
there are inherent issues with this in terms of cost and performance.
   
   One major thing that worries me [is 
`EnsureParentExists()`](https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/filesystem/s3fs.cc#L2521), 
which is called from the `DeleteDir`, `DeleteFile` and `Move` methods. In a 
versioned bucket, this will repeatedly create empty keys to mimic a directory.
   
   Given a pathological case where you do something like this from multiple 
processes/threads:
   
   ```python
   import uuid
   
   from pyarrow import fs
   
   sfs = fs.S3FileSystem()
   
   for _ in range(10000):
       # Each move deletes the only object under some_directory/, so the parent
       # "directory" marker is re-created on every iteration.
       path = f"a_bucket/some_directory/{uuid.uuid4()}"
       with sfs.open_output_stream(path) as fd:
           fd.write(b'hi')
       sfs.move(path, f"a_bucket/some_other_directory/{uuid.uuid4()}")
   ```
   
   Then we will end up with many tens of thousands of versioned objects with 
the key `some_directory/`.
   
   One lesser-known thing about S3 versioned buckets is that `list_objects_v2` 
(and `list_objects`) calls have to skip over deleted/noncurrent versions of 
objects. While this is usually fast, it can get _very_ slow - I've seen 
prefixes take over a minute to list due to the number of noncurrent/deleted 
objects.
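
   If you want to see how much of this garbage a prefix has accumulated, 
something like the following boto3 snippet works (placeholder bucket/prefix):
   
   ```python
   import boto3
   
   s3 = boto3.client("s3")
   paginator = s3.get_paginator("list_object_versions")
   
   noncurrent = delete_markers = 0
   for page in paginator.paginate(Bucket="a_bucket", Prefix="some_directory/"):
       noncurrent += sum(1 for v in page.get("Versions", []) if not v["IsLatest"])
       delete_markers += len(page.get("DeleteMarkers", []))
   
   # Every list_objects_v2 call over this prefix has to skip all of these internally.
   print(f"noncurrent versions: {noncurrent}, delete markers: {delete_markers}")
   ```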
   
   Obviously lifecycle policies can clean these out, but that only happens once 
a day or so. 
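
   For reference, a rule along these lines (hypothetical bucket name) would 
expire the noncurrent versions, but lifecycle evaluation itself still only runs 
roughly once a day:
   
   ```python
   import boto3
   
   s3 = boto3.client("s3")
   
   # Expire noncurrent versions a day after they become noncurrent.
   s3.put_bucket_lifecycle_configuration(
       Bucket="a_bucket",
       LifecycleConfiguration={
           "Rules": [{
               "ID": "expire-noncurrent",
               "Status": "Enabled",
               "Filter": {"Prefix": ""},
               "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
           }]
       },
   )
   ```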
   
   ### Component(s)
   
   C++, Python

