wence- opened a new issue, #34118: URL: https://github.com/apache/arrow/issues/34118
### Describe the enhancement requested

When calling `DoInitializeS3`, arrow initialises the AWS API, which by default creates a thread pool for the background AWS event loop that uses one thread per physical core on the system. This is particularly noticeable when using pyarrow, where merely importing `from pyarrow.fs import PyFileSystem` will cause eager initialisation of the AWS API, so you're paying (something) for these threads even when you have no intention of using them.

This is rather unfriendly when running a multi-process or otherwise parallelised job on a multicore box, since it leads to oversubscription. Moreover, it may well not be the best option depending on the number of concurrent connections this arrow process is going to be making to the AWS API: quoth [the documentation](https://awslabs.github.io/aws-crt-cpp/class_aws_1_1_crt_1_1_io_1_1_event_loop_group.html)

> The number of threads used depends on your use-case. IF you have a maximum of less than a few hundred connections 1 thread is the ideal threadCount.

It would be nice if there were a way to control the size of this thread pool in the same way that one can control the number of IO threads arrow uses via `ARROW_IO_THREADS`. [Aside: AFAICT there's no programmatic way of controlling _arrow's_ thread pool size; it must be done via environment variables, which is also rather unfriendly.]

I think the following diff is a sketch in this direction, although it just unilaterally restricts the AWS event loop thread pool to a single thread.

```patch
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 16ffe2526..a71ec93f7 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2604,6 +2604,13 @@ Status DoInitializeS3(const S3GlobalOptions& options) {
   // This configuration options is only available with AWS SDK 1.9.272 and later.
   aws_options.httpOptions.compliantRfc3986Encoding = true;
 #endif
+  aws_options.ioOptions.clientBootstrap_create_fn = []() {
+    Aws::Crt::Io::EventLoopGroup eventLoopGroup(1);
+    Aws::Crt::Io::DefaultHostResolver defaultHostResolver(eventLoopGroup, 8, 30);
+    auto clientBootstrap = Aws::MakeShared<Aws::Crt::Io::ClientBootstrap>(ALLOCATION_TAG, eventLoopGroup, defaultHostResolver);
+    clientBootstrap->EnableBlockingShutdown();
+    return clientBootstrap;
+  };
   Aws::InitAPI(aws_options);
   aws_initialized.store(true);
   return Status::OK();
```

I'm not really familiar with the arrow layout, so I don't know how to plumb this into any existing configuration options: is there such a thing, or would one just introduce a (new?) env var?

### Component(s)

C++
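
If the env var route were taken, the thread count could be read once at initialisation time and passed to `Aws::Crt::Io::EventLoopGroup` in place of the hard-coded `1` in the diff above. The following is only a minimal sketch under stated assumptions: the variable name `ARROW_S3_EVENT_LOOP_THREADS` and both helper functions are hypothetical (chosen by analogy with `ARROW_IO_THREADS`) and do not exist in arrow. Invalid or missing values fall back to a default of one thread.

```cpp
#include <cerrno>
#include <cstdlib>
#include <optional>

// Hypothetical variable name, by analogy with ARROW_IO_THREADS.
// This is an assumption for illustration, not an existing arrow setting.
constexpr const char* kS3EventLoopThreadsEnvVar = "ARROW_S3_EVENT_LOOP_THREADS";

// Parse a strictly positive thread count from `text`.
// Returns std::nullopt for null, empty, non-numeric, non-positive,
// or out-of-range input (the upper bound matches a 16-bit thread count).
std::optional<int> ParseThreadCount(const char* text) {
  if (text == nullptr || *text == '\0') return std::nullopt;
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (errno != 0 || *end != '\0' || value <= 0 || value > 65535) {
    return std::nullopt;
  }
  return static_cast<int>(value);
}

// Hypothetical helper: thread count from the environment, or `fallback`
// when the variable is unset or does not hold a valid count.
int S3EventLoopThreads(int fallback = 1) {
  return ParseThreadCount(std::getenv(kS3EventLoopThreadsEnvVar))
      .value_or(fallback);
}
```

With something like this in place, the lambda in the diff would construct `eventLoopGroup(S3EventLoopThreads())` rather than `eventLoopGroup(1)`. Falling back to a fixed small default instead of the SDK default is itself a design choice that would need discussion.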