wence- opened a new issue, #34118:
URL: https://github.com/apache/arrow/issues/34118

   ### Describe the enhancement requested
   
   When calling `DoInitializeS3`, arrow initialises the AWS API, which 
by default creates a thread pool for the background AWS event loop that uses 
one thread per physical core on the system.
   
   This is particularly noticeable when using pyarrow, where merely running 
`from pyarrow.fs import PyFileSystem` causes eager initialisation of the 
AWS API, so you're paying (something) for these threads even when you have no 
intention of using them.
   
   This is rather unfriendly when running a multi-process or otherwise 
parallelised workload on a multicore box, since it leads to oversubscription. 
Moreover, it may well not be the best option depending on the number of 
concurrent connections this arrow process is going to make to the AWS API. 
To quote [the 
documentation](https://awslabs.github.io/aws-crt-cpp/class_aws_1_1_crt_1_1_io_1_1_event_loop_group.html):
   
   > The number of threads used depends on your use-case. IF you have a maximum 
of less than a few hundred connections 1 thread is the ideal threadCount.
   
   It would be nice if there were a way to control the size of this thread pool 
in the same way that one can control the number of IO threads arrow uses via 
`ARROW_IO_THREADS`. [Aside: AFAICT there's no programmatic way of controlling 
_arrow's_ thread pool size; it must be done via environment variables, which is 
also rather unfriendly.]
   
   I think the following diff is a sketch in this direction, although 
it just unilaterally sets the size of the event loop thread pool to a single 
thread.
   
   ```patch
   diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
   index 16ffe2526..a71ec93f7 100644
   --- a/cpp/src/arrow/filesystem/s3fs.cc
   +++ b/cpp/src/arrow/filesystem/s3fs.cc
   @@ -2604,6 +2604,13 @@ Status DoInitializeS3(const S3GlobalOptions& options) {
      // This configuration options is only available with AWS SDK 1.9.272 and later.
      aws_options.httpOptions.compliantRfc3986Encoding = true;
    #endif
   +  aws_options.ioOptions.clientBootstrap_create_fn = []() {
   +    Aws::Crt::Io::EventLoopGroup eventLoopGroup(1);
   +    Aws::Crt::Io::DefaultHostResolver defaultHostResolver(eventLoopGroup, 8, 30);
   +    auto clientBootstrap = Aws::MakeShared<Aws::Crt::Io::ClientBootstrap>(ALLOCATION_TAG, eventLoopGroup, defaultHostResolver);
   +    clientBootstrap->EnableBlockingShutdown();
   +    return clientBootstrap;
   +  };
      Aws::InitAPI(aws_options);
      aws_initialized.store(true);
      return Status::OK();
   ```
   
   I'm not really familiar with the arrow layout so I don't know how to plumb 
this in to any configuration options that might abound: is there such a thing 
or would one just introduce a (new?) env var?
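   
   If an env var turns out to be the way to go, the lookup could be as small as 
the sketch below. The variable name `ARROW_S3_EVENT_LOOP_THREADS` and both 
helper functions are hypothetical (nothing in arrow reads or defines them 
today); this only illustrates parsing a positive thread count with a fallback.
   
   ```cpp
   #include <cstddef>
   #include <cstdlib>
   #include <exception>
   #include <string>

   // Hypothetical env var name, chosen by analogy with ARROW_IO_THREADS.
   constexpr const char* kS3EventLoopThreadsEnvVar = "ARROW_S3_EVENT_LOOP_THREADS";

   // Parse a thread count from `value`. Returns `fallback` when the value is
   // null, empty, has trailing non-numeric characters, is not positive, or
   // does not fit in an int.
   int ParseThreadCount(const char* value, int fallback) {
     if (value == nullptr || *value == '\0') return fallback;
     try {
       const std::string text(value);
       std::size_t pos = 0;
       const int parsed = std::stoi(text, &pos);
       // Reject inputs such as "4x" (trailing garbage) and counts <= 0.
       if (pos != text.size() || parsed <= 0) return fallback;
       return parsed;
     } catch (const std::exception&) {
       // std::stoi throws on non-numeric or out-of-range input.
       return fallback;
     }
   }

   // Read the (hypothetical) env var, defaulting to `fallback` when unset.
   int S3EventLoopThreads(int fallback) {
     return ParseThreadCount(std::getenv(kS3EventLoopThreadsEnvVar), fallback);
   }
   ```
   
   The result of something like `S3EventLoopThreads(1)` could then be passed to 
the `EventLoopGroup` constructor in the patch above in place of the hard-coded 
`1`, presumably after checking that it fits the constructor's thread count type.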
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
