HonahX commented on code in PR #12503:
URL: https://github.com/apache/iceberg/pull/12503#discussion_r1994096171
##########
docs/docs/aws.md:
##########
@@ -565,6 +565,81 @@ spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCata
 For more details on using S3 Acceleration, please refer to [Configuring fast, secure file transfers using Amazon S3 Transfer Acceleration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration.html).
 
+### S3 Analytics Accelerator
+
+The [Analytics Accelerator Library for Amazon S3](https://github.com/awslabs/analytics-accelerator-s3) helps you accelerate access to Amazon S3 data from your applications. This open-source solution reduces processing times and compute costs for your data analytics workloads.
+
+In order to enable S3 Analytics Accelerator Library to work in Iceberg, you can set the `s3.analytics-accelerator.enabled` catalog property to `true`. By default, this property is set to `false`.
+
+For example, to use S3 Analytics Accelerator with Spark, you can start the Spark SQL shell with:
+```
+spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket2/my/key/prefix \
+    --conf spark.sql.catalog.my_catalog.type=glue \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
+    --conf spark.sql.catalog.my_catalog.s3.analytics-accelerator.enabled=true
+```
+
+The Analytics Accelerator Library can work with either the [S3 CRT client](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/crt-based-s3-client.html) or the [S3AsyncClient](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3AsyncClient.html). The library recommends that you use the S3 CRT client due to its enhanced connection pool management and [higher throughput on downloads](https://aws.amazon.com/blogs/developer/introducing-crt-based-s3-client-and-the-s3-transfer-manager-in-the-aws-sdk-for-java-2-x/).
+
+#### Client Configuration
+
+| Property               | Default | Description                                                   |
+|------------------------|---------|---------------------------------------------------------------|
+| s3.crt.enabled         | `true`  | Controls if the S3 Async clients should be created using CRT  |
+| s3.crt.max-concurrency | `500`   | Max concurrency for S3 CRT clients                            |
+
+Additional library specific configurations are organized into the following sections:
+
+##### Logical IO Configuration

Review Comment:
   ```suggestion
   #### Logical IO Configuration
   ```



##########
aws/src/main/java/org/apache/iceberg/aws/AwsClientProperties.java:
##########
@@ -135,6 +136,21 @@ public <T extends AwsClientBuilder> void applyClientRegionConfiguration(T builde
     }
   }
+  /**
+   * Configure a S3 CRT client region.

Review Comment:
   ```suggestion
   * Configure an S3 CRT client region.
   ```
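
For context, a minimal sketch (not part of the PR diff) of what the `s3.crt.enabled` and `s3.crt.max-concurrency` properties roughly correspond to in the AWS SDK for Java 2.x. The region value is a placeholder, and the exact wiring inside `S3FileIO`/`AwsClientProperties` is an assumption rather than code copied from the PR:

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;

public class CrtClientSketch {
  public static void main(String[] args) {
    // With s3.crt.enabled=true (the documented default), a CRT-based async client is
    // expected, roughly equivalent to the following. Values here are illustrative.
    try (S3AsyncClient crtClient =
        S3AsyncClient.crtBuilder()
            .region(Region.US_EAST_1) // placeholder; normally resolved from client/region config
            .maxConcurrency(500) // mirrors the documented s3.crt.max-concurrency default
            .build()) {
      System.out.println("CRT-based async client built: " + crtClient);
    }

    // With s3.crt.enabled=false, a standard (non-CRT) S3AsyncClient would presumably be
    // built instead.
    try (S3AsyncClient plainClient =
        S3AsyncClient.builder().region(Region.US_EAST_1).build()) {
      System.out.println("Standard async client built: " + plainClient);
    }
  }
}
```

The CRT path is the one the quoted docs recommend for its connection pool management and download throughput; the standard builder is shown only to illustrate what disabling CRT would fall back to.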
##########
docs/docs/aws.md:
##########
+##### Logical IO Configuration
+
+| Property | Default | Description |
+|---|---|---|
+| s3.analytics-accelerator.logicalio.prefetch.footer.enabled | `true` | Controls whether footer prefetching is enabled |
+| s3.analytics-accelerator.logicalio.prefetch.page.index.enabled | `true` | Controls whether page index prefetching is enabled |
+| s3.analytics-accelerator.logicalio.prefetch.file.metadata.size | `32KB` | Size of metadata to prefetch for regular files |
+| s3.analytics-accelerator.logicalio.prefetch.large.file.metadata.size | `1MB` | Size of metadata to prefetch for large files |
+| s3.analytics-accelerator.logicalio.prefetch.file.page.index.size | `1MB` | Size of page index to prefetch for regular files |
+| s3.analytics-accelerator.logicalio.prefetch.large.file.page.index.size | `8MB` | Size of page index to prefetch for large files |
+| s3.analytics-accelerator.logicalio.large.file.size | `1GB` | Threshold to consider a file as large |
+| s3.analytics-accelerator.logicalio.small.objects.prefetching.enabled | `true` | Controls prefetching for small objects |
+| s3.analytics-accelerator.logicalio.small.object.size.threshold | `3MB` | Size threshold for small object prefetching |
+| s3.analytics-accelerator.logicalio.parquet.metadata.store.size | `45` | Size of the parquet metadata store |
+| s3.analytics-accelerator.logicalio.max.column.access.store.size | `15` | Maximum size of column access store |
+| s3.analytics-accelerator.logicalio.parquet.format.selector.regex | `^.*.(parquet\|par)$` | Regex pattern to identify parquet files |
+| s3.analytics-accelerator.logicalio.prefetching.mode | `ROW_GROUP` | Prefetching mode (valid values: `OFF`, `ALL`, `ROW_GROUP`, `COLUMN_BOUND`) |
+
+##### Physical IO Configuration
+
+| Property | Default | Description |
+|---|---|---|
+| s3.analytics-accelerator.physicalio.metadatastore.capacity | `50` | Capacity of the metadata store |
+| s3.analytics-accelerator.physicalio.blocksizebytes | `8MB` | Size of blocks for data transfer |
+| s3.analytics-accelerator.physicalio.readaheadbytes | `64KB` | Number of bytes to read ahead |
+| s3.analytics-accelerator.physicalio.maxrangesizebytes | `8MB` | Maximum size of range requests |
+| s3.analytics-accelerator.physicalio.partsizebytes | `8MB` | Size of individual parts for transfer |
+| s3.analytics-accelerator.physicalio.sequentialprefetch.base | `2.0` | Base factor for sequential prefetch sizing |
+| s3.analytics-accelerator.physicalio.sequentialprefetch.speed | `1.0` | Speed factor for sequential prefetch growth |
+
+##### Telemetry Configuration
+
+| Property | Default | Description |
+|---|---|---|
+| s3.analytics-accelerator.telemetry.level | `STANDARD` | Telemetry detail level (valid values: `CRITICAL`, `STANDARD`, `VERBOSE`) |
+| s3.analytics-accelerator.telemetry.std.out.enabled | `false` | Enable stdout telemetry output |
+| s3.analytics-accelerator.telemetry.logging.enabled | `true` | Enable logging telemetry output |
+| s3.analytics-accelerator.telemetry.aggregations.enabled | `false` | Enable telemetry aggregations |
+| s3.analytics-accelerator.telemetry.aggregations.flush.interval.seconds | `-1` | Interval to flush aggregated telemetry |
+| s3.analytics-accelerator.telemetry.logging.level | `INFO` | Log level for telemetry |
+| s3.analytics-accelerator.telemetry.logging.name | `com.amazon.connector.s3.telemetry` | Logger name for telemetry |
+| s3.analytics-accelerator.telemetry.format | `default` | Telemetry output format (valid values: `json`, `default`) |
+
+##### Object Client Configuration

Review Comment:
   ```suggestion
   #### Object Client Configuration
   ```
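
As a usage illustration for the tables quoted above (again, not part of the PR), the same keys can be supplied as catalog properties when constructing `S3FileIO` directly. The chosen values and the `s3://` path are placeholders, and the specific property subset shown is just an example:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.iceberg.aws.s3.S3FileIO;
import org.apache.iceberg.io.InputFile;

public class AnalyticsAcceleratorPropertiesSketch {
  public static void main(String[] args) {
    // Keys come from the tables quoted above; the values and the path are placeholders,
    // not recommendations from the PR.
    Map<String, String> properties = new HashMap<>();
    properties.put("s3.analytics-accelerator.enabled", "true");
    // Client: keep the CRT-based async client but shrink its connection pool.
    properties.put("s3.crt.enabled", "true");
    properties.put("s3.crt.max-concurrency", "100");
    // Logical IO: keep footer prefetching on and prefetch at row-group granularity.
    properties.put("s3.analytics-accelerator.logicalio.prefetch.footer.enabled", "true");
    properties.put("s3.analytics-accelerator.logicalio.prefetching.mode", "ROW_GROUP");

    // S3FileIO picks these up through its standard initialize(Map) hook, the same map a
    // catalog passes when io-impl is set to org.apache.iceberg.aws.s3.S3FileIO.
    S3FileIO io = new S3FileIO();
    io.initialize(properties);

    // Reads issued through this FileIO would then go through the accelerator
    // (bucket and key below are placeholders, requiring real AWS credentials to resolve).
    InputFile file = io.newInputFile("s3://my-bucket2/my/key/prefix/data/part-00000.parquet");
    System.out.println("file length: " + file.getLength());

    io.close();
  }
}
```

When using Spark instead, each of these keys would be prefixed with `spark.sql.catalog.my_catalog.`, as in the spark-sql example quoted earlier in the hunk.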
##########
docs/docs/aws.md:
##########
+##### Telemetry Configuration

Review Comment:
   ```suggestion
   #### Telemetry Configuration
   ```
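
A similarly hedged sketch for the telemetry table: the keys are taken from the table quoted above, while the chosen values are debugging-oriented assumptions rather than guidance from the PR:

```java
import java.util.HashMap;
import java.util.Map;

public class TelemetryPropertiesSketch {
  public static void main(String[] args) {
    // Keys come from the telemetry table above; values are illustrative only and may not
    // be appropriate for production workloads.
    Map<String, String> telemetry = new HashMap<>();
    telemetry.put("s3.analytics-accelerator.enabled", "true");
    // Raise detail from the STANDARD default to VERBOSE and mirror events to stdout.
    telemetry.put("s3.analytics-accelerator.telemetry.level", "VERBOSE");
    telemetry.put("s3.analytics-accelerator.telemetry.std.out.enabled", "true");
    // Turn on aggregations and flush them every 60 seconds (documented default interval is -1).
    telemetry.put("s3.analytics-accelerator.telemetry.aggregations.enabled", "true");
    telemetry.put("s3.analytics-accelerator.telemetry.aggregations.flush.interval.seconds", "60");

    // These entries would be merged into the same catalog property map shown in the
    // earlier S3FileIO sketch.
    telemetry.forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```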
##########
docs/docs/aws.md:
##########
+##### Physical IO Configuration

Review Comment:
   ```suggestion
   #### Physical IO Configuration
   ```


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org