HonahX commented on code in PR #12503: URL: https://github.com/apache/iceberg/pull/12503#discussion_r1992593830
########## docs/docs/aws.md: ########## @@ -565,6 +565,81 @@ spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCata For more details on using S3 Acceleration, please refer to [Configuring fast, secure file transfers using Amazon S3 Transfer Acceleration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration.html). +### S3 Analytics Accelerator + +The [Analytics Accelerator Library for Amazon S3](https://github.com/awslabs/analytics-accelerator-s3) helps you accelerate access to Amazon S3 data from your applications. This open-source solution reduces processing times and compute costs for your data analytics workloads. + +In order to enable S3 Analytics Accelerator Library to work in Iceberg, you can set the `s3.analytics-accelerator.enabled` catalog property to `true`. By default, this property is set to `false`. + +For example, to use S3 Analytics Accelerator with Spark, you can start the Spark SQL shell with: +``` +spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket2/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.type=glue \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ + --conf spark.sql.catalog.my_catalog.s3.analytics-accelerator.enabled=true +``` + +The Analytics Accelerator Library can work with either the [S3 CRT client](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/crt-based-s3-client.html) or the [S3AsyncClient](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3AsyncClient.html). The library recommends that you use the S3 CRT client due to its enhanced connection pool management and [higher throughput on downloads](https://aws.amazon.com/blogs/developer/introducing-crt-based-s3-client-and-the-s3-transfer-manager-in-the-aws-sdk-for-java-2-x/). + +##### Client Configuration Review Comment: ```suggestion #### Client Configuration ``` Shall we use H4 heading here since the `S3 Analytics Accelerator` is H3? ########## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ########## @@ -98,6 +100,17 @@ public class S3FileIOProperties implements Serializable { public static final boolean S3_CRT_ENABLED_DEFAULT = true; + /** This property is used to specify the max concurrency for CRT Async clients. */ + public static final String S3_CRT_MAX_CONCURRENCY = "s3.crt.max-concurrency"; + + /** + * To fully benefit from the analytics-accelerator-s3 library where this S3 CRT client is used, it + * is recommended to initialize with higher concurrency. + * + * <p>For more details, see: https://github.com/awslabs/analytics-accelerator-s3 + */ + public static final int S3_CRT_MAX_CONCURRENCY_DEFAULT = 500; Review Comment: Could we add a test for the new property? Currently we have https://github.com/apache/iceberg/blob/3dba6afb789a420373e44ac29ecdef866bd7ebee/aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIOProperties.java#L55 for default values and https://github.com/apache/iceberg/blob/3dba6afb789a420373e44ac29ecdef866bd7ebee/aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIOProperties.java#L128 for custom values ########## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ########## @@ -1098,6 +1150,10 @@ public <T extends S3ClientBuilder> void applyUserAgentConfigurations(T builder) .build()); } + public S3CrtAsyncClientBuilder applyS3CrtConfigurations(S3CrtAsyncClientBuilder builder) { + return builder.maxConcurrency(s3CrtMaxConcurrency()); + } + Review Comment: Could we add mock test similar to https://github.com/apache/iceberg/blob/3dba6afb789a420373e44ac29ecdef866bd7ebee/aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIOProperties.java#L487-L495 ? ########## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ########## @@ -98,6 +100,17 @@ public class S3FileIOProperties implements Serializable { public static final boolean S3_CRT_ENABLED_DEFAULT = true; + /** This property is used to specify the max concurrency for CRT Async clients. */ Review Comment: Currently, we have inconsistencies in naming: `CRT Async Client`, `S3 CRT client`, `CRT client`, and `S3 CRT Async client`. How about unifying the terminology in the docstring for the CRT-based S3 client? Maybe we standardize it as `S3 CRT client` everywhere? ########## aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java: ########## @@ -1004,6 +1026,36 @@ public <T extends S3ClientBuilder> void applyEndpointConfigurations(T builder) { } } + /** + * Override the endpoint for an S3 async client + * + * <p>Sample usage: + * + * <pre> + * S3AsyncClient.builder().applyMutation(s3FileIOProperties::applyEndpointConfigurations) + * </pre> + */ + public <T extends S3AsyncClientBuilder> void applyEndpointConfigurations(T builder) { + if (endpoint != null) { + builder.endpointOverride(URI.create(endpoint)); + } + } + + /** + * Override the endpoint for an S3 CRT async client + * + * <p>Sample usage: + * + * <pre> + * S3AsyncClient.crtBuilder().applyMutation(s3FileIOProperties::applyEndpointConfigurations) + * </pre> + */ + public <T extends S3CrtAsyncClientBuilder> void applyEndpointConfigurations(T builder) { + if (endpoint != null) { + builder.endpointOverride(URI.create(endpoint)); + } + } Review Comment: Could we add mock test for S3AsyncClient and S3SrcAsyncClientBuilder in https://github.com/apache/iceberg/blob/3dba6afb789a420373e44ac29ecdef866bd7ebee/aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIOProperties.java#L487-L494 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org