HonahX commented on code in PR #12503:
URL: https://github.com/apache/iceberg/pull/12503#discussion_r1994096171
##########
docs/docs/aws.md:
##########
@@ -565,6 +565,81 @@ spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCata
 For more details on using S3 Acceleration, please refer to [Configuring fast, secure file transfers using Amazon S3 Transfer Acceleration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration.html).
 
+### S3 Analytics Accelerator
+
+The [Analytics Accelerator Library for Amazon S3](https://github.com/awslabs/analytics-accelerator-s3) helps you accelerate access to Amazon S3 data from your applications. This open-source solution reduces processing times and compute costs for your data analytics workloads.
+
+In order to enable S3 Analytics Accelerator Library to work in Iceberg, you can set the `s3.analytics-accelerator.enabled` catalog property to `true`. By default, this property is set to `false`.
+
+For example, to use S3 Analytics Accelerator with Spark, you can start the Spark SQL shell with:
+```
+spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket2/my/key/prefix \
+    --conf spark.sql.catalog.my_catalog.type=glue \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
+    --conf spark.sql.catalog.my_catalog.s3.analytics-accelerator.enabled=true
+```
+
+The Analytics Accelerator Library can work with either the [S3 CRT client](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/crt-based-s3-client.html) or the [S3AsyncClient](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3AsyncClient.html). The library recommends that you use the S3 CRT client due to its enhanced connection pool management and [higher throughput on downloads](https://aws.amazon.com/blogs/developer/introducing-crt-based-s3-client-and-the-s3-transfer-manager-in-the-aws-sdk-for-java-2-x/).
+
+#### Client Configuration
+
+| Property               | Default | Description                                                   |
+|------------------------|---------|---------------------------------------------------------------|
+| s3.crt.enabled         | `true`  | Controls if the S3 Async clients should be created using CRT  |
+| s3.crt.max-concurrency | `500`   | Max concurrency for S3 CRT clients                            |
+
+Additional library specific configurations are organized into the following sections:
+
+##### Logical IO Configuration

Review Comment:
   ```suggestion
   #### Logical IO Configuration
   ```



##########
aws/src/main/java/org/apache/iceberg/aws/AwsClientProperties.java:
##########
@@ -135,6 +136,21 @@ public <T extends AwsClientBuilder> void applyClientRegionConfiguration(T builde
     }
   }
+  /**
+   * Configure a S3 CRT client region.

Review Comment:
   ```suggestion
   * Configure an S3 CRT client region.
   ```
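
For context, a minimal sketch (not part of the PR diff) of what the `s3.crt.enabled` and `s3.crt.max-concurrency` properties roughly correspond to in the AWS SDK for Java 2.x. The region value is a placeholder, and the exact wiring inside `S3FileIO`/`AwsClientProperties` is an assumption rather than code copied from the PR:

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;

public class CrtClientSketch {
  public static void main(String[] args) {
    // With s3.crt.enabled=true (the documented default), a CRT-based async client is
    // expected, roughly equivalent to the following. Values here are illustrative.
    try (S3AsyncClient crtClient =
        S3AsyncClient.crtBuilder()
            .region(Region.US_EAST_1) // placeholder; normally resolved from client/region config
            .maxConcurrency(500) // mirrors the documented s3.crt.max-concurrency default
            .build()) {
      System.out.println("CRT-based async client built: " + crtClient);
    }

    // With s3.crt.enabled=false, a standard (non-CRT) S3AsyncClient would presumably be
    // built instead.
    try (S3AsyncClient plainClient =
        S3AsyncClient.builder().region(Region.US_EAST_1).build()) {
      System.out.println("Standard async client built: " + plainClient);
    }
  }
}
```

The CRT path is the one the quoted docs recommend for its connection pool management and download throughput; the standard builder is shown only to illustrate what disabling CRT would fall back to.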
##########
docs/docs/aws.md:
##########
+##### Logical IO Configuration
+
+| Property | Default | Description |
+|---|---|---|
+| s3.analytics-accelerator.logicalio.prefetch.footer.enabled | `true` | Controls whether footer prefetching is enabled |
+| s3.analytics-accelerator.logicalio.prefetch.page.index.enabled | `true` | Controls whether page index prefetching is enabled |
+| s3.analytics-accelerator.logicalio.prefetch.file.metadata.size | `32KB` | Size of metadata to prefetch for regular files |
+| s3.analytics-accelerator.logicalio.prefetch.large.file.metadata.size | `1MB` | Size of metadata to prefetch for large files |
+| s3.analytics-accelerator.logicalio.prefetch.file.page.index.size | `1MB` | Size of page index to prefetch for regular files |
+| s3.analytics-accelerator.logicalio.prefetch.large.file.page.index.size | `8MB` | Size of page index to prefetch for large files |
+| s3.analytics-accelerator.logicalio.large.file.size | `1GB` | Threshold to consider a file as large |
+| s3.analytics-accelerator.logicalio.small.objects.prefetching.enabled | `true` | Controls prefetching for small objects |
+| s3.analytics-accelerator.logicalio.small.object.size.threshold | `3MB` | Size threshold for small object prefetching |
+| s3.analytics-accelerator.logicalio.parquet.metadata.store.size | `45` | Size of the parquet metadata store |
+| s3.analytics-accelerator.logicalio.max.column.access.store.size | `15` | Maximum size of column access store |
+| s3.analytics-accelerator.logicalio.parquet.format.selector.regex | `^.*.(parquet\|par)$` | Regex pattern to identify parquet files |
+| s3.analytics-accelerator.logicalio.prefetching.mode | `ROW_GROUP` | Prefetching mode (valid values: `OFF`, `ALL`, `ROW_GROUP`, `COLUMN_BOUND`) |
+
+##### Physical IO Configuration
+
+| Property | Default | Description |
+|---|---|---|
+| s3.analytics-accelerator.physicalio.metadatastore.capacity | `50` | Capacity of the metadata store |
+| s3.analytics-accelerator.physicalio.blocksizebytes | `8MB` | Size of blocks for data transfer |
+| s3.analytics-accelerator.physicalio.readaheadbytes | `64KB` | Number of bytes to read ahead |
+| s3.analytics-accelerator.physicalio.maxrangesizebytes | `8MB` | Maximum size of range requests |
+| s3.analytics-accelerator.physicalio.partsizebytes | `8MB` | Size of individual parts for transfer |
+| s3.analytics-accelerator.physicalio.sequentialprefetch.base | `2.0` | Base factor for sequential prefetch sizing |
+| s3.analytics-accelerator.physicalio.sequentialprefetch.speed | `1.0` | Speed factor for sequential prefetch growth |
+
+##### Telemetry Configuration
+
+| Property | Default | Description |
+|---|---|---|
+| s3.analytics-accelerator.telemetry.level | `STANDARD` | Telemetry detail level (valid values: `CRITICAL`, `STANDARD`, `VERBOSE`) |
+| s3.analytics-accelerator.telemetry.std.out.enabled | `false` | Enable stdout telemetry output |
+| s3.analytics-accelerator.telemetry.logging.enabled | `true` | Enable logging telemetry output |
+| s3.analytics-accelerator.telemetry.aggregations.enabled | `false` | Enable telemetry aggregations |
+| s3.analytics-accelerator.telemetry.aggregations.flush.interval.seconds | `-1` | Interval to flush aggregated telemetry |
+| s3.analytics-accelerator.telemetry.logging.level | `INFO` | Log level for telemetry |
+| s3.analytics-accelerator.telemetry.logging.name | `com.amazon.connector.s3.telemetry` | Logger name for telemetry |
+| s3.analytics-accelerator.telemetry.format | `default` | Telemetry output format (valid values: `json`, `default`) |
+
+##### Object Client Configuration

Review Comment:
   ```suggestion
   #### Object Client Configuration
   ```
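
As a usage illustration for the tables quoted above (again, not part of the PR), the same keys can be supplied as catalog properties when constructing `S3FileIO` directly. The chosen values and the `s3://` path are placeholders, and the specific property subset shown is just an example:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.iceberg.aws.s3.S3FileIO;
import org.apache.iceberg.io.InputFile;

public class AnalyticsAcceleratorPropertiesSketch {
  public static void main(String[] args) {
    // Keys come from the tables quoted above; the values and the path are placeholders,
    // not recommendations from the PR.
    Map<String, String> properties = new HashMap<>();
    properties.put("s3.analytics-accelerator.enabled", "true");
    // Client: keep the CRT-based async client but shrink its connection pool.
    properties.put("s3.crt.enabled", "true");
    properties.put("s3.crt.max-concurrency", "100");
    // Logical IO: keep footer prefetching on and prefetch at row-group granularity.
    properties.put("s3.analytics-accelerator.logicalio.prefetch.footer.enabled", "true");
    properties.put("s3.analytics-accelerator.logicalio.prefetching.mode", "ROW_GROUP");

    // S3FileIO picks these up through its standard initialize(Map) hook, the same map a
    // catalog passes when io-impl is set to org.apache.iceberg.aws.s3.S3FileIO.
    S3FileIO io = new S3FileIO();
    io.initialize(properties);

    // Reads issued through this FileIO would then go through the accelerator
    // (bucket and key below are placeholders, requiring real AWS credentials to resolve).
    InputFile file = io.newInputFile("s3://my-bucket2/my/key/prefix/data/part-00000.parquet");
    System.out.println("file length: " + file.getLength());

    io.close();
  }
}
```

When using Spark instead, each of these keys would be prefixed with `spark.sql.catalog.my_catalog.`, as in the spark-sql example quoted earlier in the hunk.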
##########
docs/docs/aws.md:
##########
+##### Telemetry Configuration

Review Comment:
   ```suggestion
   #### Telemetry Configuration
   ```
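
A similarly hedged sketch for the telemetry table: the keys are taken from the table quoted above, while the chosen values are debugging-oriented assumptions rather than guidance from the PR:

```java
import java.util.HashMap;
import java.util.Map;

public class TelemetryPropertiesSketch {
  public static void main(String[] args) {
    // Keys come from the telemetry table above; values are illustrative only and may not
    // be appropriate for production workloads.
    Map<String, String> telemetry = new HashMap<>();
    telemetry.put("s3.analytics-accelerator.enabled", "true");
    // Raise detail from the STANDARD default to VERBOSE and mirror events to stdout.
    telemetry.put("s3.analytics-accelerator.telemetry.level", "VERBOSE");
    telemetry.put("s3.analytics-accelerator.telemetry.std.out.enabled", "true");
    // Turn on aggregations and flush them every 60 seconds (documented default interval is -1).
    telemetry.put("s3.analytics-accelerator.telemetry.aggregations.enabled", "true");
    telemetry.put("s3.analytics-accelerator.telemetry.aggregations.flush.interval.seconds", "60");

    // These entries would be merged into the same catalog property map shown in the
    // earlier S3FileIO sketch.
    telemetry.forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```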
##########
docs/docs/aws.md:
##########
+##### Physical IO Configuration

Review Comment:
   ```suggestion
   #### Physical IO Configuration
   ```


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org