rajarshisarkar opened a new pull request, #7194:
URL: https://github.com/apache/iceberg/pull/7194
This PR uses `MetricsReporter` post-commit notifications to automatically
optimize tables. It lets users collect table activity during writes and
make better decisions about how to optimize each table individually.
The overall approach is to form the rewrite data files SQL command and
submit it to EMR-on-EC2 or Athena after write operations. Users can choose
either of the two implementations, `RewriteUsingEMREC2` or
`RewriteUsingAthena`, via table or catalog properties. The rewrite is
triggered once a configured threshold (commit-based or time-based) is met.
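The trigger logic described above can be sketched roughly as follows. This is only an illustration in Python, not the PR's Java code: `RewriteTrigger`, `on_commit`, `rewrite_sql`, and the `time_threshold_sec` parameter are hypothetical names, and the SQL shown is the standard Spark `rewrite_data_files` procedure call, which may differ from the exact command the PR forms (for Athena in particular).

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RewriteTrigger:
    """Hypothetical trigger: fires on a commit-count or elapsed-time threshold."""
    commit_threshold: Optional[int] = None      # e.g. 1 => rewrite after every commit
    time_threshold_sec: Optional[float] = None  # assumed time-based threshold, in seconds
    commits_since_rewrite: int = 0
    last_rewrite_ts: float = field(default_factory=time.time)

    def on_commit(self, now: Optional[float] = None) -> bool:
        """Record one commit; return True if a rewrite should be submitted."""
        now = time.time() if now is None else now
        self.commits_since_rewrite += 1
        commit_hit = (self.commit_threshold is not None
                      and self.commits_since_rewrite >= self.commit_threshold)
        time_hit = (self.time_threshold_sec is not None
                    and now - self.last_rewrite_ts >= self.time_threshold_sec)
        if commit_hit or time_hit:
            # Reset both counters so the next rewrite waits for a fresh threshold.
            self.commits_since_rewrite = 0
            self.last_rewrite_ts = now
            return True
        return False


def rewrite_sql(catalog: str, table: str) -> str:
    """Form a Spark SQL call to Iceberg's rewrite_data_files procedure."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"
```

In this sketch a reporter would call `on_commit` once per commit notification and, when it returns `True`, hand the string from `rewrite_sql` to whichever backend is configured.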
Looking forward to the community feedback.
---
Commands to launch a session, first with the EMR-on-EC2 implementation and then with the Athena implementation:
```
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://bucket/warehouse/test-table \
  --conf spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.aws.reporter.OptimizeTableReporter \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.synchronous.enabled=true \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.impl=org.apache.iceberg.aws.emr.RewriteUsingEMREC2 \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.emr.cluster-id=j-3QLCP3UJJ7IOZ \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.commit.threshold=1
```
```
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://bucket/warehouse/test-table \
  --conf spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.aws.reporter.OptimizeTableReporter \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.synchronous.enabled=true \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.impl=org.apache.iceberg.aws.athena.RewriteUsingAthena \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.emr.cluster-id=j-3QLCP3UJJ7IOZ \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.commit.threshold=1 \
  --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.athena.output-bucket=s3://bucket
```
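Since the two launch commands share most of their properties, a small helper can make the difference between them explicit. The following is a minimal sketch, assuming the property keys shown above. `spark_sql_command`, `PREFIX`, and `AUTO` are hypothetical names introduced for illustration, and the split between "common" and "backend-specific" properties is an assumption rather than something the PR states.

```python
import shlex

CATALOG = "my_catalog"
PREFIX = f"spark.sql.catalog.{CATALOG}"
AUTO = f"{PREFIX}.auto.optimize.rewrite-data-files"


def spark_sql_command(impl: str, extra: dict) -> str:
    """Build a spark-sql launch command from common plus backend-specific properties."""
    conf = {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        PREFIX: "org.apache.iceberg.spark.SparkCatalog",
        f"{PREFIX}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{PREFIX}.warehouse": "s3://bucket/warehouse/test-table",
        f"{PREFIX}.metrics-reporter-impl":
            "org.apache.iceberg.aws.reporter.OptimizeTableReporter",
        f"{AUTO}.synchronous.enabled": "true",
        f"{AUTO}.impl": impl,
        f"{AUTO}.commit.threshold": "1",
    }
    conf.update(extra)  # backend-specific properties are appended last
    args = ["spark-sql"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return shlex.join(args)  # quotes arguments only where the shell requires it


# EMR-on-EC2 variant: the cluster id is the backend-specific property.
emr_cmd = spark_sql_command(
    "org.apache.iceberg.aws.emr.RewriteUsingEMREC2",
    {f"{AUTO}.emr.cluster-id": "j-3QLCP3UJJ7IOZ"},
)

# Athena variant: the output bucket is the backend-specific property.
athena_cmd = spark_sql_command(
    "org.apache.iceberg.aws.athena.RewriteUsingAthena",
    {f"{AUTO}.athena.output-bucket": "s3://bucket"},
)
```

The Athena command in the description also passes `emr.cluster-id`; whether that property is needed there is not stated, so the sketch leaves it to the caller to include it in `extra`.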
---
cc: @jackye1995 @singhpk234 @amogh-jahagirdar