[GitHub] [iceberg] rajarshisarkar opened a new pull request, #7194: Core, AWS: Auto optimize table using post commit notifications

via GitHub Fri, 24 Mar 2023 03:32:08 -0700


rajarshisarkar opened a new pull request, #7194:
URL: https://github.com/apache/iceberg/pull/7194


   This PR uses `MetricsReporter` post-commit notifications to auto-optimize 
tables. The solution lets users collect table activity during writes and 
make better decisions about how to optimize each table.
   
   The overall approach is to form the rewrite data files SQL command and 
submit it to EMR-on-EC2 or Athena after write operations. Users select one 
of two implementations via table or catalog properties, `RewriteUsingEMREC2` 
or `RewriteUsingAthena`, and the table is rewritten once a configured 
threshold (commit-based or time-based) is met.
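
The threshold idea above can be illustrated with a small, language-neutral sketch in Python. It is not the PR's Java implementation: `OptimizeTrigger`, `build_rewrite_sql`, and the `submit` callback are hypothetical names introduced here, and the SQL shown assumes the Spark `rewrite_data_files` procedure is what gets submitted on the EMR path. The sketch only counts commits per table, checks a commit or time threshold, and hands the formed SQL to a submitter.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


def build_rewrite_sql(catalog: str, table: str) -> str:
    """Form a Spark SQL call to Iceberg's rewrite_data_files procedure."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


@dataclass
class OptimizeTrigger:
    """Hypothetical helper: decides after each commit whether to submit a rewrite."""

    catalog: str
    submit: Callable[[str], None]               # stand-in for an EMR or Athena submitter
    commit_threshold: Optional[int] = None      # rewrite after this many commits
    time_threshold_sec: Optional[float] = None  # or after this much elapsed time
    clock: Callable[[], float] = time.monotonic
    _commits: Dict[str, int] = field(default_factory=dict)
    _last_rewrite: Dict[str, float] = field(default_factory=dict)

    def on_commit(self, table: str) -> bool:
        """Record one commit; return True if a rewrite was submitted."""
        now = self.clock()
        self._commits[table] = self._commits.get(table, 0) + 1
        # The first commit seen for a table starts its time window.
        last = self._last_rewrite.setdefault(table, now)

        commit_due = (
            self.commit_threshold is not None
            and self._commits[table] >= self.commit_threshold
        )
        time_due = (
            self.time_threshold_sec is not None
            and now - last >= self.time_threshold_sec
        )
        if not (commit_due or time_due):
            return False

        self.submit(build_rewrite_sql(self.catalog, table))
        # Reset both counters so the next window starts from this rewrite.
        self._commits[table] = 0
        self._last_rewrite[table] = now
        return True


# Usage: collect submitted SQL in a list instead of calling a real service.
submitted = []
trigger = OptimizeTrigger(catalog="my_catalog", submit=submitted.append, commit_threshold=2)
trigger.on_commit("db.events")  # first commit: below the threshold
trigger.on_commit("db.events")  # second commit: threshold reached, SQL submitted
print(submitted)
```

Passing the submitter as a callback keeps the threshold logic independent of where the command runs, which mirrors the choice between the two implementations described above.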
   
   Looking forward to the community feedback.
   
   ---
   
   Commands to launch a session, first with `RewriteUsingEMREC2`, then with 
`RewriteUsingAthena`:
   ```
   spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
       --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
       --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
       --conf spark.sql.catalog.my_catalog.warehouse=s3://bucket/warehouse/test-table \
       --conf spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.aws.reporter.OptimizeTableReporter \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.synchronous.enabled=true \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.impl=org.apache.iceberg.aws.emr.RewriteUsingEMREC2 \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.emr.cluster-id=j-3QLCP3UJJ7IOZ \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.commit.threshold=1
   ```
   
   ```
   spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
       --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
       --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
       --conf spark.sql.catalog.my_catalog.warehouse=s3://bucket/warehouse/test-table \
       --conf spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.aws.reporter.OptimizeTableReporter \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.synchronous.enabled=true \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.impl=org.apache.iceberg.aws.athena.RewriteUsingAthena \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.emr.cluster-id=j-3QLCP3UJJ7IOZ \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.commit.threshold=1 \
       --conf spark.sql.catalog.my_catalog.auto.optimize.rewrite-data-files.athena.output-bucket=s3://bucket
   ```
   
   ---
   cc: @jackye1995 @singhpk234 @amogh-jahagirdar 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

