demshar23 opened a new issue, #12471: URL: https://github.com/apache/iceberg/issues/12471
### Query engine

PySpark on AWS Glue Version 5

### Question

I am trying to use the `rewrite_table_path` procedure in an AWS Glue Version 5 PySpark job or notebook, where I set the Spark config to assume a cross-account role (using the `AssumeRoleAwsClientFactory` config) so the procedure executes under that role instead of the Glue job execution role.

When I run the procedure, it writes the modified S3 metadata files to the S3 metadata staging location using the assumed role, as verified by the S3 access logs. But when it attempts to commit the final CSV output manifest file, which lists the data files and metadata files to be copied, it uses the Glue execution role instead. I can also see via CloudTrail that it uses the assumed role to make the `glue:GetTable` API call. So it is successfully using the assumed role for the Glue client, and only partially for the S3 client, when executing the procedure.

Config: I'm importing the `iceberg-core-1.8.0`, `iceberg-spark-runtime-3.5_2.12-1.8.0`, and `iceberg-aws-bundle-1.8.0` jars into the job and using the following Spark session builder config:

```python
spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.id", f"{aws_account_id}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory") \
    .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.arn", f"{role_arn}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.session-name", f"{session_name}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.region", f"{aws_region}") \
    .getOrCreate()
```

If I don't import the jars and simply query the Glue table in a session with the same configuration, it correctly uses the assumed role to hit both the S3 and Glue APIs with the respective clients created through the `AssumeRoleAwsClientFactory` config. So the problem seems isolated to this procedure: I can run other procedures like `rewrite_data_files`, and they correctly use the assumed role to write the rewritten data files to S3 instead of the Glue execution role. The breakdown occurs only when `rewrite_table_path` places the final CSV file into the `file-list` subfolder within the staging location.

Any ideas on an alternative configuration that might work, or is this perhaps a conflict in the procedure itself?
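For completeness, this is roughly how I'm invoking the procedure from the session above (a minimal sketch; the table name, prefixes, and staging location are placeholders, not my actual values):

```python
# Sketch of the rewrite_table_path call; table name and S3 paths below are
# placeholders. staging_location is where the rewritten metadata files and
# the final CSV file list (the file-list subfolder) are written.
result = spark.sql(f"""
    CALL {catalog_name}.system.rewrite_table_path(
        table => 'my_db.my_table',
        source_prefix => 's3://source-bucket/warehouse/my_db/my_table',
        target_prefix => 's3://target-bucket/warehouse/my_db/my_table',
        staging_location => 's3://staging-bucket/rtp-staging/my_db/my_table'
    )
""")
result.show(truncate=False)
```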