demshar23 opened a new issue, #12471:
URL: https://github.com/apache/iceberg/issues/12471

   ### Query engine
   
   PySpark in AWS Glue
   
   ### Question
   
   I am trying to use the rewrite_table_path procedure in an AWS Glue version 5 
PySpark job or notebook, where I set the Spark config to assume a cross-account 
role (via the AssumeRoleAwsClientFactory config) so the procedure runs under 
that role instead of the Glue job execution role. When I run the procedure, it 
writes the modified metadata files to the S3 metadata staging location using 
the assumed role, as verified by the S3 access logs. But when it commits the 
final CSV output manifest file, which lists the data files and metadata files 
to be copied, it uses the Glue execution role instead. I can also see via 
CloudTrail that it uses the assumed role for the glue:GetTable API call.
   
   So it seems the assumed role is used successfully for the Glue client, but 
only partially for the S3 client, when executing the procedure.
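   For reference, the invocation looks roughly like this (the table name, 
prefixes, and staging location below are placeholders, not the actual values 
from the job):

   ```python
   # Rough shape of the procedure call; table name, prefixes, and staging
   # location are placeholders. The file-list CSV ends up under staging_location.
   spark.sql(f"""
       CALL {catalog_name}.system.rewrite_table_path(
           table => 'db.my_table',
           source_prefix => 's3://source-bucket/warehouse/db/my_table',
           target_prefix => 's3://target-bucket/warehouse/db/my_table',
           staging_location => 's3://staging-bucket/rewrite-staging/'
       )
   """)
   ```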
   
   Config: 
   
   I'm importing the iceberg-core-1.8.0, iceberg-spark-runtime-3.5_2.12-1.8.0, 
and iceberg-aws-bundle-1.8.0 JARs into the job and using the following Spark 
session builder config:
   
   ```python
   # GlueCatalog with S3FileIO; all AWS clients should go through the
   # AssumeRoleAwsClientFactory and the configured cross-account role.
   spark = SparkSession.builder \
       .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") \
       .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
       .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
       .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
       .config(f"spark.sql.catalog.{catalog_name}.glue.id", f"{aws_account_id}") \
       .config(f"spark.sql.catalog.{catalog_name}.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory") \
       .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.arn", f"{role_arn}") \
       .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.session-name", f"{session_name}") \
       .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.region", f"{aws_region}") \
       .getOrCreate()
   ```
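   As a diagnostic (boto3 here is unrelated to the Iceberg config), this kind 
of check confirms which principal the job's ambient credential chain resolves 
to, i.e. what any code path that bypasses the catalog's client factory would 
run as:

   ```python
   import boto3

   # Diagnostic only: get_caller_identity returns the principal that any AWS
   # call bypassing the catalog's AssumeRoleAwsClientFactory would run as
   # (the Glue job execution role in this setup).
   sts = boto3.client("sts", region_name=aws_region)
   print(sts.get_caller_identity()["Arn"])
   ```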
   
   If I don't import the JARs and simply query the Glue table in a session 
with the same configuration, it correctly uses the assumed role for both the 
S3 and Glue APIs through the AssumeRoleAwsClientFactory config. So the issue 
seems isolated to this procedure: I can run other procedures like 
rewrite_data_files, and they correctly use the assumed role to write the 
rewritten data files to S3 instead of using the Glue execution role. The 
breakdown occurs only when the procedure writes the final CSV file into the 
file-list subfolder within the staging location.
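   One untested idea: if the file-list write happens to fall back to Hadoop's 
S3A filesystem rather than going through the catalog's S3FileIO (purely a 
guess on my part), configuring S3A itself to assume the same role might route 
that write through the assumed role as well. The fs.s3a.* keys below are 
standard hadoop-aws properties, not Iceberg ones:

   ```python
   from pyspark.sql import SparkSession

   # Speculative workaround, only relevant if the file-list write goes through
   # Hadoop's S3A filesystem instead of the catalog's S3FileIO. These fs.s3a.*
   # keys are standard hadoop-aws (S3A) properties.
   spark = SparkSession.builder \
       .config("spark.hadoop.fs.s3a.aws.credentials.provider",
               "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider") \
       .config("spark.hadoop.fs.s3a.assumed.role.arn", role_arn) \
       .config("spark.hadoop.fs.s3a.assumed.role.session.name", session_name) \
       .getOrCreate()
   ```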
   
   Beyond that, any ideas for an alternative configuration that might work, or 
is this perhaps a conflict in the procedure itself?

