waichee opened a new issue, #9406: URL: https://github.com/apache/iceberg/issues/9406
### Apache Iceberg version

1.3.1

### Query engine

Spark

### Please describe the bug 🐞

**Setup**

We use the following Spark libraries to write to Iceberg on EMR:
- `org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1`
- `org.apache.iceberg:iceberg-spark-extensions-3.4_2.12:1.3.1`

We have set up the DynamoDB lock manager with the Glue catalog for our Iceberg table.

Spark config:
```
sparkConf
    .set("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog.iceberg.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .set("spark.sql.catalog.iceberg.lock-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager")
    .set("spark.sql.catalog.iceberg.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .set("spark.sql.catalog.iceberg.warehouse", warehousePath)
    .set("spark.sql.catalog.iceberg.lock.table", locktableName)
    .set("spark.sql.defaultCatalog", "iceberg");
```

Table properties:
```
Map<String, String> tableOptions = Map.of(
    "provider", CATALOG_PROVIDER,
    "write.parquet.compression-codec", "snappy",
    "write.delete.mode", "copy-on-write",
    "write.update.mode", "merge-on-read",
    "write.merge.mode", "merge-on-read",
    "write.spark.accept-any-schema", "true");
```

Java code that writes to the Iceberg table:
```
dataframe
    .writeTo(String.format("%s.%s", database, tablename))
    .option("mergeSchema", "true")
    .append();
```

**What happened**

We noticed that the `.append()` write to a table failed with the following exception:
```
Caused by: org.apache.iceberg.exceptions.CommitFailedException: Cannot commit iceberg.zdp_l1_scd2_cdc_sharddb.fraud_scores because base metadata location 's3://bucket/iceberg/db/tableName/metadata/metadata1.json' is not same as the current Glue location 's3://bucket/iceberg/db/tableName/metadata/metadata2.json'
    at org.apache.iceberg.aws.glue.GlueTableOperations.checkMetadataLocation(GlueTableOperations.java:272)
    at org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:158)
```

When we restart the job, it throws a 404 (not found) while looking up the manifest list for the Iceberg table:
```
Caused by: software.amazon.awssdk.services.s3.model.NoSuchKeyException: null (Service: S3, Status Code: 404, Request ID: BGEGZNRNM1YR7MGW, Extended Request ID: qTaJ1r8Ky1vMKOAU5+r4djAQY9r05jGqvFFGLJz9WD6upWROlpIroSEefjGsw2OQI41k/pHKRRI=) (Service: S3, Status Code: 404, Request ID: BGEGZNRNM1YR7MGW)
    at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:126)
    at software.amazon.awssdk.services.s3.model.NoSuchKeyException$BuilderImpl.build(NoSuchKeyException.java:80)
    at software.amazon.awssdk.services.s3.internal.handlers.ExceptionTranslationInterceptor.modifyException(ExceptionTranslationInterceptor.java:63)
    at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.modifyException(ExecutionInterceptorChain.java:202)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.runModifyException(ExceptionReportingUtils.java:54)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.reportFailureToInterceptors(ExceptionReportingUtils.java:38)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:39)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
    at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:198)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:171)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:82)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:179)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:76)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:56)
    at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:5495)
    at org.apache.iceberg.aws.s3.BaseS3File.getObjectMetadata(BaseS3File.java:85)
    at org.apache.iceberg.aws.s3.S3InputFile.getLength(S3InputFile.java:77)
    at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
    at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:76)
    at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:36)
    at org.apache.iceberg.relocated.com.google.common.collect.Iterables.addAll(Iterables.java:333)
    at org.apache.iceberg.relocated.com.google.common.collect.Lists.newLinkedList(Lists.java:241)
    at org.apache.iceberg.ManifestLists.read(ManifestLists.java:45)
    at org.apache.iceberg.BaseSnapshot.cacheManifests(BaseSnapshot.java:146)
    at org.apache.iceberg.BaseSnapshot.dataManifests(BaseSnapshot.java:172)
    at org.apache.iceberg.MergingSnapshotProducer.apply(MergingSnapshotProducer.java:826)
    at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:226)
    at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:376)
    at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
    at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
    at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:374)
    at org.apache.iceberg.spark.source.SparkWrite.commitOperation(SparkWrite.java:222)
    at org.apache.iceberg.spark.source.SparkWrite.access$1300(SparkWrite.java:84)
    at org.apache.iceberg.spark.source.SparkWrite$BatchAppend.commit(SparkWrite.java:285)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:422)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:382)
    at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.writeWithV2(WriteToDataSourceV2Exec.scala:248)
    at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run(WriteToDataSourceV2Exec.scala:360)
    at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run$(WriteToDataSourceV2Exec.scala:359)
    at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.run(WriteToDataSourceV2Exec.scala:248)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
    at org.apache.spark.sql.DataFrameWriterV2.runCommand(DataFrameWriterV2.scala:195)
    at org.apache.spark.sql.DataFrameWriterV2.append(DataFrameWriterV2.scala:149)
```

Inspecting Glue and S3, we see the following:
- Glue: metadata = `s3://bucket/iceberg/db/tableName/metadata/metadata2.json`, previous_metadata = `s3://bucket/iceberg/db/tableName/metadata/metadata1.json`
- S3: both metadata.json files exist. However, metadata2.json points to a manifest list location that does not exist (a sketch of how such a check can be done follows this list).
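For reference, this is roughly the kind of check that shows the dangling pointer: download the metadata.json that Glue currently points at, read the `manifest-list` of its current snapshot, and HEAD that object in S3. This is only an illustrative sketch using the AWS SDK v2 S3 client and Jackson, with placeholder bucket/key names, not the exact code we ran:
```
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;

public class CheckManifestList {
  public static void main(String[] args) throws Exception {
    // Placeholders -- substitute the real bucket and the metadata.json key Glue points at.
    String bucket = "bucket";
    String metadataKey = "iceberg/db/tableName/metadata/metadata2.json";

    try (S3Client s3 = S3Client.create()) {
      // Download the metadata.json that Glue currently references.
      byte[] metadataBytes = s3.getObjectAsBytes(
          GetObjectRequest.builder().bucket(bucket).key(metadataKey).build()).asByteArray();
      JsonNode metadata = new ObjectMapper().readTree(metadataBytes);

      // Locate the current snapshot and its manifest list (field names per the Iceberg table spec).
      long currentSnapshotId = metadata.get("current-snapshot-id").asLong();
      for (JsonNode snapshot : metadata.get("snapshots")) {
        if (snapshot.get("snapshot-id").asLong() != currentSnapshotId) {
          continue;
        }
        String manifestList = snapshot.get("manifest-list").asText(); // s3://bucket/...
        String key = manifestList.replaceFirst("^s3://[^/]+/", "");
        try {
          // HEAD the manifest list; a missing object surfaces as NoSuchKeyException (404).
          s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build());
          System.out.println("manifest list exists: " + manifestList);
        } catch (NoSuchKeyException e) {
          System.out.println("DANGLING manifest list: " + manifestList);
        }
      }
    }
  }
}
```
A missing manifest list like this matches the `NoSuchKeyException` we see when the job is restarted.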
We are unable to use the table after this without manually rolling the metadata pointer in Glue back to the previous metadata file (a sketch of that rollback is at the end of this report). Is there a bug in the optimistic locking with the Glue catalog in this case?

Would appreciate some pointers on what could have caused this.
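For completeness, this is the kind of manual rollback we mean. The Glue catalog keeps the current metadata pointer in the `metadata_location` table parameter (with the previous one in `previous_metadata_location`), so rolling back amounts to repointing that parameter at the last known-good metadata.json. A minimal sketch using the AWS SDK v2 Glue client; the database, table, and location values are placeholders:
```
import java.util.HashMap;
import java.util.Map;
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTableRequest;
import software.amazon.awssdk.services.glue.model.Table;
import software.amazon.awssdk.services.glue.model.TableInput;
import software.amazon.awssdk.services.glue.model.UpdateTableRequest;

public class RollbackGlueMetadataPointer {
  public static void main(String[] args) {
    // Placeholders -- the Glue database/table and the last known-good metadata.json.
    String database = "db";
    String tableName = "tableName";
    String goodMetadata = "s3://bucket/iceberg/db/tableName/metadata/metadata1.json";

    try (GlueClient glue = GlueClient.create()) {
      Table table = glue.getTable(
          GetTableRequest.builder().databaseName(database).name(tableName).build()).table();

      // Repoint the Iceberg metadata pointer kept in the Glue table parameters.
      Map<String, String> params = new HashMap<>(table.parameters());
      params.put("previous_metadata_location", params.get("metadata_location"));
      params.put("metadata_location", goodMetadata);

      TableInput input = TableInput.builder()
          .name(table.name())
          .tableType(table.tableType())
          .parameters(params)
          .storageDescriptor(table.storageDescriptor())
          .build();

      glue.updateTable(
          UpdateTableRequest.builder().databaseName(database).tableInput(input).build());
    }
  }
}
```
After repointing, the table is readable again from the earlier snapshot, but any files written by the broken commit are left orphaned until they are cleaned up separately.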