shaeqahmed opened a new issue, #9411: URL: https://github.com/apache/iceberg/issues/9411
### Apache Iceberg version

1.4.2 (latest release)

### Query engine

None

### Please describe the bug 🐞

This looks similar to an issue I found that was supposed to be fixed in an older version: https://github.com/apache/iceberg/issues/7151

We have Java Iceberg code that processes from a FIFO queue and commits to Iceberg in a single-threaded fashion. I have confirmed that we never commit to the same table concurrently. However, after doing a few commits back to back, we encountered the following WARN log indicating that Glue detected a concurrent update and the commit was being retried:

```
Retrying task after failure: Cannot commit glue_catalog.matano.cloudflare_http_request because Glue detected concurrent update
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit glue_catalog.matano.cloudflare_http_request because Glue detected concurrent update
	at org.apache.iceberg.aws.glue.GlueTableOperations.handleAWSExceptions(GlueTableOperations.java:355) ~[output.jar:?]
	at org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:180)
	...
```

But immediately after this log, while attempting to refresh the Iceberg metadata, an Iceberg NotFoundException was thrown because the current metadata location no longer exists:
```
INFO BaseMetastoreTableOperations - Refreshing table metadata from new version: s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
ERROR IcebergMetadataWriter - org.apache.iceberg.exceptions.NotFoundException: Location does not exist: s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
```

**This left our table corrupt and affected the availability of our data lake service until we manually fixed the table by taking the Glue `previous_metadata_location` and overriding the invalid current `metadata_location` with it.**

My understanding is that a CommitFailedException (CFE) is retried internally and in any case should not leave the table corrupt, even if all retries fail. Our code looks as follows, and we catch all exceptions:

```kotlin
// tableObj is our class, a thin wrapper around the Iceberg Java Table class
logger.info("Committing for tables: ${tableObjs.keys}")
start = System.currentTimeMillis()
runBlocking {
    for (tableObj in tableObjs.values) {
        launch(Dispatchers.IO) {
            try {
                if (tableObj.isInitalized()) {
                    tableObj.getAppendFiles().commit()
                }
            } catch (e: Exception) {
                logger.error(e.message)
                e.printStackTrace()
                failures.addAll(tableObj.sqsMessageIds)
            }
        }
    }
}
logger.info("Committed tables in ${System.currentTimeMillis() - start} ms")
```

Is this a bug in the Glue Iceberg code, or how should we protect ourselves from a situation where the Iceberg table is left pointing to an invalid location because of commits that failed due to concurrent modifications reported by Glue?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
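By "protect ourselves" I mean something like an application-level bounded retry around the commit, treating only a commit-failed error as retryable (on top of whatever Iceberg retries internally). A minimal sketch of that idea — note the `CommitFailedException` here is a local stand-in for illustration, not the real `org.apache.iceberg.exceptions.CommitFailedException`, and the `Commit` interface is hypothetical:

```java
import java.util.concurrent.ThreadLocalRandom;

public class CommitRetry {
    /** Local stand-in for Iceberg's CommitFailedException (for illustration only). */
    static class CommitFailedException extends RuntimeException {
        CommitFailedException(String msg) { super(msg); }
    }

    /** Hypothetical commit action, e.g. wrapping appendFiles.commit(). */
    @FunctionalInterface
    interface Commit { void run() throws CommitFailedException; }

    /**
     * Retries a commit a bounded number of times with jittered backoff.
     * Only CommitFailedException is retried; any other failure propagates.
     */
    static void commitWithRetry(Commit commit, int maxAttempts) throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                commit.run();
                return;
            } catch (CommitFailedException e) {
                if (attempt >= maxAttempts) throw e;
                // Linear backoff plus jitter to avoid hammering the catalog.
                long backoffMs = 100L * attempt + ThreadLocalRandom.current().nextLong(50);
                Thread.sleep(backoffMs);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated commit: fails twice with a "concurrent update", then succeeds.
        commitWithRetry(() -> {
            if (++calls[0] < 3) throw new CommitFailedException("Glue detected concurrent update");
        }, 5);
        System.out.println("committed after " + calls[0] + " attempts");
    }
}
```

But even with such a guard, my concern is that a failed commit should never leave the catalog pointing at a metadata file that was cleaned up.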
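For anyone who hits the same corrupt state, the manual fix we applied can be sketched roughly as below. This only illustrates the parameter rewrite itself — `metadata_location` and `previous_metadata_location` are the keys Iceberg stores in the Glue table parameters; actually reading and writing the Glue table (e.g. via the AWS SDK's GetTable/UpdateTable) and verifying that the previous metadata file still exists in S3 are omitted:

```java
import java.util.HashMap;
import java.util.Map;

public class GlueMetadataRepair {
    // Glue table parameter keys written by Iceberg's Glue catalog.
    static final String METADATA_LOCATION = "metadata_location";
    static final String PREVIOUS_METADATA_LOCATION = "previous_metadata_location";

    /**
     * Returns a copy of the Glue table parameters with metadata_location
     * rolled back to previous_metadata_location, mirroring the manual fix
     * described above. Throws if no previous location is recorded.
     */
    static Map<String, String> rollBackMetadataLocation(Map<String, String> params) {
        String previous = params.get(PREVIOUS_METADATA_LOCATION);
        if (previous == null || previous.isEmpty()) {
            throw new IllegalStateException("No previous_metadata_location to roll back to");
        }
        Map<String, String> repaired = new HashMap<>(params);
        repaired.put(METADATA_LOCATION, previous);
        return repaired;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(METADATA_LOCATION, "s3://bucket/lake/t/metadata/bad.metadata.json");
        params.put(PREVIOUS_METADATA_LOCATION, "s3://bucket/lake/t/metadata/good.metadata.json");
        System.out.println(rollBackMetadataLocation(params).get(METADATA_LOCATION));
    }
}
```

This worked for us because the previous metadata file had not been deleted, but it discards the failed commit, so it is a last-resort repair rather than a fix for the underlying bug.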