shaeqahmed opened a new issue, #9411:
URL: https://github.com/apache/iceberg/issues/9411

   ### Apache Iceberg version
   
   1.4.2 (latest release)
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   This looks similar to an issue I found that was supposed to be fixed in an older version:
https://github.com/apache/iceberg/issues/7151
   
   We have Java Iceberg code that processes from a FIFO queue and commits to
Iceberg in a single-threaded fashion. I have confirmed that we never commit to
the same table from more than one place at a time. However, after a few
back-to-back commits, we encountered the following WARN log indicating that
Glue detected a concurrent update and was retrying:
   
   ```
   Retrying task after failure: Cannot commit 
glue_catalog.matano.cloudflare_http_request because Glue detected concurrent 
update org.apache.iceberg.exceptions.CommitFailedException: Cannot commit 
glue_catalog.matano.cloudflare_http_request because Glue detected concurrent 
update at 
org.apache.iceberg.aws.glue.GlueTableOperations.handleAWSExceptions(GlueTableOperations.java:355)
 ~[output.jar:?] at 
org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:180)
 
   ...
   ```
   
   But immediately after this log, while attempting to refresh the Iceberg
metadata, there is an Iceberg NotFoundException because the current metadata
location no longer exists (or never existed):
   
   ```
   INFO BaseMetastoreTableOperations - Refreshing table metadata from new 
version: 
s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
   
   ERROR IcebergMetadataWriter - 
org.apache.iceberg.exceptions.NotFoundException: Location does not exist: 
s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
   ```
   
   **This resulted in our table becoming corrupt and the availability of
our data lake service being affected until we manually fixed the table by
referencing the Glue `previous_metadata_location` and overriding the invalid
current `metadata_location` with it.**
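
   For anyone who hits the same state, the manual fix we applied can be
sketched as a pure function over the Glue table's parameters map (hedged
sketch only; it assumes the Iceberg Glue catalog's standard parameter keys
`metadata_location` and `previous_metadata_location`, and the returned map
still has to be written back to Glue, e.g. via an `UpdateTable` request or
the console):

   ```kotlin
   // Roll the Glue catalog pointer back to the last known-good metadata file.
   // Assumes the Iceberg Glue catalog's standard table parameters:
   //   metadata_location          -> current (possibly invalid) pointer
   //   previous_metadata_location -> last successfully committed pointer
   fun rollBackMetadataLocation(params: Map<String, String>): Map<String, String> {
       val previous = params["previous_metadata_location"]
           ?: error("no previous_metadata_location recorded; cannot roll back")
       return params + ("metadata_location" to previous)
   }
   ```

   This is effectively what we did by hand: copy `previous_metadata_location`
over `metadata_location` and leave everything else in the table untouched.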
   
   My understanding is that a CommitFailedException (CFE) is retried
internally and, in any case, should not result in a corrupt table even if all
retries fail. Our code looks as follows; we catch all exceptions:
   
   ```kotlin
   // tableObj is our class, a thin wrapper around the Iceberg Java Table class

   logger.info("Committing for tables: ${tableObjs.keys}")
   start = System.currentTimeMillis()
   runBlocking {
       for (tableObj in tableObjs.values) {
           launch(Dispatchers.IO) {
               try {
                   if (tableObj.isInitalized()) {
                       tableObj.getAppendFiles().commit()
                   }
               } catch (e: Exception) {
                   logger.error(e.message)
                   e.printStackTrace()
                   failures.addAll(tableObj.sqsMessageIds)
               }
           }
       }
   }

   logger.info("Committed tables in ${System.currentTimeMillis() - start} ms")
   ```
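
   As a defensive measure while this is unresolved, we are considering
wrapping the commit in an explicit retry with exponential backoff (sketch
only; the local `CommitFailedException` class stands in for
`org.apache.iceberg.exceptions.CommitFailedException`, and `attempt` would
wrap `tableObj.getAppendFiles().commit()`):

   ```kotlin
   // Stand-in for org.apache.iceberg.exceptions.CommitFailedException.
   class CommitFailedException(msg: String) : RuntimeException(msg)

   // Retry the commit a few times with exponential backoff; rethrow the last
   // failure if every attempt is exhausted.
   fun <T> retryCommit(maxAttempts: Int = 4, baseDelayMs: Long = 100, attempt: () -> T): T {
       var lastError: CommitFailedException? = null
       repeat(maxAttempts) { i ->
           try {
               return attempt()
           } catch (e: CommitFailedException) {
               lastError = e
               Thread.sleep(baseDelayMs shl i) // 100, 200, 400, ... ms
           }
       }
       throw lastError!!
   }
   ```

   This does not prevent the corrupt-pointer state described above; it only
narrows the window, which is why we are asking about the root cause below.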
   
   Is this a bug in the Glue Iceberg code? If not, how should we protect
ourselves against the table being left pointing to an invalid metadata
location after commits fail due to concurrent modifications reported by Glue?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
