sauliusvl opened a new issue, #11814:
URL: https://github.com/apache/iceberg/issues/11814

   ### Apache Iceberg version
   
   1.6.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We have observed the following situation a few times now when using the 
lock-free Hive catalog commits introduced in 
https://github.com/apache/iceberg/pull/6570:
   
    We run an `ALTER TABLE table SET TBLPROPERTIES ('key' = 'value')`, or any 
other operation that results in an Iceberg commit, from Spark or another 
engine. For whatever reason the connection to the Hive metastore breaks and 
the HMS operation fails on the first attempt:
   ```
   WARN org.apache.hadoop.hive.metastore.RetryingMetaStoreClient: 
MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. 
alter_table_with_environmentContext
   org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Connection reset
   <...>
   at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_environment_context(ThriftHiveMetastore.java:1693)
   <...>
   at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:169)
   <...>
   at org.apache.iceberg.hive.MetastoreUtil.alterTable(MetastoreUtil.java:78)
   at 
org.apache.iceberg.hive.HiveOperationsBase.lambda$persistTable$0(HiveOperationsBase.java:112)
   <...>
   at 
org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:239)
   at 
org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
   <...>
   at org.apache.iceberg.spark.SparkCatalog.alterTable(SparkCatalog.java:345)
   <...>
   ```
   but the operation actually succeeds and updates the metadata location, which 
means that when the `RetryingMetaStoreClient` resubmits the operation, the 
retry fails with:
   ```
   MetaException(message:The table has been modified. The parameter value for 
key 'metadata_location' is '<new>'. The expected was value was '<previous>')
   ```
   The Iceberg commit is then considered failed and the new metadata file is 
cleaned up in the `finally` block 
[here](https://github.com/apache/iceberg/blob/b428fbc59bd1579f4dc918a5cd48fce667d81ce1/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L320)
 before the commit is retried. But the Hive table already has the new 
metadata location set, so when Iceberg then tries to refresh the table, the 
refresh fails because the new metadata file no longer exists, leaving the 
table in a corrupted state.
   
   I suppose a fix could be to inspect the exception and ignore the failure 
when the location already set in HMS is equal to the new metadata location 
this commit tried to write, but parsing the error message sounds very hacky.
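   To illustrate, a minimal sketch of the recovery check (all names here are 
hypothetical, not Iceberg's actual API): when the retried 
`alter_table_with_environment_context` fails with the "table has been 
modified" `MetaException`, re-read the table from HMS and compare its 
`metadata_location` to the one this commit attempted to set; equality would 
mean the first, seemingly failed, attempt was actually applied and the commit 
can be treated as a success instead of being cleaned up.
   ```java
   // Hypothetical sketch, not Iceberg's real API. After the retried
   // alter_table fails, re-read metadata_location from HMS and compare it
   // to the location this commit wrote: if they match, the first
   // (broken-connection) attempt actually succeeded.
   public class CommitCheck {
     static boolean commitActuallySucceeded(String attemptedLocation,
                                            String locationNowInHms) {
       // attemptedLocation: the metadata file this commit tried to set.
       // locationNowInHms: metadata_location re-read from HMS after failure.
       return attemptedLocation != null
           && attemptedLocation.equals(locationNowInHms);
     }
   
     public static void main(String[] args) {
       // The lost-connection attempt was applied: treat the commit as success.
       System.out.println(
           commitActuallySucceeded("metadata/v2.json", "metadata/v2.json"));
       // A concurrent writer moved the table: a real conflict, keep failing.
       System.out.println(
           commitActuallySucceeded("metadata/v2.json", "metadata/v3.json"));
     }
   }
   ```
   This avoids parsing the error message entirely, at the cost of one extra 
`getTable` call on the failure path.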
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [X] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.