haizhou-zhao opened a new issue, #11109: URL: https://github.com/apache/iceberg/issues/11109
### Query engine

Spark

### Question

## Background

This is another behavior discrepancy between the Hive/Hadoop catalogs and the REST catalog, discovered while enabling integration tests on the REST catalog. The assumption is that all existing Spark integration tests should pass as-is on the REST catalog, just as they pass on the Hive and Hadoop catalogs, because conceptually Spark expects the same behavior from Iceberg regardless of the catalog type used.

Reference issue: https://github.com/apache/iceberg/issues/11079

## What

https://github.com/apache/iceberg/blob/main/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java#L813

This integration test passes on the Hive and Hadoop catalogs, but fails on the REST catalog reference implementation (RESTCatalogAdapter over JdbcCatalog).

## Why

The root cause is that the REST catalog reference implementation, unlike the Hive and Hadoop catalogs, runs extra logic on the server side when responding to a `CREATE OR REPLACE ${table}` Spark command.

A `CREATE OR REPLACE ${table}` command causes a `RemoveSnapshotRef` (`implements MetadataUpdate`) change to be sent (within `UpdateTableRequest`) to the REST catalog server. When the reference implementation server processes this change, it runs the `removeRef` method while building the replacement metadata, and that method clears the snapshot log: https://github.com/apache/iceberg/blob/113c6e7/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1250. The Hive and Hadoop catalogs, however, do not run that method when building replacement metadata.

As a result, running `CREATE OR REPLACE ${table}` on the REST catalog fails the test's validation that the snapshot log is kept intact after a `CREATE OR REPLACE` command.

## Questions

1. Should snapshot history be kept intact after a table replacement call according to the spec, no matter which catalog type is used?
2. On table replacement, if `snapshot-log` should be cleared, does the `snapshots` list itself need to be cleared as well?
3. Should the answers to the two questions above vary by catalog type (i.e., are they catalog implementation details rather than a spec issue)? Under that reading, the Hive and Hadoop catalogs are doing the right thing by not clearing `snapshot-log` or `snapshots` on table replacement, and since REST catalog implementations are not controlled by this repo, each can freely choose whether or not to clear `snapshot-log` and `snapshots` on replacement.

**In this case, should we change the REST catalog reference implementation (RESTCatalogAdapter on JdbcCatalog) so that its behavior matches the Hive and Hadoop catalogs (where snapshot history is not cleared on replacement)? Or should we change the integration test so that it does not check that snapshot history is intact after table replacement?**
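To make the discrepancy concrete, here is a minimal toy model of the two replacement paths described under "Why". It is only a sketch: `ReplaceTableSketch`, `Metadata`, `replaceWithoutRemoveRef`, and `replaceViaRemoveRef` are hypothetical names invented for illustration and are not Iceberg's `TableMetadata` API. It also assumes, as question 2 implies, that the `snapshots` list is left in place when the snapshot log is cleared.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, NOT Iceberg's real TableMetadata API.
final class ReplaceTableSketch {

  /** Minimal stand-in for table metadata: a snapshot list plus a snapshot log. */
  static final class Metadata {
    final List<Long> snapshots = new ArrayList<>();
    final List<Long> snapshotLog = new ArrayList<>();

    /** Records a new snapshot in both the snapshot list and the log. */
    void commitSnapshot(long snapshotId) {
      snapshots.add(snapshotId);
      snapshotLog.add(snapshotId);
    }

    /** Returns an independent copy so the base metadata is never mutated. */
    Metadata copy() {
      Metadata copy = new Metadata();
      copy.snapshots.addAll(snapshots);
      copy.snapshotLog.addAll(snapshotLog);
      return copy;
    }
  }

  /** Models the Hive/Hadoop behavior described above: history is left untouched. */
  static Metadata replaceWithoutRemoveRef(Metadata base) {
    return base.copy();
  }

  /**
   * Models the REST reference implementation as described above: applying the
   * remove-ref change clears the snapshot log. Keeping the snapshot list is an
   * assumption taken from question 2.
   */
  static Metadata replaceViaRemoveRef(Metadata base) {
    Metadata replaced = base.copy();
    replaced.snapshotLog.clear();
    return replaced;
  }

  public static void main(String[] args) {
    Metadata base = new Metadata();
    base.commitSnapshot(1L);
    base.commitSnapshot(2L);

    System.out.println("without removeRef: log=" + replaceWithoutRemoveRef(base).snapshotLog);
    System.out.println("via removeRef:     log=" + replaceViaRemoveRef(base).snapshotLog);
  }
}
```

In this model, a check that the snapshot log still has its pre-replacement entries passes on the first path and fails on the second, which mirrors the reported test failure.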