haizhou-zhao opened a new issue, #11109:
URL: https://github.com/apache/iceberg/issues/11109

   ### Query engine
   
   Spark
   
   ### Question
   
   ## Background
   This is another Hive/Hadoop and REST Catalog behavior discrepancies 
discovered while enabling integration test on REST catalog. The assumption here 
is all the existing Spark integration tests should pass as-is on REST Catalog 
just like how they would pass using Hive & Hadoop Catalog, because conceptually 
Spark expects the same behavior out of Iceberg, no matter what catalog types 
are used. Reference issue: https://github.com/apache/iceberg/issues/11079
   
   ## What
   
https://github.com/apache/iceberg/blob/main/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java#L813
 
   
   This integration test passes on Hive & Hadoop Catalog, but does not pass on 
REST Catalog reference implementation (RESTCatalogAdapter over JdbcCatalog).
   
   ## Why
   Root Cause is REST Catalog reference implementation, comparing to Hive & 
Hadoop Catalog, will run extra logic on server side when responding to a 
`CREATE OR REPLACE ${table}` spark command. A `CREATE OR REPLACE ${table}` 
command will trigger a `RemoveSnapshotRef` (`implements MetadataUpdate`) change 
to be sent (within UpdateTableRequest) to REST Catalog server side. When the 
reference implementation server process this change, it will run `removeRef` 
method when building the replacement metadata, within which, snapshot log is 
cleared: 
https://github.com/apache/iceberg/blob/113c6e7/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1250.
 However, Hive and Hadoop Catalog does not run that method when building 
replacement metadata. At the end of the day, running `CREATE OR REPLACE 
${table}` on top of REST Catalog will result in failure of validating snapshot 
log is kept intact after `CREATE OR REPLACE` command.
   
   ## Questions
   1. Should snapshot history be kept intact after table replacement call based 
on spec definition, no matter the catalog type used?
   2. On table replacement, If `snapshot-log` should get cleared, then does 
`snapshots` list itself need to be cleared as well?
   3. Should the answer to 2 questions above vary based on catalog types (i.e. 
they are catalog implementation details, not spec issue) - that Hive & Hadoop 
catalog are doing the right thing to not clear `snapshot-log` or `snapshots` on 
table replacement; and that since REST catalog implementation is not controlled 
by this repo, it can choose its implementation details freely (whether clear or 
not clear `snapshot-log` & `snapshots` on replacement). **In this case, should 
we change the REST Catalog reference implementation (RESTCatalogAdapter on 
JdbcCatalog) so that its behavior is the same as Hive & Hadoop Catalog (where 
snapshot history is not cleared on replacement)? Or should we change the 
integration test so that we don't check snapshot history being intact after 
table replacement?**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to