namrathamyske opened a new issue, #9737: URL: https://github.com/apache/iceberg/issues/9737
### Apache Iceberg version

main (development)

### Query engine

None

### Please describe the bug 🐞

Regarding this PR: https://github.com/apache/iceberg/pull/9131. The change reads as: "Schema for a branch should return table schema".

Shouldn't the schema of a branch be the same as when the branch was created, rather than, as in the above change, moving to a future state of the table's schema? Isn't the concept of branching to create a baseline based on the state of the table's data and metadata at the time it was branched? Could you please help me understand the rationale behind this change?

Please consider this example:

```
-- create a table with a single column and insert a value
spark-sql (default)> create table t (s string);
spark-sql (default)> insert into t values ('foo');
-- create a branch, the schema is the same as the original table
spark-sql (default)> alter table t create branch b1;
```

Describe and query the table and the branch:

```
spark-sql (default)> describe default.t;
s string
spark-sql (default)> select * from default.t;
s
foo
spark-sql (default)> describe default.t.branch_b1;
s string
spark-sql (default)> select * from default.t.branch_b1;
s
foo
```

Alter the table using the statements below so that its definition diverges from the branch:

```
spark-sql (default)> alter table t add column i int;
spark-sql (default)> alter table t drop column s;
spark-sql (default)> insert into t values (111);
```

Behavior before the above PR:

[Please NOTE that the changes in the main branch DID NOT IMPACT the data and metadata on the branch, which looks like the desirable behavior for any branching concept.]

```
spark-sql (default)> describe default.t;
i int
spark-sql (default)> select * from default.t;
i
111
spark-sql (default)> describe default.t.branch_b1;
s string
spark-sql (default)> select * from default.t.branch_b1;
s
foo
```

Behavior after the above PR:

[Please NOTE that a schema change in the main branch IMPACTED the data and metadata available on the branch; this feels like undesirable behavior.]

```
spark-sql (default)> describe default.t;
i int
spark-sql (default)> select * from default.t;
i
111
spark-sql (default)> describe default.t.branch_b1;
i int
spark-sql (default)> select * from default.t.branch_b1;
i
--no-data--
```

Unit test to replicate the issue:

```
@Test
public void testSchemaChange() throws Exception {
  Assume.assumeFalse("Avro does not support metadata delete", fileFormat.equals("avro"));
  createAndInitUnpartitionedTable();
  sql("INSERT INTO TABLE %s VALUES (1, 'hr'), (2, 'hardware'), (null, 'hr')", tableName);
  createBranchIfNeeded();

  // Read the branch before any schema change
  String sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
  spark.sql(sql).show();
  /**
   * +----+--------+
   * |  id|     dep|
   * +----+--------+
   * |NULL|      hr|
   * |   1|      hr|
   * |   2|hardware|
   * +----+--------+
   */

  // Metadata delete: drop a column from the table schema
  Table table = Spark3Util.loadIcebergTable(spark, tableName);
  table.refresh();
  table.updateSchema().deleteColumn("dep").commit();

  // Read the branch again after the schema change
  sql = String.format("SELECT * FROM %s ORDER BY id", selectTarget());
  spark.sql(sql).show();
  /**
   * Data loss in branch, impacted as we consume schema from table schema
   * +----+
   * |  id|
   * +----+
   * |NULL|
   * |   1|
   * |   2|
   * +----+
   */

  // Read the table itself
  sql = String.format("SELECT * FROM %s ORDER BY id", tableName);
  spark.sql(sql).show();
  /**
   * +----+
   * |  id|
   * +----+
   * |NULL|
   * |   1|
   * |   2|
   * +----+
   */
}
```
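To make the two behaviors easier to compare, here is a minimal, self-contained sketch of the difference as I understand it. It is a toy model and not the Iceberg API: `BranchSchemaSketch`, `TableMetadata`, `Snapshot`, `schemaFromTable`, and `schemaFromBranchHead` are hypothetical names, and schemas are reduced to lists of column names. The only assumption about Iceberg itself is that each snapshot records the id of the schema that was current when it was written.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the two schema-resolution strategies; hypothetical names, not Iceberg classes. */
public class BranchSchemaSketch {

  /** A snapshot remembers the id of the schema that was current when it was written. */
  static final class Snapshot {
    final long id;
    final int schemaId;

    Snapshot(long id, int schemaId) {
      this.id = id;
      this.schemaId = schemaId;
    }
  }

  /** Simplified table metadata: all known schemas, the current schema id, and named refs. */
  static final class TableMetadata {
    final Map<Integer, List<String>> schemas = new HashMap<>(); // schema id -> column names
    final Map<String, Snapshot> refs = new HashMap<>();         // branch name -> head snapshot
    int currentSchemaId;
  }

  /** Behavior described as "after the PR": the branch name is ignored, the table schema wins. */
  static List<String> schemaFromTable(TableMetadata table, String branch) {
    return table.schemas.get(table.currentSchemaId);
  }

  /** Behavior described as "before the PR": use the schema recorded by the branch's head snapshot. */
  static List<String> schemaFromBranchHead(TableMetadata table, String branch) {
    Snapshot head = table.refs.get(branch);
    return table.schemas.get(head.schemaId);
  }

  /** Replays the spark-sql example from this issue against the toy model. */
  static TableMetadata exampleFromIssue() {
    TableMetadata table = new TableMetadata();

    // create table t (s string); insert into t values ('foo');
    table.schemas.put(0, List.of("s"));
    table.currentSchemaId = 0;
    Snapshot first = new Snapshot(1L, 0);
    table.refs.put("main", first);

    // alter table t create branch b1;
    table.refs.put("b1", first);

    // alter table t add column i int; alter table t drop column s;
    table.schemas.put(1, List.of("s", "i"));
    table.schemas.put(2, List.of("i"));
    table.currentSchemaId = 2;

    // insert into t values (111); only main advances, b1 still points at the first snapshot
    table.refs.put("main", new Snapshot(2L, 2));
    return table;
  }

  public static void main(String[] args) {
    TableMetadata table = exampleFromIssue();
    System.out.println(schemaFromTable(table, "b1"));      // prints [i]
    System.out.println(schemaFromBranchHead(table, "b1")); // prints [s]
  }
}
```

In this model the branch head never moves, so resolving the schema through the head snapshot keeps returning `[s]`, while resolving it through the table follows every later schema change on main.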