sclee01 opened a new issue, #13164: URL: https://github.com/apache/iceberg/issues/13164
### Apache Iceberg version

1.9.0 (latest release)

### Query engine

Snowflake

### Please describe the bug 🐞

### Description

When creating an Iceberg table from a manually defined schema in Java, with field IDs explicitly assigned, we observed that nested struct field IDs are silently reassigned during table creation. As a result, the schema stored in `metadata.json` no longer matches the schema the user defined. This mismatch can break compatibility with external engines that interpret `metadata.json` strictly, including Snowflake and Spark.

### Problem

Iceberg does not persist the user-defined field IDs of nested struct fields, even when the input schema is fully specified. Instead, it generates new field IDs internally, producing structural differences between the intended schema and what gets stored. We reproduced this behavior with both `HadoopCatalog` and `GlueCatalog`, and in both cases the persisted metadata did not retain the original nested field IDs from the input schema.

### Reproduction

The issue can be reproduced with the following code snippet ([reproduction_code_java.txt](https://github.com/user-attachments/files/20456395/reproduction_code_java.txt)). As a simplified example, consider a schema defined as follows (`catalog` and `tableIdentifier` are set up as in the attached file):

```java
Schema schema = new Schema(
    Types.NestedField.optional(1, "c1", Types.StringType.get()),
    Types.NestedField.optional(2, "c2", Types.StringType.get()),
    Types.NestedField.optional(4, "c3", Types.StructType.of(
        Types.NestedField.optional(3, "pos", Types.StringType.get()))),
    Types.NestedField.optional(5, "c4", Types.StringType.get()));

System.out.println("BEFORE CREATE:");
schema.columns().forEach(f -> System.out.println(f.fieldId() + " -> " + f.name()));

if (!catalog.tableExists(tableIdentifier)) {
  catalog.createTable(tableIdentifier, schema, PartitionSpec.unpartitioned());
}

Table table = catalog.loadTable(tableIdentifier);
Schema saved = table.schema();

System.out.println("AFTER CREATE:");
saved.columns().forEach(f -> System.out.println(f.fieldId() + " -> " + f.name()));
```

Input schema (user-defined):

```
table {
  1: c1: optional string
  2: c2: optional string
  4: c3: optional struct<3: pos: optional string>
  5: c4: optional string
}
```

Stored schema (in `metadata.json`):

```
table {
  1: c1: optional string
  2: c2: optional string
  3: c3: optional struct<5: pos: optional string>
  4: c4: optional string
}
```

> Note: The field IDs in the schema above are intentionally not sequential. This reflects a common pattern in real-world ingestion pipelines, where fields may be reordered, extended, or composed from nested JSON. While assigning IDs in strict sequence might incidentally avoid this issue in simple cases, nested schemas make that error-prone and hard to control, so this should not be treated as a user mistake.

You can see that even though the user assigned field ID 3 to `pos`, it was reassigned to 5 inside the struct.
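For context, based on my own reading of the library rather than anything authoritative: the renumbering appears to come from the fresh-ID pass that table creation applies to every input schema. The sketch below reproduces the same reassignment with `TypeUtil.assignFreshIds` alone, with no catalog involved (`FreshIdDemo` is just a name for this illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;
import org.apache.iceberg.types.Types;

public class FreshIdDemo {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.optional(1, "c1", Types.StringType.get()),
        Types.NestedField.optional(2, "c2", Types.StringType.get()),
        Types.NestedField.optional(4, "c3", Types.StructType.of(
            Types.NestedField.optional(3, "pos", Types.StringType.get()))),
        Types.NestedField.optional(5, "c4", Types.StringType.get()));

    // Re-number every field starting from 1. Top-level fields are visited
    // before the fields nested inside them, so c1..c4 become 1..4 and the
    // nested "pos" becomes 5, matching the stored schema shown above.
    AtomicInteger lastColumnId = new AtomicInteger(0);
    Schema fresh = TypeUtil.assignFreshIds(schema, lastColumnId::incrementAndGet);

    System.out.println(fresh);
  }
}
```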
### Why This Matters

After writing the Iceberg table using Parquet, the data and metadata were successfully stored in S3. However, querying the table in Snowflake fails with the following error ([reproduction_code_java_write_parquet.txt](https://github.com/user-attachments/files/20456586/reproduction_code_java_write_parquet.txt)):

> Parquet file schema node type does not match table column.
> Parquet Node Type: 'GROUP', Expected Type: 'PRIMITIVE', Parquet Path: 'c3', Column Name: 'C4', Column Type: 'string', File Name: 'warehouse/mynamespace.db/table_name_000/ceff.parquet'

The same issue occurs in Apache Spark, which cannot match the field IDs in the metadata against its expected projection. Reads fail even though the data was written successfully. I have also captured a screenshot of the Snowflake error to help validate the issue, if that would be helpful.

### Temporary Workaround

We manually modified `metadata.json` to restore the original nested field ID, after which both Snowflake and Spark were able to read the table successfully. (A less fragile variant is sketched at the end of this issue.)

### Suggestion

I understand that Iceberg may reassign field IDs to ensure internal consistency, but I believe there is a valid use case, especially in Java environments, for preserving explicitly assigned IDs when the full schema is user-defined and intended for external consumption. I'm raising this issue based on the discussion above, and I think it could be a great opportunity for me to try addressing the bug with some guidance from the community. Thank you for your time.

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
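### Appendix: Workaround Sketch

For reference, the hand-edit of `metadata.json` can be avoided by writing files against the schema Iceberg actually persisted, rather than the hand-built `Schema` object. This is only a sketch under the same assumptions as the reproduction snippet (`catalog` and `tableIdentifier` already set up; `GenericRecord` is `org.apache.iceberg.data.GenericRecord`), with the writer wiring abbreviated:

```java
Table table = catalog.loadTable(tableIdentifier);

// Use the schema Iceberg persisted (reassigned IDs included) for every write.
// Field IDs in the Parquet files then match metadata.json, so Snowflake and
// Spark can resolve the columns.
Schema persisted = table.schema();

GenericRecord record = GenericRecord.create(persisted);
record.setField("c1", "value");

// ... hand `persisted` to the Parquet writer as well, for example via
// Parquet.write(outputFile).schema(persisted).
```

Of course, this only sidesteps the symptom; the underlying request stands that explicitly assigned nested field IDs be preserved.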