sclee01 opened a new issue, #13164:
URL: https://github.com/apache/iceberg/issues/13164

   ### Apache Iceberg version
   
   1.9.0 (latest release)
   
   ### Query engine
   
   Snowflake
   
   ### Please describe the bug 🐞
   
   ### Description
   
   When creating an Iceberg table using a manually defined schema in Java — 
where field IDs are explicitly assigned — we observed that nested struct field 
IDs are silently reassigned during table creation.
   As a result, the schema stored in metadata.json no longer matches the 
original schema the user defined.
   
   This mismatch can break compatibility with external engines that rely on 
strict interpretation of metadata.json, including Snowflake and Spark.
   
   
   ### Problem
   
   Iceberg does not persist the original user-defined field IDs in nested 
structs, even when the input schema is fully specified.
   Instead, it generates new field IDs internally, leading to structural 
differences between the intended schema and what gets stored.
   
   We reproduced this behavior using both HadoopCatalog and GlueCatalog, and 
confirmed that the persisted metadata did not retain the original nested field 
IDs from the input schema.
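   For what it's worth, the renumbering appears to come from the fresh-ID assignment that runs while the catalog builds new table metadata (TypeUtil.assignFreshIds). The following standalone sketch reproduces the same renumbering without any catalog involved; the class name is ours, but the TypeUtil call is Iceberg's public API:

       import java.util.concurrent.atomic.AtomicInteger;
       import org.apache.iceberg.Schema;
       import org.apache.iceberg.types.TypeUtil;
       import org.apache.iceberg.types.Types;

       public class FreshIdDemo {
         public static void main(String[] args) {
           Schema input = new Schema(
               Types.NestedField.optional(1, "c1", Types.StringType.get()),
               Types.NestedField.optional(2, "c2", Types.StringType.get()),
               Types.NestedField.optional(4, "c3",
                   Types.StructType.of(
                       Types.NestedField.optional(3, "pos", Types.StringType.get()))),
               Types.NestedField.optional(5, "c4", Types.StringType.get()));

           // Replaces every field ID with the next counter value. Sibling fields
           // of a struct receive IDs before their nested fields, which is why
           // pos ends up with ID 5 below.
           AtomicInteger lastColumnId = new AtomicInteger(0);
           Schema fresh = TypeUtil.assignFreshIds(input, lastColumnId::incrementAndGet);
           System.out.println(fresh.asStruct());
           // struct<1: c1: optional string, 2: c2: optional string,
           //        3: c3: optional struct<5: pos: optional string>,
           //        4: c4: optional string>
         }
       }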
   
   
   ### Reproduction
   The issue can be reproduced using the following code snippet.  
([reproduction_code_java.txt](https://github.com/user-attachments/files/20456395/reproduction_code_java.txt))
   
   As a simplified example, consider the case where the schema is defined as 
follows:
   
       Schema schema = new Schema(
               Types.NestedField.optional(1, "c1", Types.StringType.get()),
               Types.NestedField.optional(2, "c2", Types.StringType.get()),
               Types.NestedField.optional(4, "c3",
                       Types.StructType.of(
                               Types.NestedField.optional(3, "pos", Types.StringType.get())
                       )
               ),
               Types.NestedField.optional(5, "c4", Types.StringType.get())
       );
   
       System.out.println("BEFORE CREATE:");
       schema.columns().forEach(f -> System.out.println(f.fieldId() + " -> " + 
f.name()));
   
   
       if (!catalog.tableExists(tableIdentifier)) {
         catalog.createTable(tableIdentifier, schema, PartitionSpec.unpartitioned());
       }
   
   
       Table table = catalog.loadTable(tableIdentifier);
       Schema saved = table.schema();
       System.out.println("AFTER CREATE:");
       saved.columns().forEach(f -> System.out.println(f.fieldId() + " -> " + f.name()));
   
       /*
       Input schema (user-defined)
       table {
         1: c1: optional string
         2: c2: optional string
         4: c3: optional struct<3: pos: optional string>
         5: c4: optional string
       }
       */
   
   
       /*
       Stored schema (in metadata.json)
       table {
          1: c1: optional string
         2: c2: optional string
         3: c3: optional struct<5: pos: optional string>
         4: c4: optional string
       }
       */
   
   > Note: The field IDs in the schema above are intentionally not sequential. This reflects a common pattern in real-world ingestion pipelines, where fields may be reordered, extended, or composed into nested JSON. While assigning IDs in strict sequence might incidentally avoid the issue in simple cases, nested schemas make that approach error-prone and hard to control, so this should not be treated as a user mistake.
   
   As shown above, even though the user assigned field ID 3 to pos, it was reassigned to 5 inside the struct.
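
   To check this programmatically rather than by reading the printed columns, the nested field can be looked up by its dotted path (continuing from the reproduction code above, after loadTable):

       Types.NestedField pos = table.schema().findField("c3.pos");
       System.out.println(pos.fieldId());  // prints 5, not the 3 we assigned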
   
   
   
   ### Why This Matters
   
   After writing data to the Iceberg table as Parquet, the files and metadata were stored in S3 successfully.
   However, when querying the table from Snowflake, the following error occurs:  
   ([reproduction_code_java_write_parquet.txt](https://github.com/user-attachments/files/20456586/reproduction_code_java_write_parquet.txt))
   
   > Parquet file schema node type does not match table column. Parquet Node 
Type: 'GROUP', Expected Type: 'PRIMITIVE', Parquet Path: 'c3', Column Name: 
'C4', Column Type: 'string', File Name: 
'warehouse/mynamespace.db/table_name_000/ceff.parquet'
   
   The error looks consistent with ID-based column resolution: the Parquet data files carry the user-assigned IDs (where c3 has field ID 4), while metadata.json now assigns ID 4 to c4, so the engine resolves table column C4 to the Parquet GROUP node c3 and reports the type mismatch.
   The same issue occurs in Apache Spark, which cannot match the field IDs in the metadata against the expected projection, so reads fail on data that was otherwise written successfully.
   I have also captured a screenshot of the Snowflake error below, in case it helps validate the issue.
   
   
![Image](https://github.com/user-attachments/assets/ac18ff09-f5c0-4369-b22d-5f7f4e2a0a75)
   
   
   
   
   
   ### Temporary Workaround
   We manually edited metadata.json to restore the original nested field IDs, after which both Snowflake and Spark were able to read the table successfully.
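
   For anyone attempting the same workaround, the stored schema can be dumped as JSON first to see exactly which nested "id" values to patch; SchemaParser is the serializer Iceberg itself uses for the schema section of metadata.json (continuing from the reproduction code above):

       import org.apache.iceberg.SchemaParser;

       // Pretty-prints the schema JSON; the nested field "id" entries shown
       // here are the values we hand-edited back to the user-assigned IDs.
       System.out.println(SchemaParser.toJson(table.schema(), true));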
   
   
   
   
   ### Suggestion
   I understand that Iceberg may reassign field IDs to ensure internal consistency, but I believe there is a valid use case, especially in Java environments, for preserving explicitly assigned IDs when the full schema is user-defined and intended for external consumption.
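
   To make the suggestion concrete, here is a rough sketch of the kind of pre-check a catalog could run before deciding to reassign. The helper name and placement are hypothetical, not an existing Iceberg API; it only verifies that the caller's IDs are positive and unique:

       import java.util.HashSet;
       import java.util.Set;
       import org.apache.iceberg.Schema;
       import org.apache.iceberg.types.Types;

       public class ExplicitIdCheck {
         // Hypothetical helper: true when every field ID, including nested
         // struct fields, is positive and unique, so the caller's IDs could
         // in principle be preserved as-is.
         public static boolean idsAreUsableAsIs(Schema schema) {
           return visit(schema.asStruct(), new HashSet<>());
         }

         private static boolean visit(Types.StructType struct, Set<Integer> seen) {
           for (Types.NestedField field : struct.fields()) {
             if (field.fieldId() <= 0 || !seen.add(field.fieldId())) {
               return false; // non-positive or duplicate ID: must reassign
             }
             // For brevity this only recurses into structs; list and map types
             // would need the same treatment in a complete implementation.
             if (field.type().isStructType() && !visit(field.type().asStructType(), seen)) {
               return false;
             }
           }
           return true;
         }
       }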
   
   I'm raising this issue based on the observations above, and I think it could be a good opportunity for me to try addressing the bug with some guidance from the community.
   
   
   Thank you for your time.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time

