mbutrovich commented on PR #1821:
URL: https://github.com/apache/iceberg-rust/pull/1821#issuecomment-3512199866

   > Thanks @mbutrovich for this explanation. I still didn't find related code 
in java repo, could you help to point out?
   
   I fed my notes from last week into Claude to summarize:
   
   #### The Core Issue
   
   When partitioned Parquet files written by Spark are imported with `add_files`:
   - **Partition columns are excluded** from the Parquet file (they're in the 
directory structure)
   - **Remaining field IDs are renumbered** starting from 1 in the Parquet file
   - This creates a collision: field_id=1 in the Iceberg schema (e.g., 
partition column "id") and field_id=1 in the Parquet file (e.g., "name") 
refer to different columns
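   
   A minimal sketch of the collision, using the two-column example above (`id` as the partition column, `name` as the data column). Plain maps stand in for the Iceberg schema and the Parquet footer; the class and method names are illustrative and not part of Iceberg.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Illustrative only: plain maps stand in for an Iceberg schema and a Parquet footer.
   class FieldIdCollision {
       // Iceberg table schema: field ID -> column name.
       static Map<Integer, String> icebergSchema() {
           Map<Integer, String> schema = new LinkedHashMap<>();
           schema.put(1, "id");   // identity partition column
           schema.put(2, "name");
           return schema;
       }

       // Field IDs as found in the data file: "id" is absent (it lives in the
       // directory path) and "name" carries field ID 1.
       static Map<Integer, String> parquetFile() {
           Map<Integer, String> file = new LinkedHashMap<>();
           file.put(1, "name");
           return file;
       }

       // Naive projection: resolve an Iceberg field purely by field ID in the file.
       static String naiveLookup(int fieldId) {
           return parquetFile().get(fieldId);
       }

       public static void main(String[] args) {
           // Asking for Iceberg field 1 ("id") yields the file's "name" column.
           System.out.println(icebergSchema().get(1) + " -> " + naiveLookup(1));
       }
   }
   ```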
   
   #### Java's Solution - "Constants First"
   
   Looking at 
`iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java:197-216`,
 Java's `replaceWithMetadataReader` method has this precedence:
   
   1. Special metadata columns (`ROW_ID`, etc.)
   2. Partition constants (`idToConstant.containsKey(id)`), lines 206-208
   3. Parquet field ID match (returns the reader if not `null`)
   4. Fall back to `initial-default` or `null`
   
   So **Java checks partition constants BEFORE checking Parquet field 
IDs**. The method is called in `BaseParquetReaders.java:250-252`:
   ```java
   ParquetValueReader<?> reader = ParquetValueReaders.replaceWithMetadataReader(
       id, readersById.get(id), idToConstant, constantDefinitionLevel);
   reorderedFields.add(defaultReader(field, reader, constantDefinitionLevel));
   ```
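   
   The precedence can also be modeled without any Iceberg or Parquet dependencies. This is a simplified sketch, not the actual `replaceWithMetadataReader` code: readers are replaced by plain strings, `METADATA_IDS` and its value are hypothetical stand-ins for the reserved metadata column IDs, and `resolve` is an invented name.
   
   ```java
   import java.util.Map;
   import java.util.Set;

   // Simplified model of the "constants first" precedence; not Iceberg's real API.
   class ConstantsFirst {
       // Hypothetical stand-in for reserved metadata column IDs such as ROW_ID.
       static final Set<Integer> METADATA_IDS = Set.of(Integer.MAX_VALUE - 1);

       static String resolve(
               int id,
               Map<Integer, String> readersById,      // columns found in the Parquet file
               Map<Integer, Object> idToConstant,     // identity-partition constants
               Map<Integer, Object> initialDefaults)  // schema-level initial-default values
       {
           // 1. Special metadata columns.
           if (METADATA_IDS.contains(id)) {
               return "metadata";
           }
           // 2. Partition constants win over anything stored in the file.
           if (idToConstant.containsKey(id)) {
               return "constant:" + idToConstant.get(id);
           }
           // 3. Field ID match in the Parquet file.
           String reader = readersById.get(id);
           if (reader != null) {
               return "file:" + reader;
           }
           // 4. Fall back to the initial default, or null.
           Object def = initialDefaults.get(id);
           return def != null ? "default:" + def : "null";
       }
   }
   ```
   
   With `readersById = {1: "name"}` and `idToConstant = {1: 42}`, `resolve(1, ...)` returns `constant:42`, so the mislabeled `name` column in the file is never consulted for field 1.
   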
   
   #### The Spec Ambiguity
   
   The Iceberg spec says:
   "Values for field ids which are not present in a data file must be resolved 
according the following rules:
   1. Return the value from partition metadata if an Identity Transform 
exists..."
   
   The ambiguity: When `field_id=1` exists in Parquet but points to the wrong 
column, is it:
   - A) **"Present"** (because `field_id=1` physically exists in Parquet 
metadata)?
   - B) **"Not present"** (because the semantically correct column isn't in the 
file)?
   
   Java treats it as "Not present" (option B) and uses partition constants 
first.
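   
   The two readings give different results for the same file. The sketch below is a toy model with invented names (`readPresentFirst`, `readConstantsFirst`); it only contrasts the lookup order and does not reproduce either implementation.
   
   ```java
   import java.util.Map;

   // Toy comparison of the two interpretations; all names are hypothetical.
   class PresenceInterpretations {
       // Option A: a field ID found in the file counts as "present", so the file wins.
       static Object readPresentFirst(int id, Map<Integer, Object> fileValues,
                                      Map<Integer, Object> partitionConstants) {
           if (fileValues.containsKey(id)) {
               return fileValues.get(id);
           }
           return partitionConstants.get(id);
       }

       // Option B: an identity-partition constant wins regardless of what the file holds.
       static Object readConstantsFirst(int id, Map<Integer, Object> fileValues,
                                        Map<Integer, Object> partitionConstants) {
           if (partitionConstants.containsKey(id)) {
               return partitionConstants.get(id);
           }
           return fileValues.get(id);
       }
   }
   ```
   
   For a file whose field 1 holds a `name` value such as `"alice"` and a partition constant of `42` for field 1, option A returns `"alice"` while option B returns `42`.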


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

