mbutrovich commented on PR #1821:
URL: https://github.com/apache/iceberg-rust/pull/1821#issuecomment-3497736468
I asked Claude to try to summarize the issue for me, based on what Iceberg
Java does, the test scenario failing, and what we're seeing in the
`RecordBatchTransformer`:
Scenario: Iceberg Java Spark test
(`TestAddFilesProcedure.addPartitionToPartitioned`) writes partitioned Parquet
files by excluding partition columns and renumbering remaining field IDs
starting from 1. When imported via `add_files`:
- Iceberg schema: field_id=1 → "id" (partition column), field_id=2 → "name",
field_id=3 → "dept", field_id=4 → "subdept"
- Spark Parquet file: field_id=1 → "name", field_id=2 → "dept", field_id=3 →
"subdept" (renumbered, "id" excluded)
Java behavior (BaseParquetReaders.java:299-314):
```java
if (idToConstant.containsKey(id)) {
  // Use constant from partition metadata
} else if (reader != null) {
  // Use Parquet column by field ID
}
```
Java checks `idToConstant` before checking if a Parquet reader exists,
prioritizing partition constants over field ID matches.
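To make that ordering concrete, here is a minimal sketch of a constants-first lookup applied to the field IDs from the scenario. It is not the actual `BaseParquetReaders` code: the class name, the `resolve` helper, the string return values, and the partition value `1` are all assumptions made up for illustration.

```java
import java.util.Map;

final class ConstantFirstResolution {
    // Field IDs found in the Spark-written Parquet file, as listed in the scenario.
    static final Map<Integer, String> FILE_COLUMNS =
        Map.of(1, "name", 2, "dept", 3, "subdept");

    // Partition constant for table field 1 ("id"). The value 1 is a placeholder,
    // not something taken from the failing test.
    static final Map<Integer, Object> ID_TO_CONSTANT = Map.of(1, 1);

    // Hypothetical helper: check the constant map first, then the file's field IDs.
    static String resolve(int fieldId) {
        if (ID_TO_CONSTANT.containsKey(fieldId)) {
            return "constant:" + ID_TO_CONSTANT.get(fieldId);
        } else if (FILE_COLUMNS.containsKey(fieldId)) {
            return "file-column:" + FILE_COLUMNS.get(fieldId);
        }
        return "missing";
    }

    public static void main(String[] args) {
        // Walk the four table field IDs from the scenario.
        for (int id = 1; id <= 4; id++) {
            System.out.println(id + " -> " + resolve(id));
        }
    }
}
```

With this order, field 1 never reaches the file lookup, so the Parquet column that happens to carry field_id=1 ("name") is not read for the "id" column.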
Spec says (https://iceberg.apache.org/spec/#column-projection):
"Columns in Iceberg data files are selected by field id."
"Values for field ids which are not present in a data file must be resolved
according to..." [fallback rules including partition constants]
Question: When field_id=1 exists in Parquet but refers to a semantically
different column ("name" instead of "id"), should it be considered:
1. "Present" (use the Parquet column, even though it's wrong), or
2. "Not present" (apply fallback rules, use partition constant)?
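The premise of the question, that a field ID is present in the file but names a different column, can be illustrated by comparing names per field ID. The sketch below is only an illustration under that assumption: name comparison is not something the spec prescribes, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class FieldIdCollision {
    // Iceberg table schema from the scenario: field ID -> column name.
    static final Map<Integer, String> TABLE_SCHEMA =
        Map.of(1, "id", 2, "name", 3, "dept", 4, "subdept");

    // Field IDs in the Spark-written Parquet file ("id" excluded, rest renumbered).
    static final Map<Integer, String> FILE_SCHEMA =
        Map.of(1, "name", 2, "dept", 3, "subdept");

    // Hypothetical check: IDs present on both sides whose column names differ.
    static List<Integer> conflictingIds(Map<Integer, String> table,
                                        Map<Integer, String> file) {
        List<Integer> conflicts = new ArrayList<>();
        // Sort the IDs so the result order is deterministic.
        for (Integer id : new TreeSet<>(file.keySet())) {
            String expected = table.get(id);
            if (expected != null && !expected.equals(file.get(id))) {
                conflicts.add(id);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        System.out.println(conflictingIds(TABLE_SCHEMA, FILE_SCHEMA));
    }
}
```

For the IDs listed in the scenario, this check flags field 1 and also fields 2 and 3, since every remaining column was shifted down by one when "id" was dropped.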
Java's approach suggests partition constants take priority even when field
IDs exist in Parquet. Is this spec-compliant handling of field ID conflicts
from `add_files`, or does Java implement additional logic beyond the spec?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]