liurenjie1024 commented on PR #1821:
URL: https://github.com/apache/iceberg-rust/pull/1821#issuecomment-3510572330

   > So I apologize because I kept posting comments as I thought I understood 
what was happening, then I'd second-guess myself and delete the comment, so I 
set the PR to draft while I investigated. As always, I appreciate your 
patience. Basically, all of these changes have to do with Iceberg Java's 
`TestAddFilesProcedure` suite.
   > 
   > Here's what I can summarize from my findings today:
   > 
   > > We don't need this check? I think if we already found a field by id, then we 
should just use this column?
   > 
   > You're right that we normally trust field IDs, but we do need the name 
check when `name_mapping` is present. In `add_files` scenarios (like Iceberg 
Java's `TestAddFilesProcedure.addDataPartitioned`), Parquet field IDs can 
conflict with Iceberg field IDs:
   > 
   > * Parquet: field_id=1→"name", field_id=2→"dept"
   > * Iceberg: field_id=1→"id", field_id=2→"name"
   > 
   > Without name checking, when looking for Iceberg field_id=2 ("name"), we'd 
find Parquet field_id=2 ("dept") and read the wrong column.
   > 
   > To fix this, we only check names when `name_mapping` is present (which 
indicates potential conflicts). Without `name_mapping`, a name mismatch is just 
a column rename, so we trust the field ID.
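   > 
   > A minimal Rust-flavored sketch of this rule (hypothetical types and 
function names, not the actual reader code), assuming a flat list of Parquet 
fields:
   > 
```rust
// Hypothetical, simplified types for illustration only; the real reader
// operates on Parquet/Arrow schema nodes rather than a flat struct.
#[derive(Debug)]
struct ParquetField {
    field_id: i32,
    name: String,
}

// Without a name mapping, a name mismatch is treated as a column rename and
// the field id alone is trusted. With a name mapping present (the add_files
// case), ids may conflict, so the name must match too.
fn resolve_by_id<'a>(
    parquet_fields: &'a [ParquetField],
    iceberg_field_id: i32,
    iceberg_field_name: &str,
    has_name_mapping: bool,
) -> Option<&'a ParquetField> {
    parquet_fields.iter().find(|f| {
        f.field_id == iceberg_field_id
            && (!has_name_mapping || f.name == iceberg_field_name)
    })
}

fn main() {
    // Parquet: field_id=1 -> "name", field_id=2 -> "dept"
    let parquet = vec![
        ParquetField { field_id: 1, name: "name".into() },
        ParquetField { field_id: 2, name: "dept".into() },
    ];

    // Iceberg wants field_id=2 -> "name". With a name mapping present, the
    // id-only match ("dept") is rejected, so we fall back to the name-mapping
    // rules instead of silently reading the wrong column.
    assert!(resolve_by_id(&parquet, 2, "name", true).is_none());

    // Without a name mapping, the id match is trusted as a simple rename.
    assert_eq!(resolve_by_id(&parquet, 2, "name", false).unwrap().name, "dept");
}
```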
   > 
   > > This should not be the first step. According to the projection rule, this 
only happens as the first check when lookup by id fails.
   > 
   > You're right that the spec says to check these rules when a field is "not 
present." However, Java checks partition constants before Parquet field IDs 
(`BaseParquetReaders.java:299`), and this is intentional. In `add_files`, 
partition columns can exist in both Parquet and partition metadata. The 
partition metadata is authoritative—it defines which partition this file 
belongs to. If we check Parquet first, we'd read the wrong values.
   > 
   > The spec's intent is that identity-partitioned fields are "not present" in 
data files by definition, even if they physically exist in the file.
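   > 
   > A minimal Rust-flavored sketch of that lookup order (hypothetical names, 
not the actual reader code):
   > 
```rust
use std::collections::HashMap;

// Hypothetical, simplified plan for where a column's values come from.
#[derive(Debug, PartialEq)]
enum ColumnSource {
    // Constant taken from the file's partition metadata.
    PartitionConstant(String),
    // Read from the Parquet column with this field id.
    ParquetColumn(i32),
}

// Check partition constants before the Parquet schema, mirroring the order
// described above: identity-partitioned fields are "not present" in the data
// file by definition, even if a column with that id physically exists.
fn plan_column(
    field_id: i32,
    partition_constants: &HashMap<i32, String>,
    parquet_field_ids: &[i32],
) -> Option<ColumnSource> {
    if let Some(v) = partition_constants.get(&field_id) {
        return Some(ColumnSource::PartitionConstant(v.clone()));
    }
    if parquet_field_ids.contains(&field_id) {
        return Some(ColumnSource::ParquetColumn(field_id));
    }
    None
}

fn main() {
    // The partition column (field_id=3) also physically exists in the file,
    // but the partition metadata is authoritative and wins.
    let constants = HashMap::from([(3, "hr".to_string())]);
    let in_file = [1, 2, 3];
    assert_eq!(
        plan_column(3, &constants, &in_file),
        Some(ColumnSource::PartitionConstant("hr".to_string()))
    );
    assert_eq!(plan_column(1, &constants, &in_file), Some(ColumnSource::ParquetColumn(1)));
}
```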
   > 
   > This design was the only way I could get all of the tests in Iceberg 
Java's suite to pass. The subtlety seems to be that the spec is not entirely 
clear about what to do when metadata conflicts between Iceberg and Parquet 
after a migration or schema change, so I chose to call the invalid metadata 
"not present", as Java does.
   > 
   > Please feel free to let me know if I misunderstood. I tried to add a test 
that describes the scenario, along with comments on why the design is the way 
it is.
   
   Thanks for the explanation, it's clear and easy to understand. I'll review 
the PR again.

