liurenjie1024 commented on PR #1821:
URL: https://github.com/apache/iceberg-rust/pull/1821#issuecomment-3510572330
> So I apologize because I kept posting comments as I thought I understood
what was happening, then I'd second guess myself and delete the comment. So I
just set it to draft while I investigated. As always, I appreciate your
patience. Basically all of these changes have to do with Iceberg Java's
`TestAddFilesProcedure` suite.
>
> Here's what I can summarize from my findings today:
>
> > We don't this check? I think if we already found a field by id, then we
should just use this column?
>
> You're right that we normally trust field IDs, but we do need the name
check when `name_mapping` is present. In `add_files` scenarios (like Iceberg
Java's `TestAddFilesProcedure.addDataPartitioned`), Parquet field IDs can
conflict with Iceberg field IDs:
>
> * Parquet: field_id=1→"name", field_id=2→"dept"
> * Iceberg: field_id=1→"id", field_id=2→"name"
>
> Without name checking, when looking for Iceberg field_id=2 ("name"), we'd
find Parquet field_id=2 ("dept") and read the wrong column.
>
> To fix this, we only check names when `name_mapping` is present (indicates
potential conflicts). Without `name_mapping`, name mismatches are just column
renames, so we trust the field ID.
>
> > This should not be first step. According the projection rule, this only
happens as first check when look up by id failed.
>
> You're right that the spec says to check these rules when a field is "not
present." However, Java checks partition constants before Parquet field IDs
(`BaseParquetReaders.java:299`), and this is intentional. In `add_files`,
partition columns can exist in both Parquet and partition metadata. The
partition metadata is authoritative—it defines which partition this file
belongs to. If we check Parquet first, we'd read the wrong values.
>
> The spec's intent is that identity-partitioned fields are "not present" in
data files by definition, even if they physically exist in the file.
>
> This design was the only way I could get all of the tests in Iceberg
Java's test suite to pass, and the subtlety seems to be that the spec is not
totally clear on what to do when metadata conflicts between Iceberg and Parquet
after a migration or schema change, and I had to choose to call invalid
metadata "not present" as Java does.
>
> Please feel free to let me know if I misunderstood. I tried to make a test
that describes the scenario, and add comments on why the design is the way it
is.
Thanks for the explaination, it's clear and easy to understand. I'll review
the pr again.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]