cassio-paesleme opened a new pull request, #897:
URL: https://github.com/apache/iceberg-go/pull/897

   ## Problem
   
   Iceberg Java renamed `added_data_files_count`, `existing_data_files_count`, 
and `deleted_data_files_count` to `added_files_count`, `existing_files_count`, 
and `deleted_files_count` (field IDs 504/505/506) in v1.4 via 
apache/iceberg#5338.
   
   Tables written before that change, or by engines still on older Java 
versions (Athena, some Trino deployments), embed the legacy names in the writer 
schema of every manifest list OCF file. When iceberg-go reads such a file, 
hamba/avro cannot match writer-schema fields to struct tags and silently leaves 
the three count fields at zero.
   
   Zero counts cause `HasAddedFiles()` and `HasExistingFiles()` to return 
false, which breaks `Table.Scan()` and (in combination with the fast-append 
filter bug fixed in #869) causes silent data loss on append.
   
   ## Fix
   
   Add parallel struct fields with the legacy avro tags to `manifestFile`. A 
single `normalizeLegacyCounts()` call, invoked immediately after Avro decode in 
both decode paths (`decodeManifests` and `decodeManifestsWithFallback`), 
promotes legacy field values into the canonical fields. All count getters and 
`Has*` methods remain simple one-liners.
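
As a rough illustration of the shape of this fix, the sketch below uses a trimmed stand-in for `manifestFile`. The Go field names, the `promote` helper, and the exact promotion condition are assumptions made for illustration and are not taken from the actual diff; only the avro tag names and the `normalizeLegacyCounts()` name come from the description above.

```go
package main

import "fmt"

// manifestFile is a trimmed, hypothetical stand-in for the real struct.
type manifestFile struct {
	// Canonical names (field IDs 504/505/506).
	AddedFilesCount    int32 `avro:"added_files_count"`
	ExistingFilesCount int32 `avro:"existing_files_count"`
	DeletedFilesCount  int32 `avro:"deleted_files_count"`

	// Parallel fields that capture the legacy names (illustrative Go names).
	LegacyAddedFilesCount    int32 `avro:"added_data_files_count"`
	LegacyExistingFilesCount int32 `avro:"existing_data_files_count"`
	LegacyDeletedFilesCount  int32 `avro:"deleted_data_files_count"`
}

// promote copies a legacy value into the canonical field when the canonical
// field is unset (0 for V2, -1 for the V1 sentinel). Requiring legacy > 0 is
// an assumption of this sketch.
func promote(canonical *int32, legacy int32) {
	if *canonical <= 0 && legacy > 0 {
		*canonical = legacy
	}
}

// normalizeLegacyCounts is meant to run once, right after Avro decode.
func (m *manifestFile) normalizeLegacyCounts() {
	promote(&m.AddedFilesCount, m.LegacyAddedFilesCount)
	promote(&m.ExistingFilesCount, m.LegacyExistingFilesCount)
	promote(&m.DeletedFilesCount, m.LegacyDeletedFilesCount)
}

func main() {
	// Simulates what a decoder would produce for a file written with the
	// legacy names: only the legacy fields are populated.
	m := manifestFile{LegacyAddedFilesCount: 3, LegacyExistingFilesCount: 5}
	m.normalizeLegacyCounts()
	fmt.Println(m.AddedFilesCount, m.ExistingFilesCount, m.DeletedFilesCount) // prints: 3 5 0
}
```

Because the promotion happens once at decode time, accessors in this sketch could read the canonical fields directly without knowing the legacy names exist.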
   
   The write path emits both field names so that Athena readers (which still 
expect the legacy names) continue to see correct values.
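
The dual-name write can be pictured with the minimal sketch below. `countFields` is a hypothetical helper, not part of iceberg-go, and the map is only a stand-in for the encoded record; how the real write path wires both names into the Avro schema is not shown here.

```go
package main

import "fmt"

// countFields is a hypothetical helper that returns the three count columns
// under both the canonical and the legacy field names, carrying equal values.
func countFields(added, existing, deleted int32) map[string]int32 {
	return map[string]int32{
		// Canonical names.
		"added_files_count":    added,
		"existing_files_count": existing,
		"deleted_files_count":  deleted,
		// Legacy names, mirrored for readers that still expect them.
		"added_data_files_count":    added,
		"existing_data_files_count": existing,
		"deleted_data_files_count":  deleted,
	}
}

func main() {
	rec := countFields(3, 5, 0)
	fmt.Println(rec["added_files_count"], rec["added_data_files_count"]) // prints: 3 3
}
```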
   
   ## Design note
   
   Two other approaches were considered:
   
   **Per-getter coalescing** — each accessor checks both field names. Rejected 
because every accessor must remember the fallback, and any future legacy name 
would require N getter changes.
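
For comparison, the rejected per-getter approach might look roughly like the sketch below. The struct is a hypothetical, trimmed stand-in and `coalesce` is an illustrative helper; neither is taken from #890 or from this PR.

```go
package main

import "fmt"

// manifestFile is a hypothetical, trimmed stand-in with one count only.
type manifestFile struct {
	AddedFilesCount       int32 `avro:"added_files_count"`
	LegacyAddedFilesCount int32 `avro:"added_data_files_count"`
}

// coalesce returns the canonical value when populated, else the legacy one.
func coalesce(canonical, legacy int32) int32 {
	if canonical > 0 {
		return canonical
	}
	return legacy
}

// Every accessor has to repeat the same fallback; forgetting it in one place
// silently reintroduces the zero-count behavior for that accessor.
func (m *manifestFile) AddedDataFiles() int32 {
	return coalesce(m.AddedFilesCount, m.LegacyAddedFilesCount)
}

func (m *manifestFile) HasAddedFiles() bool {
	return coalesce(m.AddedFilesCount, m.LegacyAddedFilesCount) != 0
}

func main() {
	m := manifestFile{LegacyAddedFilesCount: 7}
	fmt.Println(m.AddedDataFiles(), m.HasAddedFiles()) // prints: 7 true
}
```

With three counts and several accessors per count, the fallback would be duplicated many times, which is the maintenance cost the description cites.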
   
   **hamba/avro schema alias + `SchemaCompatibility.Resolve`** — define a 
reader schema with aliases and resolve against the writer schema. Rejected 
because `ocf.NewDecoder` hardcodes the writer schema from the file header and 
provides no way to inject a reader schema. Using a resolved schema would 
require bypassing the OCF decoder entirely.
   
   Post-decode normalization keeps the fallback in a single function (called 
from both decode paths), leaves the getters clean, and handles both the V2 
zero-value case and the V1 `-1` (null/unknown) sentinel via a `<= 0` check.
   
   ## Testing
   
   Two new tests write raw Avro OCF using the pre-1.4 field names (V1 and V2) 
and assert the counts decode correctly.
   
   Verified end-to-end against a real Athena-written Iceberg table: interleaved 
Athena and iceberg-go appends show all rows visible after each write 
(docker/data-platform#406).
   
   ## Related
   
   - #869 — companion fast-append filter fix (a fast-append should never filter 
parent manifests; that fix is correct independently of this one)
   - apache/iceberg-go#890 — independent PR for the same issue; this PR uses 
post-decode normalization rather than the 
parallel-fields-with-per-getter-coalescing approach that was flagged in review 
on #890


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

