cassio-paesleme opened a new pull request, #897: URL: https://github.com/apache/iceberg-go/pull/897

## Problem

Iceberg Java renamed `added_data_files_count`, `existing_data_files_count`, and `deleted_data_files_count` to `added_files_count`, `existing_files_count`, and `deleted_files_count` (field IDs 504/505/506) in v1.4 via apache/iceberg#5338. Tables written before that change, or by engines still on older Java versions (Athena, some Trino deployments), embed the legacy names in the writer schema of every manifest list OCF file.

When iceberg-go reads such a file, hamba/avro cannot match writer-schema fields to struct tags and silently leaves the three count fields at zero. Zero counts cause `HasAddedFiles()` and `HasExistingFiles()` to return false, which breaks `Table.Scan()` and (in combination with the fast-append filter bug fixed in #869) causes silent data loss on append.

## Fix

Add parallel struct fields with the legacy avro tags to `manifestFile`. A single `normalizeLegacyCounts()` call, invoked immediately after Avro decode in both decode paths (`decodeManifests` and `decodeManifestsWithFallback`), promotes legacy field values into the canonical fields. All count getters and `Has*` methods remain simple one-liners.

The write path emits both field names so that Athena readers (which still expect the legacy names) continue to see correct values.

## Design note

Two other approaches were considered:

**Per-getter coalescing**: each accessor checks both field names. Rejected because every accessor must remember the fallback, and any future legacy name would require N getter changes.

**hamba/avro schema alias + `SchemaCompatibility.Resolve`**: define a reader schema with aliases and resolve against the writer schema. Rejected because `ocf.NewDecoder` hardcodes the writer schema from the file header and provides no way to inject a reader schema. Using a resolved schema would require bypassing the OCF decoder entirely.
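
For readers who want a concrete picture of the chosen approach, the following is a minimal sketch of parallel legacy fields plus one normalization pass. It is not the code in this PR: the struct is reduced to the three counts, the `Legacy*` field names and the `promote` helper are hypothetical, and the plain `int32` fields and the `!= 0` getter are simplifying assumptions. The extra `legacy > 0` guard is also an assumption; the PR description only states that the canonical value is checked with `<= 0`.

```go
package main

import "fmt"

// manifestFile is a simplified, hypothetical stand-in for the real struct.
// Only the three count fields and their legacy twins are modelled here.
type manifestFile struct {
	// Canonical names (field IDs 504/505/506).
	AddedFilesCount    int32 `avro:"added_files_count"`
	ExistingFilesCount int32 `avro:"existing_files_count"`
	DeletedFilesCount  int32 `avro:"deleted_files_count"`

	// Legacy names found in older writer schemas. Field names are assumed.
	LegacyAddedFilesCount    int32 `avro:"added_data_files_count"`
	LegacyExistingFilesCount int32 `avro:"existing_data_files_count"`
	LegacyDeletedFilesCount  int32 `avro:"deleted_data_files_count"`
}

// promote copies the legacy value into the canonical field when the canonical
// field is unset (0) or holds the -1 "unknown" sentinel.
// The legacy > 0 guard is an assumption of this sketch.
func promote(canonical *int32, legacy int32) {
	if *canonical <= 0 && legacy > 0 {
		*canonical = legacy
	}
}

// normalizeLegacyCounts is meant to run once, right after decoding.
func (m *manifestFile) normalizeLegacyCounts() {
	promote(&m.AddedFilesCount, m.LegacyAddedFilesCount)
	promote(&m.ExistingFilesCount, m.LegacyExistingFilesCount)
	promote(&m.DeletedFilesCount, m.LegacyDeletedFilesCount)
}

// HasAddedFiles stays a one-liner because normalization already ran.
func (m *manifestFile) HasAddedFiles() bool { return m.AddedFilesCount != 0 }

func main() {
	// Simulates a decode result where only the legacy names matched.
	m := manifestFile{
		ExistingFilesCount:       -1,
		LegacyAddedFilesCount:    3,
		LegacyExistingFilesCount: 2,
	}
	fmt.Println("before:", m.HasAddedFiles(), m.AddedFilesCount, m.ExistingFilesCount)
	m.normalizeLegacyCounts()
	fmt.Println("after: ", m.HasAddedFiles(), m.AddedFilesCount, m.ExistingFilesCount)
}
```

In this sketch a canonical value that is already positive is never overwritten, so files written with the current names decode exactly as before.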

Post-decode normalization is a single call site, leaves the getters clean, and handles both the V2 zero-value case and the V1 `-1` (null/unknown) sentinel via a `<= 0` check.

## Testing

Two new tests write raw Avro OCF using the pre-1.4 field names (V1 and V2) and assert the counts decode correctly.

Verified end-to-end against a real Athena-written Iceberg table: interleaved Athena and iceberg-go appends show all rows visible after each write (docker/data-platform#406).

## Related

- #869: companion fast-append filter fix (a fast-append should never filter parent manifests; that fix is correct independently of this one)
- apache/iceberg-go#890: independent PR for the same issue; this PR uses post-decode normalization rather than the parallel-fields-with-per-getter-coalescing approach that was flagged in review on #890
