mzzz-zzm opened a new issue, #1038: URL: https://github.com/apache/iceberg-go/issues/1038
### Apache Iceberg version main (development) ### Please describe the bug 🐞 iceberg-go writes v2 manifests that silently omit `data_file.distinct_counts`. The Iceberg v2 spec lists `distinct_counts` (field id 111) as a writable optional field on `data_file`, but it is not declared in any of our per-version Avro record schemas, so the encoder never serializes it. Discovered while implementing #1001 (v3 omits the same field, on purpose). ### Affected versions Writer paths for **v1 and v2** manifests on `main` (`c62af3c`) and earlier. v3 is unaffected — the field is deprecated on v3. ### Root cause `dataFile.DistinctCounts` carries the Avro tag `avro:"distinct_counts"` ([manifest.go#L1802](https://github.com/apache/iceberg-go/blob/main/manifest.go#L1802)), but the per-version record schemas in [`internal/avro_schemas.go`](https://github.com/apache/iceberg-go/blob/main/internal/avro_schemas.go) do not include a `distinct_counts` field declaration: - `data_file_v1` (line 238) — no `distinct_counts` - `data_file_v2` (line 276) — no `distinct_counts` - `data_file_v3` (line 360) — no `distinct_counts` (intentional, per #1001) `hamba/avro` writes only fields declared in the schema, so the Go-side pointer is dropped at encode time on every version. ### Reproduction Drop the test below into `manifest_test.go` and run against `main`. It fails today; it will pass once `data_file_v2` declares `distinct_counts`. ```go // TestWriteManifestV2KeepsDistinctCounts is a regression guard that v2 // manifest writers preserve data_file.distinct_counts (id 111) per the // Iceberg v2 spec. func (m *ManifestTestSuite) TestWriteManifestV2KeepsDistinctCounts() { partitionSpec := NewPartitionSpecID(0) snapshotID := int64(12345678) seqNum := int64(9876) dataFileBuilder, err := NewDataFileBuilder( partitionSpec, EntryContentData, "s3://bucket/ns/table/data/distinct.parquet", ParquetFile, map[int]any{}, map[int]string{}, map[int]int{}, 10, 10*1000, ) m.Require().NoError(err) dataFileBuilder.DistinctValueCounts(map[int]int64{1: 42}) var buf bytes.Buffer file, err := WriteManifest( "s3://bucket/ns/table/metadata/distinct.avro", &buf, 2, partitionSpec, NewSchema(0, NestedField{ID: 1, Name: "id", Type: Int64Type{}, Required: true}, ), snapshotID, []ManifestEntry{NewManifestEntry( EntryStatusADDED, &snapshotID, &seqNum, &seqNum, dataFileBuilder.Build(), )}, ) m.Require().NoError(err) entries, err := ReadManifest(file, &buf, false) m.Require().NoError(err) m.Require().Len(entries, 1) m.Equal(map[int]int64{1: 42}, entries[0].DataFile().DistinctValueCounts(), "v2 manifests must preserve distinct_counts") } ``` **Observed today**: `entries[0].DataFile().DistinctValueCounts()` returns `nil`. **Expected**: `map[int]int64{1: 42}`. ### Suggested fix Add a `distinct_counts` field declaration (id 111, `NullableNode(newMapNode("k111_v112", IntNode, LongNode, 111, 112))`) to **`data_file_v1`** and **`data_file_v2`** in `internal/avro_schemas.go`, mirroring the existing `value_counts` (109), `null_value_counts` (110), and `nan_value_counts` (137) entries. **Do not** add it to `data_file_v3` — that omission is intentional per #1001 / [apache/iceberg#12182](https://github.com/apache/iceberg/pull/12182). ### Notes - After #1001 lands, the test above can use the existing `writeManifestWithDistinctCounts` helper and collapse to two lines: ```go func (m *ManifestTestSuite) TestWriteManifestV2KeepsDistinctCounts() { m.Equal(map[int]int64{1: 42}, m.writeManifestWithDistinctCounts(2), "v2 manifests must preserve distinct_counts") } ``` - Discovered during #1001. The defensive guard added there in `v3writerImpl.prepareEntry` is forward-compatible with this fix (clears `DistinctCounts` on v3 entries before encode), so no v3 regression once the v2 schema gap closes. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
