mzzz-zzm opened a new issue, #1038:
URL: https://github.com/apache/iceberg-go/issues/1038

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   iceberg-go writes v2 manifests that silently omit 
`data_file.distinct_counts`. The Iceberg v2 spec lists `distinct_counts` (field 
id 111) as a writable optional field on `data_file`, but it is not declared in 
any of our per-version Avro record schemas, so the encoder never serializes it. 
Discovered while implementing #1001 (v3 omits the same field, on purpose).
   
   ### Affected versions
   
   Writer paths for **v1 and v2** manifests on `main` (`c62af3c`) and earlier. 
v3 is unaffected — the field is deprecated on v3.
   
   ### Root cause
   
   `dataFile.DistinctCounts` carries the Avro tag `avro:"distinct_counts"` 
([manifest.go#L1802](https://github.com/apache/iceberg-go/blob/main/manifest.go#L1802)),
 but the per-version record schemas in 
[`internal/avro_schemas.go`](https://github.com/apache/iceberg-go/blob/main/internal/avro_schemas.go)
 do not include a `distinct_counts` field declaration:
   
   - `data_file_v1` (line 238) — no `distinct_counts`
   - `data_file_v2` (line 276) — no `distinct_counts`
   - `data_file_v3` (line 360) — no `distinct_counts` (intentional, per #1001)
   
   `hamba/avro` writes only fields declared in the schema, so the Go-side 
pointer is dropped at encode time on every version.
   
   ### Reproduction
   
   Drop the test below into `manifest_test.go` and run against `main`. It fails 
today; it will pass once `data_file_v2` declares `distinct_counts`.
   
   ```go
   // TestWriteManifestV2KeepsDistinctCounts is a regression guard that v2
   // manifest writers preserve data_file.distinct_counts (id 111) per the
   // Iceberg v2 spec.
   func (m *ManifestTestSuite) TestWriteManifestV2KeepsDistinctCounts() {
       partitionSpec := NewPartitionSpecID(0)
       snapshotID := int64(12345678)
       seqNum := int64(9876)
   
       dataFileBuilder, err := NewDataFileBuilder(
           partitionSpec,
           EntryContentData,
           "s3://bucket/ns/table/data/distinct.parquet",
           ParquetFile,
           map[int]any{},
           map[int]string{},
           map[int]int{},
           10,
           10*1000,
       )
       m.Require().NoError(err)
       dataFileBuilder.DistinctValueCounts(map[int]int64{1: 42})
   
       var buf bytes.Buffer
       file, err := WriteManifest(
           "s3://bucket/ns/table/metadata/distinct.avro", &buf, 2,
           partitionSpec,
           NewSchema(0,
               NestedField{ID: 1, Name: "id", Type: Int64Type{}, Required: 
true},
           ),
           snapshotID,
           []ManifestEntry{NewManifestEntry(
               EntryStatusADDED,
               &snapshotID,
               &seqNum, &seqNum,
               dataFileBuilder.Build(),
           )},
       )
       m.Require().NoError(err)
   
       entries, err := ReadManifest(file, &buf, false)
       m.Require().NoError(err)
       m.Require().Len(entries, 1)
   
       m.Equal(map[int]int64{1: 42}, 
entries[0].DataFile().DistinctValueCounts(),
           "v2 manifests must preserve distinct_counts")
   }
   ```
   
   **Observed today**: `entries[0].DataFile().DistinctValueCounts()` returns 
`nil`.
   **Expected**: `map[int]int64{1: 42}`.
   
   ### Suggested fix
   
   Add a `distinct_counts` field declaration (id 111, 
`NullableNode(newMapNode("k111_v112", IntNode, LongNode, 111, 112))`) to 
**`data_file_v1`** and **`data_file_v2`** in `internal/avro_schemas.go`, 
mirroring the existing `value_counts` (109), `null_value_counts` (110), and 
`nan_value_counts` (137) entries.
   
   **Do not** add it to `data_file_v3` — that omission is intentional per #1001 
/ [apache/iceberg#12182](https://github.com/apache/iceberg/pull/12182).
   
   ### Notes
   
   - After #1001 lands, the test above can use the existing 
`writeManifestWithDistinctCounts` helper and collapse to two lines:
     ```go
     func (m *ManifestTestSuite) TestWriteManifestV2KeepsDistinctCounts() {
         m.Equal(map[int]int64{1: 42}, m.writeManifestWithDistinctCounts(2),
             "v2 manifests must preserve distinct_counts")
     }
     ```
   - Discovered during #1001. The defensive guard added there in 
`v3writerImpl.prepareEntry` is forward-compatible with this fix (clears 
`DistinctCounts` on v3 entries before encode), so no v3 regression once the v2 
schema gap closes.
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to