laskoviymishka commented on code in PR #1039:
URL: https://github.com/apache/iceberg-go/pull/1039#discussion_r3208042674
##########
manifest_test.go:
##########
@@ -2115,6 +2115,67 @@ func (m *ManifestTestSuite)
TestManifestRoundTripSortOrderID() {
m.Equal(expectedSortOrderID, *got)
}
+// writeManifestWithDistinctCounts builds a single-entry manifest at the given
+// version with distinct_counts populated for one column, round-trips it
through
+// WriteManifest + ReadManifest, and returns the distinct counts observed by
+// the reader.
+func (m *ManifestTestSuite) writeManifestWithDistinctCounts(version int)
map[int]int64 {
+ partitionSpec := NewPartitionSpecID(0)
+ snapshotID := int64(12345678)
+ seqNum := int64(9876)
+
+ dataFileBuilder, err := NewDataFileBuilder(
+ partitionSpec,
+ EntryContentData,
+ "s3://bucket/ns/table/data/distinct.parquet",
+ ParquetFile,
+ map[int]any{},
+ map[int]string{},
+ map[int]int{},
+ 10,
+ 10*1000,
+ )
+ m.Require().NoError(err)
+ dataFileBuilder.DistinctValueCounts(map[int]int64{1: 42})
+
+ var buf bytes.Buffer
+ file, err := WriteManifest(
+ "s3://bucket/ns/table/metadata/distinct.avro", &buf, version,
+ partitionSpec,
+ NewSchema(0,
+ NestedField{ID: 1, Name: "id", Type: Int64Type{},
Required: true},
+ ),
+ snapshotID,
+ []ManifestEntry{NewManifestEntry(
+ EntryStatusADDED,
+ &snapshotID,
+ &seqNum, &seqNum,
+ dataFileBuilder.Build(),
+ )},
+ )
+ m.Require().NoError(err)
+
+ entries, err := ReadManifest(file, &buf, false)
+ m.Require().NoError(err)
+ m.Require().Len(entries, 1)
+
+ return entries[0].DataFile().DistinctValueCounts()
+}
+
+// TestWriteManifestV3OmitsDistinctCounts verifies that v3 manifest writers do
+// not surface data_file.distinct_counts on the read side (deprecated by the v3
+// spec; parity with Java apache/iceberg#12182). Today this passes because none
+// of the per-version Avro record schemas in internal/avro_schemas.go declare
+// a distinct_counts field, so the field is dropped at encode time on every
+// version. The v3writerImpl.prepareEntry guard that clears DistinctCounts is
+// forward-compatibility insurance: if the v2 schema is later fixed to declare
+// distinct_counts (the v2 spec requires it; tracked separately), this test
+// keeps the v3 path compliant without further code changes.
+func (m *ManifestTestSuite) TestWriteManifestV3OmitsDistinctCounts() {
+ got := m.writeManifestWithDistinctCounts(3)
+ m.Empty(got, "v3 manifests must omit distinct_counts")
Review Comment:
This assertion is true on `main` today without the guard — the per-version
Avro schemas don't declare `distinct_counts`, so the round-trip is empty either
way. You called this out in the PR body, but it means `m.Empty(got)` proves the
schema, not the guard.
I'd add a second assertion that pins the actual code path this PR
introduces: the in-memory `*dataFile` has its `DistinctCounts` cleared
post-write. Something like:
```go
entry := NewManifestEntry(EntryStatusADDED, &snapshotID, &seqNum, &seqNum,
dataFileBuilder.Build())
_, err := WriteManifest(... []ManifestEntry{entry})
m.Require().NoError(err)
m.Nil(entry.DataFile().(*dataFile).DistinctCounts, "v3 writer must clear
DistinctCounts on the entry's *dataFile")
```
That actually flips red if the `df.DistinctCounts = nil` line is removed,
which is the contract we want. The current round-trip assertion can stay as a
redundant on-disk check, or be deferred to land alongside #1038 when it stops
being vacuous.
##########
manifest.go:
##########
@@ -944,6 +944,19 @@ func (v3writerImpl) prepareEntry(entry *manifestEntry,
snapshotID int64) (Manife
}
}
+ // v3 spec deprecates data_file.distinct_counts (parity with Java
+ // apache/iceberg#12182). Today none of the per-version Avro schemas in
+ // internal/avro_schemas.go declare distinct_counts, so the field is
+ // already dropped on encode for every version. This guard is
+ // forward-compatibility insurance: if the v2 schema is later fixed to
+ // declare distinct_counts (v2 spec requires it; tracked separately),
the
+ // v3 path stays compliant by clearing the Avro-facing pointer here. The
+ // cached distinctCntMap and DistinctValueCounts() getter are left alone
+ // so in-memory readers in the same process keep their view.
+ if df, ok := entry.DataFile().(*dataFile); ok {
Review Comment:
The type assertion silently no-ops for any third-party `DataFile`
implementation. `DataFile` is a public interface, so external callers can pass
their own type and the guard becomes a no-op for them.
Probably fine to defer (the framing is "defensive insurance" anyway), but
I'd add one line to the comment so future readers don't think this is a hard
gate: `// Best-effort — only clears the Avro-facing pointer for the in-tree
*dataFile; third-party DataFile impls bypass this guard.`
##########
manifest.go:
##########
@@ -944,6 +944,19 @@ func (v3writerImpl) prepareEntry(entry *manifestEntry,
snapshotID int64) (Manife
}
}
+ // v3 spec deprecates data_file.distinct_counts (parity with Java
+ // apache/iceberg#12182). Today none of the per-version Avro schemas in
+ // internal/avro_schemas.go declare distinct_counts, so the field is
+ // already dropped on encode for every version. This guard is
+ // forward-compatibility insurance: if the v2 schema is later fixed to
+ // declare distinct_counts (v2 spec requires it; tracked separately),
the
+ // v3 path stays compliant by clearing the Avro-facing pointer here. The
+ // cached distinctCntMap and DistinctValueCounts() getter are left alone
+ // so in-memory readers in the same process keep their view.
+ if df, ok := entry.DataFile().(*dataFile); ok {
+ df.DistinctCounts = nil
Review Comment:
Worth flagging: this mutates the caller's `*dataFile` in place.
`DataFileBuilder.Build()` returns `b.d` directly (same pointer), so after
`WriteManifest` returns, the caller's reference now has `DistinctCounts == nil`
even though they explicitly set it via `DistinctValueCounts(...)`.
Also applies to the rewrite-existing path in
`table/snapshot_producers.go:215` (`wr.Existing(entry)`): we mutate entries
that came from `ReadManifest` and may be referenced elsewhere in the same
process.
I'd either clone the data file before clearing, or add a sentence to the
comment that `prepareEntry` is documented to take ownership of the entry's data
file. wdyt?
##########
manifest_test.go:
##########
@@ -2115,6 +2115,67 @@ func (m *ManifestTestSuite)
TestManifestRoundTripSortOrderID() {
m.Equal(expectedSortOrderID, *got)
}
+// writeManifestWithDistinctCounts builds a single-entry manifest at the given
+// version with distinct_counts populated for one column, round-trips it
through
+// WriteManifest + ReadManifest, and returns the distinct counts observed by
+// the reader.
+func (m *ManifestTestSuite) writeManifestWithDistinctCounts(version int)
map[int]int64 {
+ partitionSpec := NewPartitionSpecID(0)
+ snapshotID := int64(12345678)
+ seqNum := int64(9876)
+
+ dataFileBuilder, err := NewDataFileBuilder(
+ partitionSpec,
+ EntryContentData,
+ "s3://bucket/ns/table/data/distinct.parquet",
+ ParquetFile,
+ map[int]any{},
+ map[int]string{},
+ map[int]int{},
+ 10,
+ 10*1000,
+ )
+ m.Require().NoError(err)
+ dataFileBuilder.DistinctValueCounts(map[int]int64{1: 42})
+
+ var buf bytes.Buffer
+ file, err := WriteManifest(
+ "s3://bucket/ns/table/metadata/distinct.avro", &buf, version,
+ partitionSpec,
+ NewSchema(0,
+ NestedField{ID: 1, Name: "id", Type: Int64Type{},
Required: true},
+ ),
+ snapshotID,
+ []ManifestEntry{NewManifestEntry(
+ EntryStatusADDED,
+ &snapshotID,
+ &seqNum, &seqNum,
+ dataFileBuilder.Build(),
+ )},
+ )
+ m.Require().NoError(err)
+
+ entries, err := ReadManifest(file, &buf, false)
+ m.Require().NoError(err)
+ m.Require().Len(entries, 1)
+
+ return entries[0].DataFile().DistinctValueCounts()
+}
+
+// TestWriteManifestV3OmitsDistinctCounts verifies that v3 manifest writers do
+// not surface data_file.distinct_counts on the read side (deprecated by the v3
+// spec; parity with Java apache/iceberg#12182). Today this passes because none
+// of the per-version Avro record schemas in internal/avro_schemas.go declare
+// a distinct_counts field, so the field is dropped at encode time on every
+// version. The v3writerImpl.prepareEntry guard that clears DistinctCounts is
+// forward-compatibility insurance: if the v2 schema is later fixed to declare
+// distinct_counts (the v2 spec requires it; tracked separately), this test
+// keeps the v3 path compliant without further code changes.
Review Comment:
Same as the manifest.go comment — most of this duplicates the PR body, and a
chunk of it ("the v2 schema is later fixed... this test keeps the v3 path
compliant") is only meaningful relative to #1038.
I'd trim to one sentence:
```go
// v3 manifests must not surface data_file.distinct_counts (deprecated by v3
spec;
// Java parity: apache/iceberg#12182). v2 round-trip will be added with
#1038.
```
##########
manifest.go:
##########
@@ -944,6 +944,19 @@ func (v3writerImpl) prepareEntry(entry *manifestEntry,
snapshotID int64) (Manife
}
}
+ // v3 spec deprecates data_file.distinct_counts (parity with Java
+ // apache/iceberg#12182). Today none of the per-version Avro schemas in
+ // internal/avro_schemas.go declare distinct_counts, so the field is
+ // already dropped on encode for every version. This guard is
+ // forward-compatibility insurance: if the v2 schema is later fixed to
+ // declare distinct_counts (v2 spec requires it; tracked separately),
the
+ // v3 path stays compliant by clearing the Avro-facing pointer here. The
+ // cached distinctCntMap and DistinctValueCounts() getter are left alone
+ // so in-memory readers in the same process keep their view.
Review Comment:
This 8-line comment is essentially the PR body. The reference to
`internal/avro_schemas.go` will rot (line numbers, file path) and the
schema-already-drops-it framing is context for the review, not for future
readers of the file.
I'd condense to two lines:
```go
// v3 spec deprecates data_file.distinct_counts (Java parity:
apache/iceberg#12182).
// Defensive: clears the Avro-facing pointer; cached distinctCntMap and
DistinctValueCounts() getter
// are intentionally preserved so in-process readers keep their view.
```
Link #1038 in the commit message rather than the source if you want a
forward pointer.
##########
manifest_test.go:
##########
@@ -2115,6 +2115,67 @@ func (m *ManifestTestSuite)
TestManifestRoundTripSortOrderID() {
m.Equal(expectedSortOrderID, *got)
}
+// writeManifestWithDistinctCounts builds a single-entry manifest at the given
+// version with distinct_counts populated for one column, round-trips it
through
+// WriteManifest + ReadManifest, and returns the distinct counts observed by
+// the reader.
+func (m *ManifestTestSuite) writeManifestWithDistinctCounts(version int)
map[int]int64 {
Review Comment:
The `version int` parameter is dead — only `3` is ever passed, and the v2
case the helper anticipates is correctly deferred to #1038.
I'd either inline this into `TestWriteManifestV3OmitsDistinctCounts` (it's
only ~20 lines, only one call site) or leave the helper as-is and add the v2
caller in the #1038 PR. Building the abstraction now without a second user
reads as premature.
While we're here: `snapshotID := 12345678`, `seqNum := 9876`, `recordCount:
10`, `fileSize: 10*1000` — none of those values are checked. A literal `1`
everywhere would be clearer noise.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]