zeroshade opened a new pull request, #1114:
URL: https://github.com/apache/iceberg-go/pull/1114
Restores main CI to green by reverting only the test-expectation byte counts
that #1102 ("fix(manifest): stop writing deprecated distinct_counts")
inadvertently changed.
## What broke
After #1102 merged to main, both the `Go` and `Audit and Verify` workflows
started failing on every push, across all 6 matrix entries
(ubuntu/windows/macos × Go 1.25.5/1.26.1):
```
--- FAIL: TestTableWriting (10.16s)
--- FAIL: TestTableWriting/TestAddFilesPartitionedTable
--- FAIL: TestTableWriting/TestAddFilesUnpartitioned
--- FAIL: TestTableWriting/TestReplaceDataFiles
--- FAIL: TestTableWriting/TestAddFilesPartitionedTable#01
--- FAIL: TestTableWriting/TestAddFilesUnpartitioned#01
--- FAIL: TestTableWriting/TestReplaceDataFiles#01
--- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch (0.00s)
--- FAIL: TestPositionDeletePartitionedFanoutWriterProcessBatch/success
--- FAIL:
TestPositionDeletePartitionedFanoutWriterProcessBatch/batch_with_records_having_different_file_paths
```
with diffs like:
```
- "added-files-size": "3590" (actual on CI / arrow-go v18.6.0)
+ "added-files-size": "3070" (test source, post-#1102)
- ColumnSizes: 2147483545:88 (actual on CI / arrow-go v18.6.0)
+ ColumnSizes: 2147483545:86 (test source, post-#1102)
```
## Root cause
These two test files assert on **exact byte counts** of the on-disk parquet
files written by `arrow-go`. Those byte counts depend on the parquet writer's
metadata encoding, which differs across arrow-go versions.
#1102 lowered the expected counts (3590→3070, 1066→963, 1816→2132+→1816,
4264→3687, 88→86, 174→172, 96→94, 187→185). Those new values only reproduce
against an unreleased arrow-go build (presumably wired in via a local
`go.work`). Against the pinned `github.com/apache/arrow-go/v18 v18.6.0` from
`go.sum` — which every CI runner resolves — the writer still emits the original
larger sizes, so the assertions fail every time.
The PR's own CI runs were `CANCELLED` before completion, so this didn't
surface at merge time.
The functional change in #1102 (dropping `distinct_counts` field-id 111 from
the manifest entry Avro schema) only alters **manifest** serialization, not the
**data** parquet files whose sizes these `Summary`/`ColumnSizes` fields tally —
so the test-value updates were always unrelated to the production fix and can
be safely reverted on their own.
## Fix
Revert only the eight byte-count lines back to the pre-#1102 values verified
via `git show 51b3140^`:
| File | Test | Field | Before | #1102 | This PR |
|---|---|---|---|---|---|
| `table/table_test.go:549,554` | `TestAddFilesUnpartitioned` |
`added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:770,776` | `TestAddFilesPartitionedTable` |
`added-/total-files-size` | 3590 | 3070 | **3590** |
| `table/table_test.go:1136,1140,1144` | `TestReplaceDataFiles` |
`added-/removed-/total-files-size` | 1066/2132/4264 | 963/1816/3687 |
**1066/2132/4264** |
| `table/pos_delete_partitioned_fanout_writer_test.go:77` | `success` |
`ColumnSizes` | 88,174 | 86,172 | **88,174** |
| `table/pos_delete_partitioned_fanout_writer_test.go:87` |
`batch_with_records_having_different_file_paths` | `ColumnSizes` | 96,187 |
94,185 | **96,187** |
No production code changes — #1102's manifest fix is preserved intact.
## Verification
Reproduced CI failure locally with `GOWORK=off go test ./...` (forces use of
pinned v18.6.0):
- **Before this PR**: same 8 sub-test failures with identical
expected/actual diffs as CI.
- **After this PR**: `ok` across every package — `iceberg-go`,
`catalog/{glue,hadoop,hive,rest,sql}`, `cmd/iceberg`, `codec`, `config`,
`internal`, `io`, `io/gocloud`, `puffin`, `table`,
`table/{compaction,dv,internal,substrait}`, `view`, `view/internal`.
- `go vet ./...` clean, `go build ./...` clean, LSP diagnostics clean on
both touched files.
## Followup (out of scope)
These tests are inherently brittle — they'll keep breaking on every arrow-go
bump that nudges parquet metadata encoding. A future cleanup could replace
exact byte assertions with bounds (`> 0`, monotonic relationships between
added/removed/total) or assert on row counts only. Not addressing here to keep
the diff minimal and unblock main.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]