laskoviymishka commented on code in PR #1118:
URL: https://github.com/apache/iceberg-go/pull/1118#discussion_r3288292110
##########
table/updates.go:
##########
@@ -506,6 +506,27 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx
context.Context, preTable *Table,
}
}
+ // Manifest paths kept alive by retained snapshots, plus their
+ // loaded manifest-file slices so the live-data-file pass below
+ // doesn't re-download each manifest list.
+ retainedManifests := make(map[string]struct{})
+ retainedSnapshotManifests := make(map[int64][]iceberg.ManifestFile)
Review Comment:
the snapshot-id key here is never read — the only consumer below ranges `for
_, mans := range retainedSnapshotManifests`. I'd drop the map and either
flatten while we're collecting or keep a `[][]iceberg.ManifestFile` so the
walkedRetained pass below isn't paying for a map keyed on a field nobody uses.
while we're here, the inner loop already gives us each `man` once per
`(snap, man)` pair — we could just dedupe by `man.FilePath()` during this pass
and hand the second pass a flat `[]iceberg.ManifestFile` of unique manifests,
which also removes the need for `walkedRetained` below. wdyt?
##########
table/updates.go:
##########
@@ -506,6 +506,27 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx
context.Context, preTable *Table,
}
}
+ // Manifest paths kept alive by retained snapshots, plus their
Review Comment:
small nit: "doesn't re-download each manifest list" reads like the whole
function skips manifest-list reads, but the expired loop below still calls
`snap.Manifests(prefs)` per expired snapshot. I'd tighten to something like `//
Preload retained manifest lists once so the live-data-file pass below can walk
them without re-fetching.` to keep the scope honest.
##########
table/updates.go:
##########
@@ -529,14 +560,17 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx
context.Context, preTable *Table,
}
}
- for _, snap := range postTable.Metadata().Snapshots() {
- mans, err := snap.Manifests(prefs)
- if err != nil {
- return err
- }
-
+ // Keep files still referenced (non-DELETED) by retained manifests,
+ // including data files carried forward as EXISTING by manifest
+ // merges. Each retained manifest is walked at most once.
+ walkedRetained := make(map[string]struct{}, len(retainedManifests))
Review Comment:
minor perf regression in the "nothing to delete" path: when `filesToDelete`
is empty (no expired snapshots had unique manifests), this pass still opens
every retained manifest and iterates entries to call `delete` on a no-op key.
I'd guard the whole retained walk with `if len(filesToDelete) > 0`. Restores
the short-circuit and keeps the PR's spirit of "open each manifest at most
once, and only when it matters".
##########
table/updates.go:
##########
@@ -506,6 +506,27 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx
context.Context, preTable *Table,
}
}
+ // Manifest paths kept alive by retained snapshots, plus their
+ // loaded manifest-file slices so the live-data-file pass below
+ // doesn't re-download each manifest list.
+ retainedManifests := make(map[string]struct{})
+ retainedSnapshotManifests := make(map[int64][]iceberg.ManifestFile)
+ for _, snap := range postTable.Metadata().Snapshots() {
+ mans, err := snap.Manifests(prefs)
+ if err != nil {
+ return err
+ }
+ retainedSnapshotManifests[snap.SnapshotID] = mans
+ for _, man := range mans {
+ retainedManifests[man.FilePath()] = struct{}{}
+ }
+ }
+
+ // Open each orphaned manifest at most once: skip manifests that
+ // retained snapshots still reference, and dedupe across expired
+ // snapshots that share manifests by reference.
+ visitedManifests := make(map[string]struct{})
Review Comment:
the dedup invariant this PR is built on isn't covered by any test.
`TestRemoveSnapshotsPostCommitDeletesStatisticsFiles` and
`TestRemoveSnapshotsPostCommitPreservesStatisticsOfSurvivingSnapshots` both use
snapshots with empty manifest lists, so `snap.Manifests()` returns `[]` and
neither of the loops you're changing here ever executes.
Adding two cases against a trackingIO-style fs would lock this in: (a)
manifest M shared between an expired snapshot and a retained one, asserting M
and its data files survive; (b) manifest M shared between two expired
snapshots, asserting M+its data files are deleted exactly once and `Entries()`
is called once. The second one also pins the perf claim.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]