paveon opened a new pull request, #1118: URL: https://github.com/apache/iceberg-go/pull/1118
Fixes https://github.com/apache/iceberg-go/issues/1117

### What changed

Reworked `removeSnapshotsUpdate.PostCommit` so that each unique manifest file is opened at most once per call, regardless of how many expired or retained snapshots reference it. Three steps, each deduplicated:

1) Build the set of manifest paths reachable from any retained snapshot, reading only manifest lists. Cache the resulting `[]ManifestFile` per snapshot so the retained-side pass in step 3 doesn't re-download each list.
2) Walk the expired snapshots' manifest lists; for each manifest, skip it if it is in the retained set (its data files are live by definition and the manifest itself must not be deleted) or if a prior expired snapshot already enumerated it. Otherwise read its entries once.
3) Subtract live data files via a single walk over each unique retained manifest. DELETED entries remain tombstones (unchanged from prior semantics).

### Behavior

Semantically equivalent to the previous implementation: the final `filesToDelete` set is the same on well-formed metadata. No spec change, no API change. The only difference is the I/O cost.

### Performance impact

For a 491-snapshot incremental-append table, expiring 490 snapshots previously triggered ~sum(1..490) ≈ 120,000 manifest-file downloads. The rewrite reduces that to roughly the number of unique orphaned manifests (a few hundred in practice). In our test, that is two to three orders of magnitude fewer object-store reads.
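
The numbered steps under "What changed" can be illustrated with a small, self-contained sketch. This is not the code from the PR: `snapshot`, `store`, `readManifest`, `filesToDelete`, and `demo` are hypothetical in-memory stand-ins, manifests are modeled as plain lists of data-file paths, and entry status (such as DELETED tombstones) is deliberately left out. The `reads` counter only exists to show that each manifest is opened at most once.

```go
package main

import (
	"fmt"
	"sort"
)

// snapshot is a hypothetical stand-in: it holds only the manifest paths
// that the snapshot's manifest list references.
type snapshot struct {
	id        int
	manifests []string
}

// store simulates an object store and counts how often each manifest is read.
type store struct {
	manifests map[string][]string // manifest path -> data-file paths
	reads     map[string]int
}

func (s *store) readManifest(path string) []string {
	s.reads[path]++
	return s.manifests[path]
}

// filesToDelete returns the orphaned manifests and orphaned data files.
func filesToDelete(s *store, retained, expired []snapshot) (orphanManifests, orphanData []string) {
	// Step 1: manifest paths reachable from retained snapshots (manifest lists only).
	live := map[string]bool{}
	for _, snap := range retained {
		for _, m := range snap.manifests {
			live[m] = true
		}
	}

	// Step 2: read each orphaned manifest once, skipping live or already-seen ones.
	seen := map[string]bool{}
	candidates := map[string]bool{}
	for _, snap := range expired {
		for _, m := range snap.manifests {
			if live[m] || seen[m] {
				continue
			}
			seen[m] = true
			orphanManifests = append(orphanManifests, m)
			for _, f := range s.readManifest(m) {
				candidates[f] = true
			}
		}
	}

	// Step 3: subtract live data files, reading each retained manifest once.
	for m := range live {
		for _, f := range s.readManifest(m) {
			delete(candidates, f)
		}
	}

	for f := range candidates {
		orphanData = append(orphanData, f)
	}
	sort.Strings(orphanData)
	return orphanManifests, orphanData
}

// demo builds a tiny made-up table: snapshots 1 and 2 expire, snapshot 3 is retained.
func demo() (*store, []snapshot, []snapshot) {
	s := &store{
		manifests: map[string][]string{
			"a.avro": {"f1.parquet", "f2.parquet"},
			"b.avro": {"f2.parquet", "f3.parquet"},
			"c.avro": {"f4.parquet"},
			"d.avro": {"f2.parquet", "f5.parquet"},
		},
		reads: map[string]int{},
	}
	expired := []snapshot{
		{id: 1, manifests: []string{"a.avro", "b.avro"}},
		{id: 2, manifests: []string{"a.avro", "b.avro", "c.avro"}},
	}
	retained := []snapshot{
		{id: 3, manifests: []string{"c.avro", "d.avro"}},
	}
	return s, retained, expired
}

func main() {
	s, retained, expired := demo()
	manifests, data := filesToDelete(s, retained, expired)
	fmt.Println("orphaned manifests:", manifests)
	fmt.Println("orphaned data files:", data)
	fmt.Println("reads per manifest:", s.reads)
}
```

In this toy table `a.avro` and `b.avro` are referenced by both expired snapshots but are read only once each, and `f2.parquet` survives because the retained manifest `d.avro` still references it.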
