laskoviymishka commented on code in PR #985:
URL: https://github.com/apache/iceberg-go/pull/985#discussion_r3188889334
##########
manifest.go:
##########
@@ -463,6 +481,13 @@ type ManifestFile interface {
HasAddedFiles() bool
// HasExistingFiles returns true if ExistingDataFiles > 0 or if it was
null.
HasExistingFiles() bool
+ // Entries streams the manifest entries from the manifest file using
+ // the provided file system IO interface. Entries that have been
+ // marked as deleted are skipped if discardDeleted is true.
+ //
+ // Prefer Entries over FetchEntries when walking large manifests
+ // since it avoids loading every entry into memory at once.
+ Entries(fs iceio.IO, discardDeleted bool) iter.Seq2[ManifestEntry,
error]
Review Comment:
Adding `Entries` to the public `ManifestFile` interface is a breaking change
for any external implementer or mock — they'll fail to compile against the next
module bump. That's probably acceptable for a 0.x module, but I'd want it
called out: at minimum a release-note entry, plus a `// Deprecated:` line on
`FetchEntries` if the intent is to phase it out.
If we'd rather keep the interface stable, `Entries` could live as a method
on the concrete `*manifestFile` only, or as a free helper
`iceberg.IterManifestEntries(m, fs, discardDeleted)`. What's the preferred
direction here?
##########
manifest.go:
##########
@@ -339,8 +340,35 @@ func (m *manifestFile) FirstRowID() *int64 { return
m.FirstRowIDValue }
func (m *manifestFile) HasAddedFiles() bool { return m.AddedFilesCount != 0
}
func (m *manifestFile) HasExistingFiles() bool { return m.ExistingFilesCount
!= 0 }
-func (m *manifestFile) FetchEntries(fs iceio.IO, discardDeleted bool)
([]ManifestEntry, error) {
- return fetchManifestEntries(m, fs, discardDeleted)
+
+func (m *manifestFile) Entries(fs iceio.IO, discardDeleted bool)
iter.Seq2[ManifestEntry, error] {
+ return func(yield func(ManifestEntry, error) bool) {
+ f, err := fs.Open(m.FilePath())
+ if err != nil {
+ yield(nil, err)
+
+ return
+ }
+ defer func() {
+ _ = f.Close()
Review Comment:
The streaming path drops Close errors here (`_ = f.Close()`), while
`FetchEntries` below uses `internal.CheckedClose` to surface them via the named
return. For object-store IO wrappers that flush metrics or surface deferred
errors via Close, that's a real signal we'd lose as callers migrate from one to
the other.
Could we either yield the close error as a final synthetic `(nil, cerr)`
pair, or at minimum call out the asymmetry in both godocs? Same applies to `_ =
manifestReader.Close()` inside `IterManifest`.
##########
manifest.go:
##########
@@ -751,33 +776,54 @@ func (c *ManifestReader) ReadEntry() (ManifestEntry,
error) {
return tmp, nil
}
+// IterManifest returns an iterator that streams manifest entries from
+// the provided reader without buffering them. If discardDeleted is true,
+// entries whose status is "deleted" are skipped.
+func IterManifest(m ManifestFile, f io.Reader, discardDeleted bool)
iter.Seq2[ManifestEntry, error] {
Review Comment:
`IterManifest` ends up with one in-tree caller (`ReadManifest`), and the
method form `Entries` already owns the file lifecycle for everyone else.
Exporting it locks in a third public way to read entries (alongside `Entries`
and `FetchEntries`) without an obvious external use case — and the doc-comment
doesn't pin down the contract (single-shot? error yields once and stops? close
errors swallowed?).
Could we lower-case to `iterManifest` unless there's a known consumer that
needs the bare-`io.Reader` form? If it stays exported, worth spelling out those
guarantees in the godoc.
##########
cmd/iceberg/output.go:
##########
@@ -150,12 +150,11 @@ func (t textOutput) Files(tbl *table.Table, history bool)
{
snapshotTree = append(snapshotTree,
pterm.LeveledListItem{
Level: 1, Text: "Manifest: " + m.FilePath(),
})
- datafiles, err := m.FetchEntries(afs, false)
- if err != nil {
- t.Error(err)
- os.Exit(1)
- }
- for _, e := range datafiles {
+ for e, err := range m.Entries(afs, false) {
+ if err != nil {
+ t.Error(err)
+ os.Exit(1)
Review Comment:
With the slice-based `FetchEntries` the file was already closed before this
branch ran. Now the file is open for the duration of iteration, so `os.Exit(1)`
here skips the deferred `f.Close()` inside `Entries`. Cosmetic for a
short-lived CLI but a new leak shape worth fixing while we're here — break out
of the loop and exit after the range, or refactor `Files` to return an error.
##########
table/scanner.go:
##########
@@ -153,13 +153,12 @@ func GetPartitionRecord(dataFile iceberg.DataFile,
partitionType *iceberg.Struct
func openManifest(io io.IO, manifest iceberg.ManifestFile,
partitionFilter, metricsEval func(iceberg.DataFile) (bool, error),
) ([]iceberg.ManifestEntry, error) {
- entries, err := manifest.FetchEntries(io, true)
- if err != nil {
- return nil, err
- }
+ var out []iceberg.ManifestEntry
Review Comment:
We lost the `make([]iceberg.ManifestEntry, 0, len(entries))` preallocation
here. `openManifest` is on the scan hot path, so for wide manifests this trades
one fixed allocation for several `append` regrowths.
Worth pre-sizing from the manifest counts — `int(manifest.AddedDataFiles())
+ int(manifest.ExistingDataFiles())` is a defensible upper bound since
`discardDeleted=true`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]