eholmer-pltr opened a new pull request, #16833: URL: https://github.com/apache/iceberg/pull/16833
## What

`HistoryTable`, `SnapshotsTable`, `MetadataLogEntriesTable`, `RefsTable`, `ManifestsTable`, `AllManifestsTable`, and `PartitionsTable` build a `StaticDataTask` whose rows are materialized from in-memory `TableMetadata` (or from pre-walked manifest data). The `DataFile` these tasks wrap exists only so `FileScanTask.file()` can return a path for identification; the file bytes are never read.

The previous `StaticDataTask.of(InputFile, ...)` entry point built that `DataFile` via `DataFiles.Builder.withInputFile`, which calls `InputFile.getLength()` to populate `fileSizeInBytes`. On object-store `FileIO` implementations that is a `HEAD`/`GetObject` round-trip against the table's `metadata.json`: one extra request on every metadata-table scan, for a value that is never consulted on the `DataTask` read path.

## Change

Replace `StaticDataTask.of(InputFile, ...)` with `StaticDataTask.of(String location, ...)`, which builds the synthetic `DataFile` via `withPath(location)` + `withFileSizeInBytes(0L)`. The seven metadata tables and `TestDataTaskParser` are migrated to it, and the old overload and its private constructor (the only remaining `getLength()` call site) are removed, so no construction path performs I/O against the synthetic location.

`FileScanTask.length()` for these tasks now returns `0` instead of the `metadata.json` size.

## Why this is safe

`fileSizeInBytes` is never read for a `DataTask`. Engines branch on `ScanTask.isDataTask()` and obtain rows via `DataTask.rows()`; any file-size-aware logic (buffer sizing, split planning) lives on the `!isDataTask()` path, e.g. `RowDataReader.open` in iceberg-spark and `DataTaskReader.open` in iceberg-flink.

There is existing precedent in the same area: `AllManifestsTable.ManifestListReadTask.length()` already returns a hard-coded `8192` with the comment _"return a generic length to avoid looking up the actual length"_.

`StaticDataTask` is package-private, so this is not a public API change (no revapi impact).

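For reviewers who want the difference in isolation, here is a minimal, self-contained sketch of the two construction shapes described above. It does not use the Iceberg classes: `SketchInputFile`, `CountingInputFile`, `SketchDataFile`, `StaticTaskSketch`, and the `s3://` path are all hypothetical stand-ins invented for illustration. The only behavior it models is that deriving the descriptor from an input file triggers a length lookup, while deriving it from a location string does not.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; these are NOT the Iceberg classes.
interface SketchInputFile {
  String location();

  long getLength(); // on an object store this would be a remote request
}

// Counts getLength() calls so the extra round-trip becomes observable.
final class CountingInputFile implements SketchInputFile {
  private final String location;
  private final long length;
  final AtomicInteger lengthCalls = new AtomicInteger();

  CountingInputFile(String location, long length) {
    this.location = location;
    this.length = length;
  }

  @Override
  public String location() {
    return location;
  }

  @Override
  public long getLength() {
    lengthCalls.incrementAndGet();
    return length;
  }
}

// Minimal model of the synthetic file descriptor a static task wraps.
final class SketchDataFile {
  final String path;
  final long fileSizeInBytes;

  SketchDataFile(String path, long fileSizeInBytes) {
    this.path = path;
    this.fileSizeInBytes = fileSizeInBytes;
  }
}

final class StaticTaskSketch {
  // Old shape: building from an input file asks it for its length.
  static SketchDataFile fromInputFile(SketchInputFile file) {
    return new SketchDataFile(file.location(), file.getLength());
  }

  // New shape: only the location is needed; the size is a fixed 0 and no lookup happens.
  static SketchDataFile fromLocation(String location) {
    return new SketchDataFile(location, 0L);
  }

  public static void main(String[] args) {
    // Assumed example path; any string works because nothing is opened.
    CountingInputFile metadata =
        new CountingInputFile("s3://bucket/table/metadata/v1.metadata.json", 4096L);

    SketchDataFile before = fromInputFile(metadata);
    System.out.println("lookups after fromInputFile: " + metadata.lengthCalls.get());

    SketchDataFile after = fromLocation(metadata.location());
    System.out.println("lookups after fromLocation: " + metadata.lengthCalls.get());
    System.out.println("same path: " + before.path.equals(after.path));
  }
}
```

In this sketch the path is preserved in both shapes and only the size field differs, which mirrors the claim above that the size is the one value the `DataTask` read path never consults.
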
## Related discussion

This is a small, concrete step in a direction the community has been discussing: reducing engines' dependence on the root `metadata.json` and the code paths that read it directly.

- [DISCUSS] metadata.json in v4? (dev@, Anton Okolnychyi), on making the root `metadata.json` optional: https://lists.apache.org/thread/l1onvv5p5cq2vtql9g8w5fbxc1hhd65l
- [DISCUSS] Offloading Snapshots from Metadata.json (dev@), which includes removing engine paths that read the metadata file directly: https://lists.apache.org/thread/5f5p4rcxyq7j6rt7yts68shoqmm75478

See also Yufei Gu's doc tracking clients/engines with hard dependencies on the metadata file in storage, shared in that thread: https://docs.google.com/document/d/17PBhJ0IBxHxMKvCW6CstGOp7cZnboMDdpO6BCPO2kmA/edit

The metadata tables never need any bytes from `metadata.json`, only a path for identification, yet today they issue a `HEAD` against it solely to populate an unused size field. This removes that one unnecessary read; it doesn't attempt the broader changes proposed in those threads.

## Testing

- New `TestStaticDataTask` pins the contract: a bogus, never-created location is tolerated with no I/O, `length()` and `DataFile.fileSizeInBytes()` are `0`, format is `METADATA`, and the supplied path is preserved for identification.
- `TestDataTaskParser` is migrated to the string overload; the serialized round-trip is byte-identical (the previous `Files.localInput(...)` already reported size `0` and stripped the URI scheme).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
