eholmer-pltr opened a new pull request, #16833:
URL: https://github.com/apache/iceberg/pull/16833

   ## What
   
   `HistoryTable`, `SnapshotsTable`, `MetadataLogEntriesTable`, `RefsTable`, 
`ManifestsTable`, `AllManifestsTable`, and `PartitionsTable` build a 
`StaticDataTask` whose rows are materialized from in-memory `TableMetadata` (or 
from pre-walked manifest data). The `DataFile` these tasks wrap exists only so 
`FileScanTask.file()` can return a path for identification — the file bytes are 
never read.
   
   The previous `StaticDataTask.of(InputFile, ...)` entry point built that 
`DataFile` via `DataFiles.Builder.withInputFile`, which calls 
`InputFile.getLength()` to populate `fileSizeInBytes`. On object-store `FileIO` 
implementations that is a `HEAD`/`GetObject` round-trip against the table's 
`metadata.json` — one extra request on every metadata-table scan, for a value 
that is never consulted on the `DataTask` read path.
   
   ## Change
   
   Replace `StaticDataTask.of(InputFile, ...)` with `StaticDataTask.of(String 
location, ...)`, which builds the synthetic `DataFile` via `withPath(location)` 
+ `withFileSizeInBytes(0L)`. The seven metadata tables and `TestDataTaskParser` 
are migrated to it, and the old overload and its private constructor — the only 
remaining `getLength()` call site — are removed, so no construction path 
performs I/O against the synthetic location. `FileScanTask.length()` for these 
tasks now returns `0` instead of the `metadata.json` size.
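
   The difference between the two construction paths can be sketched without any Iceberg dependency. Everything below is a hypothetical stand-in, not the actual `StaticDataTask` or `DataFiles.Builder` code: `CountingInputFile` models an object-store input file whose `getLength()` would be a remote request, `SyntheticFile` models the placeholder `DataFile`, and the `s3://` path is made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SyntheticFileSketch {
  /** Hypothetical stand-in for an object-store input file; counts length lookups. */
  static final class CountingInputFile {
    private final String location;
    private final long length;
    final AtomicInteger lengthCalls = new AtomicInteger();

    CountingInputFile(String location, long length) {
      this.location = location;
      this.length = length;
    }

    String location() {
      return location;
    }

    long getLength() {
      // On an object store this is where the remote round-trip would happen.
      lengthCalls.incrementAndGet();
      return length;
    }
  }

  /** Hypothetical stand-in for the synthetic DataFile: a path plus a size field. */
  static final class SyntheticFile {
    final String path;
    final long fileSizeInBytes;

    SyntheticFile(String path, long fileSizeInBytes) {
      this.path = path;
      this.fileSizeInBytes = fileSizeInBytes;
    }
  }

  /** Old shape: the size comes from the input file, costing one length lookup. */
  static SyntheticFile fromInputFile(CountingInputFile input) {
    return new SyntheticFile(input.location(), input.getLength());
  }

  /** New shape: only the location string is needed; the size is fixed at 0. */
  static SyntheticFile fromLocation(String location) {
    return new SyntheticFile(location, 0L);
  }

  public static void main(String[] args) {
    // Illustrative path only; nothing is created or read at this location.
    String location = "s3://bucket/tbl/metadata/v3.metadata.json";
    CountingInputFile input = new CountingInputFile(location, 4096L);

    SyntheticFile viaInput = fromInputFile(input);
    SyntheticFile viaLocation = fromLocation(location);

    System.out.println("length lookups via input file: " + input.lengthCalls.get());
    System.out.println("size via input file: " + viaInput.fileSizeInBytes);
    System.out.println("size via location: " + viaLocation.fileSizeInBytes);
  }
}
```

   In this sketch both paths preserve the location for identification; only the input-file path touches `getLength()`, which is the call the PR removes.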
   
   ## Why this is safe
   
   `fileSizeInBytes` is never read for a `DataTask`. Engines branch on 
`ScanTask.isDataTask()` and obtain rows via `DataTask.rows()`; any 
file-size-aware logic (buffer sizing, split planning) lives on the 
`!isDataTask()` path — e.g. `RowDataReader.open` in iceberg-spark and 
`DataTaskReader.open` in iceberg-flink.
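
   The branching described above can be illustrated with a simplified model. The types here (`Task`, `StaticRowsTask`, `FileTask`) and the 64 KiB buffer cap are assumptions made for the sketch; they are not the real `ScanTask`, `DataTask`, or `FileScanTask` interfaces, and no engine reader is claimed to look exactly like this.

```java
import java.util.Arrays;
import java.util.List;

class DataTaskDispatchSketch {
  /** Hypothetical, simplified stand-in for a scan task. */
  interface Task {
    boolean isDataTask();

    long length();
  }

  /** Models a task whose rows are already materialized in memory. */
  static final class StaticRowsTask implements Task {
    private final List<String> rows;

    StaticRowsTask(List<String> rows) {
      this.rows = rows;
    }

    public boolean isDataTask() {
      return true;
    }

    public long length() {
      return 0L; // placeholder size, never consulted on the rows() path
    }

    List<String> rows() {
      return rows;
    }
  }

  /** Models a task backed by a real file whose size matters to the reader. */
  static final class FileTask implements Task {
    private final long length;

    FileTask(long length) {
      this.length = length;
    }

    public boolean isDataTask() {
      return false;
    }

    public long length() {
      return length;
    }
  }

  /** Branches the way the text describes: rows() for data tasks, size-aware logic otherwise. */
  static String describeRead(Task task) {
    if (task.isDataTask()) {
      // Rows come straight from memory; length() is not called on this branch.
      return "rows:" + ((StaticRowsTask) task).rows().size();
    }
    // Only the file-backed branch uses the size, here to cap a read buffer (assumed 64 KiB).
    long buffer = Math.min(task.length(), 64 * 1024);
    return "buffer:" + buffer;
  }

  public static void main(String[] args) {
    System.out.println(describeRead(new StaticRowsTask(Arrays.asList("a", "b"))));
    System.out.println(describeRead(new FileTask(1000L)));
  }
}
```

   Because the data-task branch returns before any size-dependent code runs, changing the placeholder size from the `metadata.json` length to `0` has no effect on it in this model.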
   
   There is existing precedent in the same area: 
`AllManifestsTable.ManifestListReadTask.length()` already returns a hard-coded 
`8192` with the comment _"return a generic length to avoid looking up the 
actual length"_.
   
   `StaticDataTask` is package-private, so this is not a public API change (no 
revapi impact).
   
   ## Related discussion
   
   This is a small, concrete step in a direction the community has been 
discussing — reducing engines' dependence on the root `metadata.json` and the 
code paths that read it directly:
   
   - [DISCUSS] metadata.json in v4? (dev@, Anton Okolnychyi) — making the root 
`metadata.json` optional: 
https://lists.apache.org/thread/l1onvv5p5cq2vtql9g8w5fbxc1hhd65l
   - [DISCUSS] Offloading Snapshots from Metadata.json (dev@) — includes 
removing engine paths that read the metadata file directly: 
https://lists.apache.org/thread/5f5p4rcxyq7j6rt7yts68shoqmm75478
   
   See also Yufei Gu's doc tracking clients/engines with hard dependencies on 
the metadata file in storage, shared in that thread: 
https://docs.google.com/document/d/17PBhJ0IBxHxMKvCW6CstGOp7cZnboMDdpO6BCPO2kmA/edit
   
   The metadata tables never need any bytes from `metadata.json`, only a path 
for identification, yet today they issue a `HEAD`/`GetObject` request against it 
solely to populate an unused size field. This PR removes that one unnecessary 
request; it does not attempt the broader changes proposed in those threads.
   
   ## Testing
   
   - New `TestStaticDataTask` pins the contract: a bogus, never-created 
location is tolerated with no I/O, `length()` and `DataFile.fileSizeInBytes()` 
are `0`, format is `METADATA`, and the supplied path is preserved for 
identification.
   - `TestDataTaskParser` is migrated to the string overload; the serialized 
round-trip is byte-identical (the previous `Files.localInput(...)` already 
reported size `0` and stripped the URI scheme).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

