twuebi opened a new pull request, #1075:
URL: https://github.com/apache/iceberg-go/pull/1075
Adds on-the-wire encoding for DataFile and FileScanTask to enable
distributed compaction. With #1033, we decomposed compaction into a coordinator
and worker portion. This enables plumbing them together over the wire.
A FileScanTask needs to go coordinator -> worker (data file, delete files, scan
range + v3 row lineage), and each DataFile goes worker -> coordinator, which
collects them for the commit.
iceberg.EncodeDataFile / DecodeDataFile reuse the manifest-entry Avro
encoding so the bytes on the wire are the same bytes a manifest carries. The
dataFile struct's avro tags remain the single source of truth. Adding a field
to DataFile extends what the helpers transport, with no wire mirror to keep in
sync. Encoding is non-mutating and thread-safe: a fresh *dataFile is cloned via
reflection over avro tags, so the source is untouched.
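A minimal sketch of the clone-by-tag idea described above, not the PR's actual code. The `file` struct, its fields, and `cloneTagged` are hypothetical stand-ins for the unexported `dataFile` and its helper. It assumes only exported fields carrying an `avro` tag are transported and that maps are copied so the source stays untouched:

```go
package main

import (
	"fmt"
	"reflect"
)

// file is a hypothetical stand-in for iceberg-go's unexported dataFile struct.
type file struct {
	Path      string         `avro:"file_path"`
	Records   int64          `avro:"record_count"`
	Partition map[string]any `avro:"partition"`
	cached    string         // no avro tag: local state, never transported
}

// cloneTagged copies every exported, avro-tagged field of src into a fresh
// value. Maps are copied entry by entry so later writes to the clone do not
// reach the source (map values themselves are still shared).
func cloneTagged(src *file) *file {
	dst := &file{}
	sv := reflect.ValueOf(src).Elem()
	dv := reflect.ValueOf(dst).Elem()
	for i := 0; i < sv.NumField(); i++ {
		sf := sv.Type().Field(i)
		if _, ok := sf.Tag.Lookup("avro"); !ok || !sf.IsExported() {
			continue // untagged or unexported: skip
		}
		field := sv.Field(i)
		if field.Kind() == reflect.Map && !field.IsNil() {
			m := reflect.MakeMapWithSize(field.Type(), field.Len())
			iter := field.MapRange()
			for iter.Next() {
				m.SetMapIndex(iter.Key(), iter.Value())
			}
			dv.Field(i).Set(m)
			continue
		}
		dv.Field(i).Set(field)
	}
	return dst
}

func main() {
	src := &file{Path: "a.parquet", Records: 3, Partition: map[string]any{"day": 1}, cached: "x"}
	dst := cloneTagged(src)
	dst.Partition["day"] = 2 // mutate the clone only
	fmt.Println(src.Partition["day"], dst.Partition["day"], dst.cached == "")
}
```

Because the loop is driven by the struct's own tags, a newly added tagged field is picked up without touching the helper, which is the property the description relies on.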
table.EncodeFileScanTask / DecodeFileScanTask layer on top: each embedded
DataFile is iceberg-encoded, then wrapped alongside the scan range and v3 row
lineage in a small Avro envelope.
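The layering can be illustrated with a small sketch. Everything here is an assumption made for illustration: `taskEnvelope`, its field names, and the two helpers are hypothetical, and `encoding/json` is swapped in for Avro purely to keep the example within the standard library. The point is only the shape: each data file arrives as opaque, already-encoded bytes and is wrapped next to the scan range and the optional v3 row-lineage value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskEnvelope is a hypothetical wire shape, not the PR's actual Avro record.
type taskEnvelope struct {
	DataFile    []byte   `json:"data_file"`              // pre-encoded data file
	DeleteFiles [][]byte `json:"delete_files"`           // one blob per delete file
	Start       int64    `json:"start"`                  // scan range start offset
	Length      int64    `json:"length"`                 // scan range length
	FirstRowID  *int64   `json:"first_row_id,omitempty"` // v3 row lineage; nil if absent
}

// encodeEnvelope serializes the envelope; the inner blobs are not re-parsed.
func encodeEnvelope(e taskEnvelope) ([]byte, error) {
	return json.Marshal(e)
}

// decodeEnvelope restores the envelope; the caller decodes the inner blobs.
func decodeEnvelope(b []byte) (taskEnvelope, error) {
	var e taskEnvelope
	err := json.Unmarshal(b, &e)
	return e, err
}

func main() {
	rowID := int64(0) // a zero row ID is valid and must survive the round trip
	wire, err := encodeEnvelope(taskEnvelope{
		DataFile:    []byte("encoded-data-file"),
		DeleteFiles: [][]byte{[]byte("encoded-delete-file")},
		Start:       0,
		Length:      4096,
		FirstRowID:  &rowID,
	})
	if err != nil {
		panic(err)
	}
	got, err := decodeEnvelope(wire)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(got.DataFile), got.Length, got.FirstRowID != nil)
}
```

Using a pointer for the lineage field keeps "absent" distinct from "zero", which matters when v1/v2 tasks and v3 tasks share one envelope.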
Design notes:
- The receiver supplies (spec, schema, version) out of band. Both sides in
the distributed-compaction design already hold table metadata, and the
per-(specID, version) avro schema is cached. Happy to switch to a
self-describing payload if preferred.
- distinct_counts is not transported because the iceberg-go manifest-entry
schema doesn't carry it on any version. Callers that need it must transport it
separately.
- Reflection runs once per encode and is not on any hot path I can see. Happy
to precompute the avro field index at init if preferred.
- An anonymous `var _ = FileScanTask{...}` literal next to the codec is a
compile-time drift guard: adding/retyping/reordering a FileScanTask field
breaks the build, forcing a deliberate call on whether it must cross the wire.
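The drift guard in the last note can be sketched as follows. The `DataFile` and `FileScanTask` types here are simplified, hypothetical stand-ins (the real structs have more fields), and the guard in the PR may be written differently. The sketch relies on a Go language rule: an unkeyed composite literal must supply exactly one value per field, in declaration order, each assignable to its field.

```go
package main

import "fmt"

// DataFile and FileScanTask are simplified, hypothetical stand-ins for the
// iceberg-go types of the same name.
type DataFile interface{ FilePath() string }

type FileScanTask struct {
	File   DataFile
	Start  int64
	Length int64
}

// Compile-time drift guard. Because the literal is unkeyed:
//   - adding a field fails with "too few values in struct literal",
//   - retyping a field fails because the explicitly typed value no longer
//     fits (e.g. int64(0) is not assignable to int32),
//   - moving a field past one of a different type fails the same way.
// The value is discarded; only the type check matters.
var _ = FileScanTask{
	DataFile(nil), // File
	int64(0),      // Start
	int64(0),      // Length
}

func main() {
	task := FileScanTask{nil, 0, 128}
	fmt.Println(task.Start, task.Length)
}
```

One limitation of this sketch: swapping two adjacent fields of the same type (here `Start` and `Length`) still compiles, so the guard catches such a reorder only when the field types differ.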
Format versions 1, 2, and 3 are supported. No change to existing manifest
read/write paths.
Tests: round-trip across v1/v2/v3 with fully populated DataFiles,
foreign-impl rejection, partition-data idempotence, and FileScanTask shape
including v3 row lineage.
cc @laskoviymishka: this continues the building blocks for distributed compaction.