tanmayrauth commented on PR #1099:
URL: https://github.com/apache/iceberg-go/pull/1099#issuecomment-4505129784
Thanks for the thorough pass, addressed all 12 inline comments. Going
through them in order:
**1. `metadata_columns.go` : `SchemaWithRowID` panics on duplicate +
aliases the input slice.**
Renamed to `SchemaWithRowLineage`. `slices.Clone`'d the field slice up
front so the returned schema cannot share a backing array with the input, and
the function skips appending when a reserved field ID (`_row_id` or
`_last_updated_sequence_number`) is already present. New test
`TestProjectionV3SchemaAlreadyHasRowID` locks in the no-panic / no-duplicate-id
behavior.
**2. `table/transaction.go` : `_last_updated_sequence_number` left null
breaks CDC semantics.**
Agreed, fixed. `rewriteSingleFile` now projects
`_last_updated_sequence_number` via `SchemaWithRowLineage`, and
`scanTask.DataSequenceNumber` is wired in from the source manifest entry's
`FileSequenceNum` (threaded through a new `fileSeqByPath` map returned from
`classifyFilesForFilteredDeletions`). The arrow scanner's
`synthesizeRowLineageColumns` keeps any explicit per-row value from the source
file and only synthesizes from the file's `data_sequence_number` for null
source rows — so a pure CoW rewrite preserves the original "last logically
updated" sequence per row instead of stamping the new commit's sequence onto
every survivor. `TestCoWRewritePreservesRowID` was updated to assert this
directly.
**3. `table/rewrite_data_files.go` : `hasRowLineage` permits mixed
groups.**
Tightened to `allTasksHaveRowLineage` (renamed function in the other
thread too). When even one task in the group lacks `FirstRowID`, the entire
group skips the lineage path — avoids the "some explicit `_row_id`s, some
nulls" output you flagged. Splitting mixed groups so each side rewrites with
its own semantics is the cleaner long-term fix but requires changing
`ExecuteCompactionGroup`'s output shape ,left as follow-up; called it out in
the gating comment so we don't lose the thread.
**4. `table/table.go` : `WithRowLineage` doc says "ignored on non-v3" but
impl projects regardless.**
`Projection()` now returns an explicit error when `includeRowLineage` is
set on a v1/v2 table, and the `WithRowLineage` doc was rewritten to match. New
test `TestProjectionWithRowLineageRequiresV3` covers it.
**5. `table/table.go` : `WithRowLineage` name suggests whole-feature
control.**
Went with the "extend it" branch you suggested: `SchemaWithRowLineage`
(the helper `WithRowLineage` calls) now projects both `_row_id` and
`_last_updated_sequence_number`. So one switch ↔ both lineage columns ↔
preservation through rewrite. Kept the name `WithRowLineage` since the
semantics now match it.
**6. `table/scanner.go` : losing the `Select("_row_id")` escape hatch.**
Restored. `Projection()` now intercepts row-lineage column names
(case-aware) in the user's `selectedFields`, strips them before
`curSchema.Select`, and appends the corresponding `RowID()` /
`LastUpdatedSequenceNumber()` fields back onto the projected schema — but only
on v3 tables. On v1/v2 the names fall through to `curSchema.Select`, which
fails as before with `ErrInvalidSchema`. Restored
`TestProjectionV3SelectRowLineageColumns` and
`TestProjectionRowLineageRejectedOnV1V2` to lock both behaviors in.
**7. `table/transaction.go` : `filterRecordBatch` rebinds per batch +
`AlwaysFalse` returns schemaless record.**
once for the whole rewrite (the complement filter doesn't change between
files) and passes the resulting closure into every `rewriteSingleFile` call.
The `AlwaysFalse` fast-path now returns `rec.NewSlice(0, 0)` so downstream
code that calls `rec.Column(i)` doesn't panic on the empty result.
**8. `table/transaction.go` : `AlwaysTrue` disables pushdown for the
rewrite scan.**
Left a `TODO(perf)` comment on `rewriteSingleFile` calling out the IO cost
of disabling pushdown when preserving lineage, and noted that file-level
skipping (against the bound filter's residual on file stats) could be
re-enabled here without breaking `_row_id` synthesis since synthesis only
depends on within-file row positions. Punting the actual implementation to a
follow-up — agreed it's worth fixing but didn't want to expand this PR into a
metrics-eval refactor.
**9. `table/rewrite_data_files.go` : `hasRowLineage` comment claims a
table property but checks per-task.**
Renamed to `allTasksHaveRowLineage` and rewrote the comment. The function
now also returns `false` on an empty task list (defensive), and the call-site
comment explains why we gate on "all" rather than "any" (mixed groups →
per-file invariant violation, see thread 3).
**10. `table/write_records.go` : `WithPreserveRowLineage` accepts
anything.**
Added validation in `WriteRecords`: errors when the table format version
is below 3, and errors when the input arrow schema doesn't include the
`_row_id` column. Docstring rewritten to call this out. Implementation-wise
this required a small reshuffle: `cfg` is now built before
`checkArrowSchemaCompat` runs so the compat check can be against the projected
(with-lineage) schema rather than `tbl.Schema()` — otherwise the compat check
itself would reject `_row_id` as unknown.
**11. `table/row_lineage_rewrite_test.go` : `GreaterOrEqual(5)` is
vacuous.**
Tightened to `assert.Equal(t, int64(5), tbl.Metadata().NextRowID())`. The
comment above the assertion already had the exact expected value (3 + 2), so
the loose form was just hiding regressions in either direction.
**12. `table/snapshot_producers.go` : `ManifestListWriter.V4Writer`
doesn't exist.**
Typo, fixed to `V3Writer`. Same fix applied to the matching comment in the
test that referenced V4Writer for context.
---
**Newly added tests:**
- `TestProjectionV3SelectRowLineageColumns` : `Select("_row_id")` works
on v3
- `TestProjectionRowLineageRejectedOnV1V2` : same select on v1/v2 errors
with `ErrInvalidSchema`
- `TestProjectionWithRowLineageRequiresV3` : `WithRowLineage` errors on
v1/v2
- `TestProjectionV3SchemaAlreadyHasRowID` : idempotency under reserved-id
collision
- `TestExecuteCompactionGroupPreservesRowID` : end-to-end compaction on a
v3 table preserves `_row_id` for every row across two source files
PTAL when you have a moment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]