tanmayrauth commented on PR #1099:
URL: https://github.com/apache/iceberg-go/pull/1099#issuecomment-4505129784

   Thanks for the thorough pass,  addressed all 12 inline comments. Going 
through them in order:
   
     **1. `metadata_columns.go` : `SchemaWithRowID` panics on duplicate + 
aliases the input slice.**
     Renamed to `SchemaWithRowLineage`. `slices.Clone`'d the field slice up 
front so the returned schema cannot share a backing array with the input, and 
the function skips appending when a reserved field ID (`_row_id` or  
`_last_updated_sequence_number`) is already present. New test 
`TestProjectionV3SchemaAlreadyHasRowID` locks in the no-panic / no-duplicate-id 
behavior.
   
     **2. `table/transaction.go`  : `_last_updated_sequence_number` left null 
breaks CDC semantics.**
     Agreed, fixed. `rewriteSingleFile` now projects 
`_last_updated_sequence_number` via `SchemaWithRowLineage`, and 
`scanTask.DataSequenceNumber` is wired in from the source manifest entry's 
`FileSequenceNum`  (threaded through a new `fileSeqByPath` map returned from 
`classifyFilesForFilteredDeletions`). The arrow scanner's 
`synthesizeRowLineageColumns` keeps any explicit per-row value from the source 
file and  only synthesizes from the file's `data_sequence_number` for null 
source rows — so a pure CoW rewrite preserves the original "last logically 
updated" sequence per row instead of stamping the new commit's  sequence onto 
every survivor.  `TestCoWRewritePreservesRowID` was updated to assert this 
directly.
     
     **3. `table/rewrite_data_files.go`  : `hasRowLineage` permits mixed 
groups.**
     Tightened to `allTasksHaveRowLineage` (renamed function in the other 
thread too). When even one task in the group lacks `FirstRowID`, the entire 
group skips the lineage path — avoids the "some explicit  `_row_id`s, some 
nulls" output you flagged. Splitting mixed groups so each side rewrites with 
its own semantics is the cleaner long-term fix but requires changing 
`ExecuteCompactionGroup`'s output shape ,left as follow-up; called it out in 
the gating comment so we don't lose the thread.
   
     **4. `table/table.go`  : `WithRowLineage` doc says "ignored on non-v3" but 
impl projects regardless.**
     `Projection()` now returns an explicit error when `includeRowLineage` is 
set on a v1/v2 table, and the  `WithRowLineage` doc was rewritten to match. New 
test `TestProjectionWithRowLineageRequiresV3` covers  it.
   
     **5. `table/table.go`  : `WithRowLineage` name suggests whole-feature 
control.**
     Went with the "extend it" branch you suggested: `SchemaWithRowLineage` 
(the helper `WithRowLineage` calls) now projects both `_row_id` and 
`_last_updated_sequence_number`. So one switch ↔ both lineage  columns ↔ 
preservation through rewrite. Kept the name `WithRowLineage` since the 
semantics now match it.
   
     **6. `table/scanner.go`  : losing the `Select("_row_id")` escape hatch.**
     Restored. `Projection()` now intercepts row-lineage column names 
(case-aware) in the user's `selectedFields`, strips them before 
`curSchema.Select`, and appends the corresponding `RowID()` /  
`LastUpdatedSequenceNumber()` fields back onto the projected schema — but only 
on v3 tables. On v1/v2 the names fall through to `curSchema.Select`, which 
fails as before with `ErrInvalidSchema`. Restored 
`TestProjectionV3SelectRowLineageColumns` and 
`TestProjectionRowLineageRejectedOnV1V2` to lock both behaviors in.
   
     **7. `table/transaction.go`  : `filterRecordBatch` rebinds per batch + 
`AlwaysFalse` returns schemaless record.**
     once for the whole rewrite (the complement filter doesn't change between 
files) and passes the resulting closure into every `rewriteSingleFile` call. 
The `AlwaysFalse` fast-path now returns `rec.NewSlice(0,  0)` so downstream 
code that calls `rec.Column(i)` doesn't panic on the empty result.
     
     **8. `table/transaction.go`  : `AlwaysTrue` disables pushdown for the 
rewrite scan.**
     Left a `TODO(perf)` comment on `rewriteSingleFile` calling out the IO cost 
of disabling pushdown when preserving lineage, and noted that file-level 
skipping (against the bound filter's residual on file  stats) could be 
re-enabled here without breaking `_row_id` synthesis since synthesis only 
depends on within-file row positions. Punting the actual implementation to a 
follow-up — agreed it's worth fixing but didn't want to expand this PR into a 
metrics-eval refactor.
     
     **9. `table/rewrite_data_files.go`  : `hasRowLineage` comment claims a 
table property but checks per-task.**
     Renamed to `allTasksHaveRowLineage` and rewrote the comment. The function 
now also returns `false` on an empty task list (defensive), and the call-site 
comment explains why we gate on "all" rather than "any" (mixed groups → 
per-file invariant violation, see thread 3).
     
     **10. `table/write_records.go`  : `WithPreserveRowLineage` accepts 
anything.**
     Added validation in `WriteRecords`: errors when the table format version 
is below 3, and errors when the input arrow schema doesn't include the 
`_row_id` column. Docstring rewritten to call this out. Implementation-wise 
this required a small reshuffle: `cfg` is now built before 
`checkArrowSchemaCompat` runs so the compat check can be against the projected 
(with-lineage) schema rather than `tbl.Schema()` — otherwise the compat check 
itself would reject `_row_id` as unknown.
     
     **11. `table/row_lineage_rewrite_test.go`  : `GreaterOrEqual(5)` is 
vacuous.**
     Tightened to `assert.Equal(t, int64(5), tbl.Metadata().NextRowID())`. The 
comment above the assertion already had the exact expected value (3 + 2), so 
the loose form was just hiding regressions in either  direction.
     
     **12. `table/snapshot_producers.go`  : `ManifestListWriter.V4Writer` 
doesn't exist.**
     Typo, fixed to `V3Writer`. Same fix applied to the matching comment in the 
test that referenced V4Writer for context.
     
     ---
     
     **Newly added tests:**
     - `TestProjectionV3SelectRowLineageColumns`  : `Select("_row_id")` works 
on v3
     - `TestProjectionRowLineageRejectedOnV1V2`  : same select on v1/v2 errors 
with `ErrInvalidSchema`
     - `TestProjectionWithRowLineageRequiresV3`  : `WithRowLineage` errors on 
v1/v2 
     - `TestProjectionV3SchemaAlreadyHasRowID`  : idempotency under reserved-id 
collision
     - `TestExecuteCompactionGroupPreservesRowID`  : end-to-end compaction on a 
v3 table preserves `_row_id` for every row across two source files
     
     PTAL when you have a moment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to