tanmayrauth commented on issue #999: URL: https://github.com/apache/iceberg-go/issues/999#issuecomment-4462632884
Here's my thinking on the three open questions. Cross-referenced Java's
implementation of Spark/Flink write paths.
---
1. Where the rewrite path reads source _row_id
I'd go with scanner projection rather than re-reading source Parquet
separately.
During a rewrite, we add _row_id and _last_updated_sequence_number to the
scan's projected schema. The scanner already synthesizes these from file
metadata (first_row_id + position). The writer then stores the values
explicitly in output Parquet instead of leaving them null.
Java does the same: SparkTable adds lineage columns to the read schema
when isTableRewrite=true. The reader transparently handles both cases: files
that already have an explicit _row_id column (from a previous rewrite) use the
stored value; files without it get synthesis. Our synthesizeRowLineageColumns
already implements this dual path.
The alternative (re-reading source Parquet outside the scanner) would add
a second I/O pass, duplicate synthesis logic, and break on multi-rewrite files
that already have explicit _row_id. I don't see a reason to diverge from Java
here.
I'm thinking a WithPreserveRowLineage() writer option; the change lives
entirely at the scan+write layer, and the snapshot producer stays unaware.
---
2. next-row-id accounting for rewrites
@laskoviymishka I believe the current logic is already correct and no
change is needed.
Today addedRows = *writer.NextRowID() - firstRowID counts all rows in new
manifests, including preserved survivors. This "wastes" ID space but doesn't
violate uniqueness; the actual row IDs seen at query time come from the
explicit Parquet column, not the global counter.
Java does the same. ManifestListWriter.V4Writer advances nextRowId by the
full manifest row count regardless of whether rows inside preserve old IDs. The
ID space is int64, so waste isn't a practical concern.
I'd add a test + comment documenting this is intentional rather than
changing the accounting.
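To make the claim concrete, here's a minimal sketch (illustrative names, not the real API) of the accounting: the counter advances by the full manifest row count, survivors included, and the "wasted" range causes no collision because readers take IDs from the explicit column.

```go
package main

import "fmt"

// advanceNextRowID mirrors addedRows = *writer.NextRowID() - firstRowID:
// the counter moves by every row written to the new manifests, whether
// or not those rows preserve an old explicit _row_id. Hypothetical
// sketch, not the iceberg-go implementation.
func advanceNextRowID(firstRowID, manifestRows int64) (addedRows, nextRowID int64) {
	addedRows = manifestRows
	nextRowID = firstRowID + addedRows
	return
}

func main() {
	// 800 survivors keep their old explicit IDs and 200 rows are
	// fresh; the counter still advances by all 1000.
	added, next := advanceNextRowID(1000, 800+200)
	fmt.Println(added, next) // 1000 2000
	// IDs [1000, 1800) go unused by any live row, which is harmless
	// in an int64 ID space: no two live rows ever share an ID.
}
```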
---
3. MoR equality-rewrite: which rows get fresh IDs?
For pure compaction (bin-pack), I don't think we need a distinction: lineage
columns flow through as regular data from the scan projection.
For MoR UPDATEs (position-delta path), the writer needs to know intent.
I'm thinking an insert vs reinsert split on the writer API:
- Reinsert(record, rowID, seqNum) - survivor row; preserves _row_id and
nulls _last_updated_sequence_number so it inherits the new file's
data_sequence_number
- Insert(record) - fresh row; both columns null, gets new identity at
read time
Java uses this exact pattern: SparkPositionDeltaWrite.reinsert(meta, row)
vs insert(row).
Worth noting: equality deletes cannot preserve lineage per spec; the engine
writes without reading old identity. Only pos-delete rewrites and CoW can
preserve.
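Here's how the split could look on the writer side. All names are hypothetical (modeled loosely on Java's SparkPositionDeltaWrite, and simplified to omit the seqNum argument): Reinsert carries the survivor's _row_id forward with a null sequence number, Insert leaves both columns null for fresh identity at read time.

```go
package main

import "fmt"

// row models the two lineage columns: nil rowID means "assign at read
// time", nil updateSeq means "inherit the file's data_sequence_number".
type row struct {
	data      string
	rowID     *int64 // explicit _row_id
	updateSeq *int64 // _last_updated_sequence_number
}

// deltaWriter is a hypothetical sketch of the proposed API.
type deltaWriter struct{ out []row }

// Reinsert writes a survivor: _row_id is preserved, the sequence number
// is deliberately left null so the new file's data_sequence_number applies.
func (w *deltaWriter) Reinsert(data string, rowID int64) {
	w.out = append(w.out, row{data: data, rowID: &rowID})
}

// Insert writes a brand-new row with no lineage: both columns null.
func (w *deltaWriter) Insert(data string) {
	w.out = append(w.out, row{data: data})
}

func main() {
	var w deltaWriter
	w.Reinsert("updated row", 42) // survivor keeps _row_id 42
	w.Insert("new row")           // fresh identity at read time

	for _, r := range w.out {
		if r.rowID != nil {
			fmt.Println(r.data, *r.rowID)
		} else {
			fmt.Println(r.data, "fresh")
		}
	}
}
```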
---
PR sequence (proposed)
1. CoW rewrite preserves IDs: lineage columns in compaction scan,
explicit write in output Parquet
2. Accounting validation: test + comment confirming overcounting is
correct
3. MoR equality-rewrite: insert/reinsert writer API
@zeroshade @laskoviymishka Let me know if any of this feels off or if
you'd approach something differently.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
