tongwai-wong-appier commented on issue #13763:
URL: https://github.com/apache/iceberg/issues/13763#issuecomment-4438421059
@kumarpritam863 Thanks for the explanation. That helps clarify the current
offset-validation path, especially for what I would call the "stale replay"
case.
To make sure I understand the guarantees correctly, would it be reasonable
to separate the situations into these three cases?
1. **Case 1: the exact same `DATA_WRITTEN(file X)` control-topic record is
   replayed**
   - same control-topic offset
   - replayed after recovery / coordinator switchover
   - from reading current `main`, this appears to be covered by:
     - filtering against committed control-topic offsets before
       `commitToTable()`
     - plus `SnapshotAncestryValidator`
2. **Case 2: two distinct `DATA_WRITTEN(file X)` records are buffered, but
   both land in the same commit**
   - from reading current `main`, this appears to be covered by
     `distinctByKey(ContentFile::location)` inside `commitToTable()`
3. **Case 3: two distinct `DATA_WRITTEN(file X)` records are produced across
   two different commit cycles / snapshots**
   - for example:
     - commit A already appends file `X` into snapshot `S1`
     - then a later `StartCommit` causes a new `DATA_WRITTEN(file X)` event
       to be emitted again
     - that later event has a new control-topic offset and may belong to a
       new commitId
     - then commit B attempts to append the same physical file `X` into
       snapshot `S2`
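
To make the three cases concrete, here is a toy model in plain Java. It is not Iceberg code: `DataWritten`, `ToyCoordinator`, and the file path are hypothetical stand-ins, and the two guards only approximate my reading of the offset filter and the per-commit `distinctByKey` step. Under those assumptions, Cases 1 and 2 are absorbed, while Case 3 produces a second append of the same file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupCasesSketch {

    /** Hypothetical stand-in for a DATA_WRITTEN control-topic record. */
    record DataWritten(long controlOffset, String fileLocation) {}

    /** Toy coordinator: one committed offset and one file list per snapshot. */
    static class ToyCoordinator {
        long committedOffset = -1L;
        final List<List<String>> snapshots = new ArrayList<>();

        /** Models a single commit cycle over the buffered events. */
        void commit(List<DataWritten> buffered) {
            Map<String, DataWritten> byLocation = new LinkedHashMap<>();
            long maxOffset = committedOffset;
            for (DataWritten event : buffered) {
                // Case 1 guard (assumed): drop records at or below the committed offset.
                if (event.controlOffset() <= committedOffset) {
                    continue;
                }
                // Case 2 guard (assumed): keep one event per file location in this commit.
                byLocation.putIfAbsent(event.fileLocation(), event);
                maxOffset = Math.max(maxOffset, event.controlOffset());
            }
            if (!byLocation.isEmpty()) {
                snapshots.add(new ArrayList<>(byLocation.keySet()));
            }
            committedOffset = maxOffset;
        }

        /** Counts how many snapshots appended the given file location. */
        long appendCount(String location) {
            return snapshots.stream().filter(s -> s.contains(location)).count();
        }
    }

    public static void main(String[] args) {
        String x = "s3://bucket/X.parquet"; // hypothetical path
        ToyCoordinator coordinator = new ToyCoordinator();

        // Case 2: two distinct records for X buffered into the same commit.
        coordinator.commit(List.of(new DataWritten(10, x), new DataWritten(11, x)));
        // Case 1: the record at offset 10 is replayed after recovery.
        coordinator.commit(List.of(new DataWritten(10, x)));
        // Case 3: a new record for X arrives with a newer offset.
        coordinator.commit(List.of(new DataWritten(12, x)));

        System.out.println("snapshots appending X: " + coordinator.appendCount(x));
    }
}
```

In this model the final count is 2: neither guard looks at files appended by earlier snapshots, which is the gap I am asking about.
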
Our current suspicion is that our incident may be closer to **Case 3**, not
Case 1.
So the question I want to confirm is:

> Does current `main` also guarantee deduplication / rejection for
> **Case 3**?
>
> In other words, if the second `DATA_WRITTEN(file X)` is not a replay of
> the old control-topic record, but a newly produced control-topic event
> with a newer offset, which part of the current logic prevents `file X`
> from being appended again into a later snapshot?

The reason I am asking is that the current protections seem clearly tied to:

- stale control-topic offset replay, and
- concurrent commit validation

but I am not yet seeing an obvious cross-snapshot file-level idempotency
check for the "same physical file, new control-topic event" case.
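
For clarity, this is roughly the kind of check I mean, as a minimal plain-Java sketch. `filterNew` and the `alreadyAppended` set are hypothetical names; how such a set would be populated from table state is left open, and I am not claiming anything like this exists in `main`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FileLevelIdempotencySketch {

    /**
     * Returns the candidate locations that have not been appended before and
     * records them as appended. Both arguments are hypothetical: the set
     * stands for "file locations already present in earlier snapshots".
     */
    static List<String> filterNew(Set<String> alreadyAppended, List<String> candidates) {
        List<String> fresh = new ArrayList<>();
        for (String location : candidates) {
            // Set.add returns false if the location is already known, so this
            // rejects duplicates across commits and within the same batch.
            if (alreadyAppended.add(location)) {
                fresh.add(location);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        Set<String> alreadyAppended = new HashSet<>();

        // Commit A appends X.
        System.out.println(filterNew(alreadyAppended, List.of("X.parquet")));
        // Commit B sees X again via a new control-topic event, plus a new file Y.
        System.out.println(filterNew(alreadyAppended, List.of("X.parquet", "Y.parquet")));
    }
}
```

The point of the sketch is only that the key is the file location, independent of the control-topic offset or commitId that carried the event.
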
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]