The GitHub Actions job "Build documentation" on flink.git has succeeded.
Run started by GitHub user zentol (triggered by zentol).

Head commit for run:
c0936deaf99390fc727acc8633e3be22e62f4bf5 / Roman Khachatryan <khachatryan.ro...@gmail.com>
[FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Currently, in LEGACY restore mode, shared state of
incremental checkpoints can be discarded regardless
of whether they were created by this job or not.
This invalidates the checkpoint from which the job
was restored.

The bug was introduced in FLINK-24611. Before that, a
reference count was maintained for each shared state
entry; "initial" checkpoints did not decrement this
count, which prevented their shared state from being discarded.

This change makes SharedStateRegistry:
1. Remember the max checkpoint ID encountered during recovery
2. Associate each shared state entry with the ID of the checkpoint that created it
3. Only discard the entry if its createdByCheckpointID > highestRetainCheckpointID
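
The three steps above can be sketched roughly as follows. This is a
simplified, hypothetical model and not Flink's actual SharedStateRegistry:
the class name SketchSharedStateRegistry and all of its method names are
assumptions made for illustration. Only createdByCheckpointID and
highestRetainCheckpointID come from the commit message.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; NOT Flink's real SharedStateRegistry API.
class SketchSharedStateRegistry {
    // Maps a shared state key to the ID of the checkpoint that created it.
    private final Map<String, Long> createdByCheckpointId = new HashMap<>();
    // Highest checkpoint ID seen during recovery; -1 means "nothing restored".
    private long highestRetainCheckpointId = -1L;

    // Step 1: remember the max checkpoint ID encountered during recovery.
    void registerRestoredCheckpoint(long checkpointId) {
        highestRetainCheckpointId = Math.max(highestRetainCheckpointId, checkpointId);
    }

    // Step 2: record which checkpoint first registered this entry.
    // Later registrations of the same key keep the original creator.
    void registerState(String key, long checkpointId) {
        createdByCheckpointId.putIfAbsent(key, checkpointId);
    }

    // Step 3: discard only entries created after the highest retained checkpoint.
    // Returns true if the entry was discarded.
    boolean discardIfAllowed(String key) {
        Long createdBy = createdByCheckpointId.get(key);
        if (createdBy == null || createdBy <= highestRetainCheckpointId) {
            return false; // unknown entry, or it belongs to a restored checkpoint
        }
        createdByCheckpointId.remove(key);
        return true;
    }

    boolean contains(String key) {
        return createdByCheckpointId.containsKey(key);
    }
}
```

In this sketch, after restoring from checkpoint 5, an entry created by
checkpoint 3 is kept, while an entry created by checkpoint 7 may be
discarded once it is no longer needed.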

(1) is called from:
- CheckpointCoordinator.restoreSavepoint - to cover initial restore from a checkpoint
- SharedStateFactory, when building checkpoint store - to cover the failover case
  (see DefaultExecutionGraphFactory.createAndRestoreExecutionGraph)

Adjusting only the CheckpointCoordinator path isn't sufficient:
- the job recovers from an existing checkpoint and adds it to the store
- a new checkpoint is created with the default restore settings
- a failure happens, and the job recovers from the newer checkpoint
- when that newer checkpoint is subsumed, its (inherited) shared state
might be deleted

Report URL: https://github.com/apache/flink/actions/runs/2118126361

With regards,
GitHub Actions via GitBox
