Hi Yanis,

Thanks for bringing this up. I hadn’t looked at this before, but I took a crack 
at it and have a few observations. (BTW, have you already started or completed 
work on this?)

The problem statements and root causes are accurate, but a few items in the 
expected behavior may need adjustment. I’m also adding some edge cases and 
outlining potential work items. Please let me know your thoughts.

Clarifications:
  - FlinkDeployment: "A subsequent change to running triggers a normal first 
deployment" is incorrect. Once lastReconciledSpec is written for the suspended 
initial state, isBeforeFirstDeployment() is permanently false, so the change to 
running is handled by the upgrade path, not the first-deployment path. Maybe it 
should be described as a STATELESS upgrade instead.
  - FlinkBlueGreenDeployment: instead of "observedGeneration set on every 
status update", observedGeneration should only advance when a spec generation 
is fully reconciled (i.e., when lastReconciledSpec is written), not on 
transition progress patches. We’re probably referring to the same thing; I’m 
just clarifying.
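
To make both clarifications concrete, here is a rough sketch. It is a 
simplified model, not the operator's actual code: the Status class, 
recordReconciledSpec() and pathFor() are hypothetical stand-ins, and I'm 
assuming "before first deployment" simply means that no spec has been recorded 
yet.

```java
// Simplified model, not the operator's real classes. It only encodes the two
// points above: path selection and when observedGeneration advances.
final class FirstDeploymentSketch {
    enum JobState { RUNNING, SUSPENDED }

    /** Hypothetical stand-in for the resource status. */
    static final class Status {
        JobState lastReconciledSpec; // null until a spec has been recorded
        long observedGeneration;     // stays 0 until a generation is reconciled
    }

    // Assumption: "before first deployment" means no spec was ever recorded.
    static boolean isBeforeFirstDeployment(Status status) {
        return status.lastReconciledSpec == null;
    }

    // Records the spec and advances observedGeneration in the same step, so
    // the generation never moves on a mere transition progress patch.
    static void recordReconciledSpec(Status status, JobState specState, long generation) {
        status.lastReconciledSpec = specState;
        status.observedGeneration = generation;
    }

    // Which reconciliation path a later spec change would take.
    static String pathFor(Status status) {
        return isBeforeFirstDeployment(status) ? "first-deployment" : "upgrade";
    }

    public static void main(String[] args) {
        Status status = new Status();
        System.out.println(pathFor(status)); // prints first-deployment
        recordReconciledSpec(status, JobState.SUSPENDED, 1L);
        System.out.println(pathFor(status)); // prints upgrade
    }
}
```

Once the suspended spec has been acknowledged, the later change to running can 
only go through the upgrade path in this model, which is the point of the first 
clarification.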

Missing details or edge cases:
  - FlinkDeployment:
        - If job.initialSavepointPath is set alongside job.state: suspended, it 
will be silently lost when the user later changes to running. The 
first-deployment path normally handles this by copying it to 
upgradeSavepointPath; the upgrade path does not.
        - updateStatusForSpecReconciliation() automatically calls 
markReconciledSpecAsStable() when the spec's job state is SUSPENDED. 
Acknowledging a suspended initial spec should result in 
reconciliationStatus.state = STABLE, not UPGRADING or DEPLOYED, right?
        - Error semantics change: currently, if there’s an error while 
isBeforeFirstDeployment() is true, getLifecycleState() returns FAILED. After 
this fix, with lastReconciledSpec set to SUSPENDED, an error would leave 
lifecycleState as SUSPENDED, because the SUSPENDED check runs before the 
job-failed/JM-error checks. Let’s decide whether errors during the 
suspended-initial phase should surface as FAILED or SUSPENDED.
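
The ordering concern in the last point can be sketched as follows. This is a 
toy model of the check order as I described it, not the real 
getLifecycleState(): the method names and boolean inputs are hypothetical, 
CREATED for the error-free initial case is my assumption, and OTHER stands in 
for every state not relevant here.

```java
// Toy model of the lifecycle-state check order. Names are illustrative only.
final class LifecycleOrderingSketch {
    enum Lifecycle { CREATED, SUSPENDED, FAILED, OTHER }

    // Order described above: the SUSPENDED check runs before the error checks.
    static Lifecycle suspendedCheckFirst(
            boolean beforeFirstDeployment, boolean reconciledSuspended, boolean hasError) {
        if (beforeFirstDeployment) {
            return hasError ? Lifecycle.FAILED : Lifecycle.CREATED;
        }
        if (reconciledSuspended) {
            return Lifecycle.SUSPENDED;
        }
        if (hasError) {
            return Lifecycle.FAILED;
        }
        return Lifecycle.OTHER;
    }

    // Alternative order: errors win over SUSPENDED, preserving today's FAILED.
    static Lifecycle errorCheckFirst(
            boolean beforeFirstDeployment, boolean reconciledSuspended, boolean hasError) {
        if (beforeFirstDeployment) {
            return hasError ? Lifecycle.FAILED : Lifecycle.CREATED;
        }
        if (hasError) {
            return Lifecycle.FAILED;
        }
        if (reconciledSuspended) {
            return Lifecycle.SUSPENDED;
        }
        return Lifecycle.OTHER;
    }

    public static void main(String[] args) {
        // Suspended initial spec already recorded, then an error occurs.
        System.out.println(suspendedCheckFirst(false, true, true)); // prints SUSPENDED
        System.out.println(errorCheckFirst(false, true, true));     // prints FAILED
    }
}
```

Either order is defensible; the sketch is only meant to show that the choice 
changes what users observe for the same error.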

  - FlinkBlueGreenDeployment:
        - FlinkBlueGreenDeploymentStatus doesn't extend CommonStatus by design. 
This is not a problem: we just add the property and the code to write it. The 
most natural place is inside setLastReconciledSpec() in BlueGreenUtils.
        - CRITICAL: I realized that, after recording lastReconciledSpec for the 
suspended initial state, InitializingBlueStateHandler's deploy condition 
becomes false for a SUSPENDED → RUNNING spec change (it only checks for null or 
FAILING). As a result, noUpdate() is returned and the deployment is never 
triggered. The condition must be extended to handle this transition.
        - Are you relying on lifecycleState as well? There’s no equivalent of 
this field for Blue/Green deployments. We can address this gap separately if 
necessary.
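
For the CRITICAL item, here is a minimal sketch of the problem and one possible 
extension. It is my reading of the condition, not the actual 
InitializingBlueStateHandler code: both method names are hypothetical, and I'm 
assuming that blocking on a suspended initial spec happens elsewhere.

```java
// Simplified model of the deploy condition. Names are illustrative only.
final class InitializingBlueSketch {
    enum JobState { RUNNING, SUSPENDED }

    // Condition as described above: deploy only when no spec has been
    // recorded yet (null) or the job is failing.
    static boolean shouldDeployCurrent(JobState lastReconciled, boolean failing) {
        return lastReconciled == null || failing;
    }

    // One possible extension: also deploy when the recorded state is
    // SUSPENDED and the new spec asks for RUNNING.
    static boolean shouldDeployExtended(
            JobState lastReconciled, boolean failing, JobState desired) {
        boolean resumeFromSuspended =
                lastReconciled == JobState.SUSPENDED && desired == JobState.RUNNING;
        return shouldDeployCurrent(lastReconciled, failing) || resumeFromSuspended;
    }

    public static void main(String[] args) {
        // Suspended initial spec was recorded, user now switches to RUNNING.
        System.out.println(shouldDeployCurrent(JobState.SUSPENDED, false)); // prints false
        System.out.println(
                shouldDeployExtended(JobState.SUSPENDED, false, JobState.RUNNING)); // prints true
    }
}
```

With the current condition the first call is the noUpdate() case; the extended 
variant is just one way to let the SUSPENDED → RUNNING change through.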

Sergio


> On Mar 11, 2026, at 7:21 AM, Yanis Djeridi via dev <[email protected]> 
> wrote:
> 
> Hi everyone,
> 
> I would like to start a discussion about FLINK-39243: Include 
> observedGeneration for Suspended Flink Deployments [1].
> 
> Currently, there are two gaps in how the Flink Kubernetes Operator handles 
> observedGeneration, which violates Kubernetes API conventions and breaks 
> integration with standard deployment tools (e.g., Kapp) that rely on 
> observedGeneration to determine whether a controller has processed a spec 
> change:
> 
> FlinkDeployment: When created with spec.job.state: suspended, the operator 
> returns early without updating any status fields, observedGeneration, 
> lastReconciledSpec, and lifecycleState all remain unset.
> 
> FlinkBlueGreenDeployment: The status schema does not include an 
> observedGeneration field at all, so deployment tools can never determine 
> whether the controller has processed a given generation.
> 
> The proposed changes are:
> 
> For FlinkDeployment: acknowledge the suspended spec by setting 
> status.observedGeneration, recording lastReconciledSpec with state SUSPENDED, 
> and setting lifecycleState to SUSPENDED, without deploying any Flink 
> resources.
> 
> For FlinkBlueGreenDeployment: add an observedGeneration field to the status 
> class and record lastReconciledSpec when blocking on a suspended initial 
> state.
> 
> Looking forward to your feedback on the approach!
> 
> [1] https://issues.apache.org/jira/browse/FLINK-39243
> 
> Best Regards,
> Yanis Djeridi
> 
