Yanis Djeridi created FLINK-39356:
-------------------------------------
Summary: FlinkStateSnapshot cleanup fails with NPE when status is
null, permanently blocking CR and namespace deletion
Key: FLINK-39356
URL: https://issues.apache.org/jira/browse/FLINK-39356
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.14.0
Reporter: Yanis Djeridi
## Problem
When a FlinkStateSnapshot CR is deleted before the FlinkStateSnapshotController
has reconciled it, the cleanup() method throws a NullPointerException on
getStatus().getState(). The failure is handled without ever removing the
finalizer, so the CR stays stuck in the Terminating state permanently. This
also blocks deletion of the namespace that contains it.
## Root Cause
FlinkStateSnapshotController.cleanup() and
FlinkResourceContextFactory.getFlinkStateSnapshotContext() assume the status is
non-null, but FlinkStateSnapshot CRs are created without a status (the status
subresource is only populated when reconcile() runs). If the CR receives a
deletion timestamp before reconcile() runs, the Java Operator SDK (JOSDK)
calls cleanup() directly, and the only code that initializes a null status
lives in reconcile(), not in cleanup().
The NPE propagates to JOSDK (or is caught by the controller's catch block
returning noFinalizerRemoval()), causing an infinite retry loop where every
attempt crashes the same way.
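The retry loop can be illustrated with a minimal, self-contained sketch. All class names, the `DeleteOutcome` enum, and the `retry` helper are hypothetical stand-ins, not the operator's or JOSDK's actual types. It only models the pattern described above: a null status is dereferenced, the exception is caught, and finalizer removal is refused on every attempt.

```java
import java.util.ArrayList;
import java.util.List;

final class NullStatusRetryLoopSketch {
    // Hypothetical stand-in for the snapshot status class.
    static class Status {
        private final String state = "COMPLETED";
        String getState() { return state; }
    }

    // Hypothetical stand-in for the CR: status stays null until reconcile() runs.
    static class Snapshot {
        private Status status;
        Status getStatus() { return status; }
    }

    // Stand-in for the two delete outcomes discussed in this issue.
    enum DeleteOutcome { FINALIZER_REMOVED, NO_FINALIZER_REMOVAL }

    // Models the failing pattern: the status is dereferenced without a null check.
    static DeleteOutcome cleanup(Snapshot snapshot) {
        try {
            snapshot.getStatus().getState(); // throws NullPointerException when status is null
            return DeleteOutcome.FINALIZER_REMOVED;
        } catch (RuntimeException e) {
            // The finalizer is kept, so the framework will call cleanup() again.
            return DeleteOutcome.NO_FINALIZER_REMOVAL;
        }
    }

    // Simulates repeated cleanup attempts against the same unchanged resource.
    static List<DeleteOutcome> retry(Snapshot snapshot, int attempts) {
        List<DeleteOutcome> outcomes = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            outcomes.add(cleanup(snapshot));
        }
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println(retry(new Snapshot(), 3));
    }
}
```

Since nothing changes the resource between attempts, every attempt in this model ends with the finalizer still in place.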
reconcile() already handles this case correctly:
```
// status might be null here
flinkStateSnapshot.setStatus(
        Objects.requireNonNullElseGet(
                flinkStateSnapshot.getStatus(), FlinkStateSnapshotStatus::new));
```
cleanup() and updateErrorStatus() are missing this guard.
## Impact
- FlinkStateSnapshot CRs with null status accumulate as zombie resources with
finalizers that can never be cleared.
- Namespace deletion is blocked indefinitely.
## Fix
Add the same null-status initialization to cleanup() and updateErrorStatus()
that already exists in reconcile(). A snapshot whose status is still null was
never triggered against Flink, so no data exists on storage and cleanup can
safely proceed.
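A rough sketch of what the guarded cleanup path might look like is shown below. It is not the actual patch: the `Snapshot`, `SnapshotStatus`, `State`, and `DeleteOutcome` types are hypothetical stand-ins for the operator's and JOSDK's classes, and the default state is an assumption. Only `Objects.requireNonNullElseGet` is taken from the reconcile() snippet in this issue.

```java
import java.util.Objects;

final class NullStatusGuardSketch {
    // Hypothetical subset of snapshot states.
    enum State { TRIGGER_PENDING, COMPLETED }

    // Hypothetical stand-in for the status class; the default state is an assumption.
    static class SnapshotStatus {
        private State state = State.TRIGGER_PENDING;
        State getState() { return state; }
        void setState(State state) { this.state = state; }
    }

    // Hypothetical stand-in for the CR: status is null until first reconciled.
    static class Snapshot {
        private SnapshotStatus status;
        SnapshotStatus getStatus() { return status; }
        void setStatus(SnapshotStatus status) { this.status = status; }
    }

    // Stand-in for the two delete outcomes discussed in this issue.
    enum DeleteOutcome { FINALIZER_REMOVED, NO_FINALIZER_REMOVAL }

    static DeleteOutcome cleanup(Snapshot snapshot) {
        // Same guard as in reconcile(): keep an existing status, otherwise create a fresh one.
        snapshot.setStatus(
                Objects.requireNonNullElseGet(snapshot.getStatus(), SnapshotStatus::new));

        if (snapshot.getStatus().getState() == State.COMPLETED) {
            // A real controller would dispose of the snapshot data here.
        }
        // A never-triggered snapshot has nothing on storage, so the finalizer can go.
        return DeleteOutcome.FINALIZER_REMOVED;
    }

    public static void main(String[] args) {
        Snapshot neverReconciled = new Snapshot();
        System.out.println(cleanup(neverReconciled));
    }
}
```

In this model, a snapshot that was deleted before its first reconcile no longer throws, and an already populated status is left untouched by the guard.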