Yanis Djeridi created FLINK-39356:
-------------------------------------

             Summary: FlinkStateSnapshot cleanup fails with NPE when status is 
null, permanently blocking CR and namespace deletion
                 Key: FLINK-39356
                 URL: https://issues.apache.org/jira/browse/FLINK-39356
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.14.0
            Reporter: Yanis Djeridi


## Problem

When a FlinkStateSnapshot CR is deleted before the FlinkStateSnapshotController 
has reconciled it, the cleanup() method throws a NullPointerException when it 
calls getStatus().getState(). The exception is handled in a way that prevents 
the finalizer from being removed, so the CR stays stuck in a terminating state 
permanently. This in turn blocks deletion of the namespace.

## Root Cause

FlinkStateSnapshotController.cleanup() and 
FlinkResourceContextFactory.getFlinkStateSnapshotContext() assume the status is 
non-null, but FlinkStateSnapshot CRs are created without a status (the status 
subresource is only populated when reconcile() runs). If the CR receives a 
deletion timestamp before reconcile() runs, JOSDK calls cleanup() directly. The 
only code path that initializes a null status is in reconcile(), so cleanup() 
receives the CR with its status still null.

The NPE propagates to JOSDK (or is caught by the controller's catch block 
returning noFinalizerRemoval()), causing an infinite retry loop where every 
attempt crashes the same way.
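
The failure mode can be sketched in isolation. The classes below (StubStatus, 
StubSnapshot, CleanupFailureSketch) are hypothetical stand-ins, not the 
operator's real types, and the bounded loop only imitates repeated retries. It 
is a minimal illustration, assuming nothing populates the status between 
attempts:

```java
// Hypothetical stand-ins for illustration only; not the operator's real classes.
class StubStatus {
    String getState() {
        return "PENDING"; // placeholder value, not a real operator state
    }
}

class StubSnapshot {
    private StubStatus status; // stays null until a reconcile() would populate it

    StubStatus getStatus() {
        return status;
    }
}

class CleanupFailureSketch {
    // Mirrors the unguarded access described above: status is dereferenced directly.
    static String unguardedCleanup(StubSnapshot snapshot) {
        return snapshot.getStatus().getState(); // throws NPE when status is null
    }

    // Runs a bounded number of attempts and counts how many end in an NPE.
    static int countFailedAttempts(StubSnapshot snapshot, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                unguardedCleanup(snapshot);
            } catch (NullPointerException e) {
                failures++; // nothing changed, so the next attempt fails the same way
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Every attempt fails because no attempt ever sets the status.
        System.out.println(countFailedAttempts(new StubSnapshot(), 3));
    }
}
```

Because no attempt changes the resource, retrying cannot make progress on its 
own; something has to initialize the status first.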

reconcile() already handles this case correctly:


```
// status might be null here
flinkStateSnapshot.setStatus(
        Objects.requireNonNullElseGet(
                flinkStateSnapshot.getStatus(), FlinkStateSnapshotStatus::new));
```
cleanup() and updateErrorStatus() are missing this guard.


## Impact

- FlinkStateSnapshot CRs with null status accumulate as zombie resources with 
finalizers that can never be cleared.
- Namespace deletion is blocked indefinitely.

## Fix 

Add the same null-status initialization to cleanup() and updateErrorStatus() 
that already exists in reconcile(). A null-status snapshot was never triggered 
against Flink, so no data exists on storage and cleanup can safely proceed.
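
A minimal sketch of how the guard might be shared between entry points is 
shown here. SketchStatus, SketchSnapshot, NullStatusGuardSketch, ensureStatus() 
and the boolean return value are all hypothetical names invented for 
illustration; only the Objects.requireNonNullElseGet(...) pattern is taken from 
the reconcile() snippet quoted in this issue. The real patch may be structured 
differently.

```java
import java.util.Objects;

// Hypothetical stand-ins for illustration only; not the operator's real classes.
class SketchStatus {
    private final String state;

    SketchStatus() {
        this("PENDING"); // placeholder default, not a real operator state
    }

    SketchStatus(String state) {
        this.state = state;
    }

    String getState() {
        return state;
    }
}

class SketchSnapshot {
    private SketchStatus status; // null until something initializes it

    SketchStatus getStatus() {
        return status;
    }

    void setStatus(SketchStatus status) {
        this.status = status;
    }
}

class NullStatusGuardSketch {
    // Same pattern as the reconcile() snippet: keep an existing status, else create one.
    static void ensureStatus(SketchSnapshot snapshot) {
        snapshot.setStatus(
                Objects.requireNonNullElseGet(snapshot.getStatus(), SketchStatus::new));
    }

    // Hypothetical cleanup: returns true when the finalizer could be removed.
    static boolean cleanup(SketchSnapshot snapshot) {
        ensureStatus(snapshot); // guard first, so the next line cannot throw an NPE
        String state = snapshot.getStatus().getState();
        return state != null;
    }

    public static void main(String[] args) {
        SketchSnapshot neverReconciled = new SketchSnapshot();
        System.out.println(cleanup(neverReconciled)); // completes without an NPE
    }
}
```

Calling the guard at the top of each entry point keeps the initialization in 
one place, and an already-populated status is left untouched.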



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
