johnhoran opened a new pull request, #65202:
URL: https://github.com/apache/airflow/pull/65202
We recently had a task failure that we were only alerted to once we hit the
dagrun timeout. The logs looked like:
```
[2026-04-10, 09:02:25 UTC] {pod.py:1425} INFO - Building pod ...-fwg83fmr
with labels: {'dag_id': '...', 'task_id': '...-cdf8135eb', 'run_id':
'scheduled__2026-04-09T0900000000-020f81fe0', 'kubernetes_pod_operator':
'True', 'try_number': '1'}
[2026-04-10, 09:02:25 UTC] {pod.py:601} INFO - Found matching pod
...-fwg83fmr with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version':
'2.11.2-astro.2', 'app': 'airflow', 'astronomer.io/cloud_provider': 'aws',
'astronomer.io/cloud_region': 'us-west-2', 'astronomer.io/deploymentId': '...',
'astronomer.io/organizationId': '...', 'astronomer.io/workspaceId': '...',
'dag_id': '...', 'kubernetes_pod_operator': 'True', 'run_id':
'scheduled__2026-04-09T0900000000-020f81fe0', 'task_id': '...-cdf8135eb',
'try_number': '1'}
[2026-04-10, 09:02:25 UTC] {pod.py:602} INFO - `try_number` of
task_instance: 1
[2026-04-10, 09:02:25 UTC] {pod.py:603} INFO - `try_number` of pod: 1
[2026-04-10, 09:02:25 UTC] {pod.py:895} WARNING - Could not resolve
connection extras for deferral: connection `kubernetes_default` not found.
Triggerer will try to resolve it from its own environment.
[2026-04-10, 09:02:25 UTC] {taskinstance.py:297} INFO - Pausing task as
DEFERRED. dag_id=..., task_id=..._opportunity_daily_stage_run,
run_id=scheduled__2026-04-09T09:00:00+00:00, execution_date=20260409T090000,
start_date=20260410T090223
[2026-04-10, 09:02:25 UTC] {taskinstance.py:349} ▶ Post task execution logs
[2026-04-10, 09:02:26 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr'
in namespace '...' with poll interval 2.
[2026-04-10, 09:02:26 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get
the POD scheduled...
[2026-04-10, 09:02:26 UTC] {kubernetes.py:1160} WARNING - Kubernetes API
does not permit watching events; falling back to polling: (403)
Reason: Forbidden: events is forbidden: User
"system:serviceaccount:...:...-triggerer-serviceaccount" cannot watch resource
"events" in API group "" in the namespace "..."
[2026-04-10, 09:02:26 UTC] {pod_manager.py:116} INFO - The Pod has an Event:
0/13 nodes are available: 1 node(s) had untolerated taint
{karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node
affinity/selector, 2 node(s) had untolerated taint
{eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint
{astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption:
not eligible due to preemptionPolicy=Never. from None
[2026-04-10, 09:02:31 UTC] {pod_manager.py:116} INFO - The Pod has an Event:
Pod should schedule on: nodeclaim/airflow-worker-primary-9km4s from None
[2026-04-10, 09:02:36 UTC] {pod_manager.py:116} INFO - The Pod has an Event:
0/14 nodes are available: 1 node(s) had untolerated taint
{ebs.csi.aws.com/agent-not-ready: }, 1 node(s) had untolerated taint
{karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node
affinity/selector, 2 node(s) had untolerated taint
{eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint
{astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption:
not eligible due to preemptionPolicy=Never. from None
[2026-04-10, 09:02:52 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:07:15 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr'
in namespace '...' with poll interval 2.
[2026-04-10, 11:07:15 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get
the POD scheduled...
[2026-04-10, 11:07:16 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:11:54 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr'
in namespace '...' with poll interval 2.
[2026-04-10, 11:11:54 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get
the POD scheduled...
[2026-04-10, 11:11:54 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:18:55 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr'
in namespace '...' with poll interval 2.
[2026-04-10, 11:18:56 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get
the POD scheduled...
[2026-04-10, 11:18:56 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 12:00:03 UTC] {pod.py:448} INFO - Deleting pod ...-fwg83fmr in
namespace ....
[2026-04-10, 12:00:03 UTC] {pod.py:456} ERROR - Unexpected error while
deleting pod ...-fwg83fmr
Traceback (most recent call last):
File
"/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py",
line 558, in cleanup_finished_triggers
result = details["task"].result()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 518,
in thread_handler
raise exc_info[1]
File
"/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py",
line 630, in run_trigger
async for event in trigger.run():
File
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
line 206, in run
event = await self._wait_for_container_completion()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
line 340, in _wait_for_container_completion
await asyncio.sleep(self.poll_interval)
File "/usr/local/lib/python3.12/asyncio/tasks.py", line 665, in sleep
return await future
^^^^^^^^^^^^
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
line 450, in cleanup
await self.hook.delete_pod(
File
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line
189, in async_wrapped
return await copy(fn, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line
111, in __call__
do = await self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line
153, in iter
result = await action(retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/tenacity/_utils.py", line
99, in inner
return call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line
400, in <lambda>
self._add_action_func(lambda rs: rs.outcome.result())
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in
result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in
__get_result
raise self._exception
File
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line
114, in __call__
result = await fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
line 1008, in delete_pod
await v1_api.delete_namespaced_pod(
File
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
line 117, in call_api
return await super().call_api(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
line 192, in __call_api
raise e
File
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
line 185, in __call_api
response_data = await self.request(
^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
line 239, in DELETE
return (await self.request("DELETE", url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
line 206, in request
raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Audit-Id':
'96014b60-12ff-4e23-8ef2-15949b6bb0c4', 'Cache-Control': 'no-cache, private',
'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff',
'X-Kubernetes-Pf-Flowschema-Uid': '332d44d3-abc1-4edf-9669-08749324024e',
'X-Kubernetes-Pf-Prioritylevel-Uid': '04963fcf-132d-4951-a31a-17392195da29',
'Date': 'Fri, 10 Apr 2026 12:00:03 GMT', 'Content-Length': '499')>
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
\"...-fwg83fmr\" is forbidden: User
\"system:serviceaccount:...:...-triggerer-serviceaccount\" cannot delete
resource \"pods\" in API group \"\" in the namespace
\"...\"","reason":"Forbidden","details":{"name":"...-fwg83fmr","kind":"pods"},"code":403}
```
Most notable are the `{pod_manager.py:150} ▲▲▲ Log group end` lines, which
indicate that the pod was, at minimum, no longer in the pending phase when it
reached this point. Given the other phases the pod could have been in, I
believe the most likely explanation is that there was a node communication
issue and the pod phase was unknown. That this lets us break out of the
`await_pod_start` loop feels incorrect. I think it should remain in the loop
and be allowed to hit the scheduling timeout, the same as if the pod were
stuck in pending.
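A minimal sketch of the intended behavior (this is not the provider's real API; `get_pod_phase` and the phase strings here are stand-ins for illustration): treat an `Unknown` phase the same as `Pending`, so the wait loop only exits early on a definitive phase change and otherwise runs until the scheduling timeout.

```python
import asyncio
import time

PENDING = "Pending"
UNKNOWN = "Unknown"


async def await_pod_start_sketch(get_pod_phase, timeout: float, poll_interval: float = 2.0):
    """Wait for a pod to leave Pending, treating Unknown like Pending.

    `get_pod_phase` is a hypothetical async callable returning the pod phase.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = await get_pod_phase()
        # An Unknown phase (e.g. lost node communication) should not break us
        # out of the wait loop; keep polling, same as for Pending.
        if phase not in (PENDING, UNKNOWN):
            return phase
        await asyncio.sleep(poll_interval)
    raise TimeoutError("pod did not leave Pending/Unknown before the scheduling timeout")
```

With this shape, an `Unknown` reading keeps the trigger polling until either the pod reports a real phase or the 600s scheduling timeout fires, rather than silently ending the log group and re-entering the check much later.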