johnhoran opened a new pull request, #65202:
URL: https://github.com/apache/airflow/pull/65202

   We had a recent task failure for which we were only alerted once we hit the 
dagrun timeout.  The logs looked like:
   ```
   [2026-04-10, 09:02:25 UTC] {pod.py:1425} INFO - Building pod ...-fwg83fmr 
with labels: {'dag_id': '...', 'task_id': '...-cdf8135eb', 'run_id': 
'scheduled__2026-04-09T0900000000-020f81fe0', 'kubernetes_pod_operator': 
'True', 'try_number': '1'}
   [2026-04-10, 09:02:25 UTC] {pod.py:601} INFO - Found matching pod 
...-fwg83fmr with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': 
'2.11.2-astro.2', 'app': 'airflow', 'astronomer.io/cloud_provider': 'aws', 
'astronomer.io/cloud_region': 'us-west-2', 'astronomer.io/deploymentId': '...', 
'astronomer.io/organizationId': '...', 'astronomer.io/workspaceId': '...', 
'dag_id': '...', 'kubernetes_pod_operator': 'True', 'run_id': 
'scheduled__2026-04-09T0900000000-020f81fe0', 'task_id': '...-cdf8135eb', 
'try_number': '1'}
   [2026-04-10, 09:02:25 UTC] {pod.py:602} INFO - `try_number` of 
task_instance: 1
   [2026-04-10, 09:02:25 UTC] {pod.py:603} INFO - `try_number` of pod: 1
   [2026-04-10, 09:02:25 UTC] {pod.py:895} WARNING - Could not resolve 
connection extras for deferral: connection `kubernetes_default` not found. 
Triggerer will try to resolve it from its own environment.
   [2026-04-10, 09:02:25 UTC] {taskinstance.py:297} INFO - Pausing task as 
DEFERRED. dag_id=..., task_id=..._opportunity_daily_stage_run, 
run_id=scheduled__2026-04-09T09:00:00+00:00, execution_date=20260409T090000, 
start_date=20260410T090223
   [2026-04-10, 09:02:25 UTC] {taskinstance.py:349} ▶ Post task execution logs
   [2026-04-10, 09:02:26 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' 
in namespace '...' with poll interval 2.
   [2026-04-10, 09:02:26 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get 
the POD scheduled...
   [2026-04-10, 09:02:26 UTC] {kubernetes.py:1160} WARNING - Kubernetes API 
does not permit watching events; falling back to polling: (403)
   Reason: Forbidden: events is forbidden: User 
"system:serviceaccount:...:...-triggerer-serviceaccount" cannot watch resource 
"events" in API group "" in the namespace "..."
   [2026-04-10, 09:02:26 UTC] {pod_manager.py:116} INFO - The Pod has an Event: 
0/13 nodes are available: 1 node(s) had untolerated taint 
{karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node 
affinity/selector, 2 node(s) had untolerated taint 
{eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint 
{astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption: 
not eligible due to preemptionPolicy=Never. from None
   [2026-04-10, 09:02:31 UTC] {pod_manager.py:116} INFO - The Pod has an Event: 
Pod should schedule on: nodeclaim/airflow-worker-primary-9km4s from None
   [2026-04-10, 09:02:36 UTC] {pod_manager.py:116} INFO - The Pod has an Event: 
0/14 nodes are available: 1 node(s) had untolerated taint 
{ebs.csi.aws.com/agent-not-ready: }, 1 node(s) had untolerated taint 
{karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node 
affinity/selector, 2 node(s) had untolerated taint 
{eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint 
{astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption: 
not eligible due to preemptionPolicy=Never. from None
   [2026-04-10, 09:02:52 UTC] {pod_manager.py:150} ▲▲▲ Log group end
   [2026-04-10, 11:07:15 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' 
in namespace '...' with poll interval 2.
   [2026-04-10, 11:07:15 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get 
the POD scheduled...
   [2026-04-10, 11:07:16 UTC] {pod_manager.py:150} ▲▲▲ Log group end
   [2026-04-10, 11:11:54 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' 
in namespace '...' with poll interval 2.
   [2026-04-10, 11:11:54 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get 
the POD scheduled...
   [2026-04-10, 11:11:54 UTC] {pod_manager.py:150} ▲▲▲ Log group end
   [2026-04-10, 11:18:55 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' 
in namespace '...' with poll interval 2.
   [2026-04-10, 11:18:56 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get 
the POD scheduled...
   [2026-04-10, 11:18:56 UTC] {pod_manager.py:150} ▲▲▲ Log group end
   [2026-04-10, 12:00:03 UTC] {pod.py:448} INFO - Deleting pod ...-fwg83fmr in 
namespace ....
   [2026-04-10, 12:00:03 UTC] {pod.py:456} ERROR - Unexpected error while 
deleting pod ...-fwg83fmr
   Traceback (most recent call last):
     File 
"/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", 
line 558, in cleanup_finished_triggers
       result = details["task"].result()
                ^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 518, 
in thread_handler
       raise exc_info[1]
     File 
"/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", 
line 630, in run_trigger
       async for event in trigger.run():
     File 
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
 line 206, in run
       event = await self._wait_for_container_completion()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
 line 340, in _wait_for_container_completion
       await asyncio.sleep(self.poll_interval)
     File "/usr/local/lib/python3.12/asyncio/tasks.py", line 665, in sleep
       return await future
              ^^^^^^^^^^^^
   asyncio.exceptions.CancelledError
   During handling of the above exception, another exception occurred:
   Traceback (most recent call last):
     File 
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
 line 450, in cleanup
       await self.hook.delete_pod(
     File 
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 
189, in async_wrapped
       return await copy(fn, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 
111, in __call__
       do = await self.iter(retry_state=retry_state)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 
153, in iter
       result = await action(retry_state)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/site-packages/tenacity/_utils.py", line 
99, in inner
       return call(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 
400, in <lambda>
       self._add_action_func(lambda rs: rs.outcome.result())
                                        ^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in 
result
       return self.__get_result()
              ^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in 
__get_result
       raise self._exception
     File 
"/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 
114, in __call__
       result = await fn(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
 line 1008, in delete_pod
       await v1_api.delete_namespaced_pod(
     File 
"/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
 line 117, in call_api
       return await super().call_api(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
 line 192, in __call_api
       raise e
     File 
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
 line 185, in __call_api
       response_data = await self.request(
                       ^^^^^^^^^^^^^^^^^^^
     File 
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py", 
line 239, in DELETE
       return (await self.request("DELETE", url,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py", 
line 206, in request
       raise ApiException(http_resp=r)
   kubernetes_asyncio.client.exceptions.ApiException: (403)
   Reason: Forbidden
   HTTP response headers: <CIMultiDictProxy('Audit-Id': 
'96014b60-12ff-4e23-8ef2-15949b6bb0c4', 'Cache-Control': 'no-cache, private', 
'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 
'X-Kubernetes-Pf-Flowschema-Uid': '332d44d3-abc1-4edf-9669-08749324024e', 
'X-Kubernetes-Pf-Prioritylevel-Uid': '04963fcf-132d-4951-a31a-17392195da29', 
'Date': 'Fri, 10 Apr 2026 12:00:03 GMT', 'Content-Length': '499')>
   HTTP response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
 \"...-fwg83fmr\" is forbidden: User 
\"system:serviceaccount:...:...-triggerer-serviceaccount\" cannot delete 
resource \"pods\" in API group \"\" in the namespace 
\"...\"","reason":"Forbidden","details":{"name":"...-fwg83fmr","kind":"pods"},"code":403}
   ```
   
   Most notable are the `{pod_manager.py:150} ▲▲▲ Log group end` lines, which 
indicate that the pod's phase was at least no longer pending by the time it 
reached this point.  Given the other phases the pod could have been in, I 
believe the most likely explanation is a node communication issue that left the 
pod phase unknown.  That an unknown phase lets us break out of the 
`await_pod_start` loop feels incorrect.  I think the task should remain in the 
loop and be allowed to hit the scheduling timeout, just as if it were stuck in 
pending.
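   The suggested behavior can be sketched as follows.  This is a simplified 
stand-in, not the provider's actual code: `await_pod_start`, `get_phase`, and 
the timeout handling here are hypothetical names used only to illustrate 
treating `Unknown` like `Pending` so the wait only ends on a definite phase or 
the timeout:
   ```python
   import time
   from enum import Enum


   class PodPhase(str, Enum):
       PENDING = "Pending"
       RUNNING = "Running"
       SUCCEEDED = "Succeeded"
       FAILED = "Failed"
       UNKNOWN = "Unknown"


   def await_pod_start(get_phase, startup_timeout: float, poll_interval: float = 2.0) -> PodPhase:
       """Poll until the pod reaches a definite started/terminal phase, or time out.

       Unknown is treated the same as Pending: a transient node-communication
       blip keeps us in the loop rather than ending the wait early.
       """
       deadline = time.monotonic() + startup_timeout
       while time.monotonic() < deadline:
           phase = get_phase()
           if phase not in (PodPhase.PENDING, PodPhase.UNKNOWN):
               # Only a definite phase (Running/Succeeded/Failed) ends the wait.
               return phase
           time.sleep(poll_interval)
       # Same outcome as being stuck in Pending: surface a startup timeout.
       raise TimeoutError("Pod took too long to start")
   ```
   With this shape, a phase sequence like `Unknown, Unknown, Running` returns 
normally, while a pod that never leaves `Pending` or `Unknown` hits the 
timeout instead of silently exiting the loop.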
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
