GitHub user rcrchawla added a comment to the discussion: Airflow task failed 
but spark kube app is running

Hi @shaealh,

As discussed, we lowered the concurrency and it has been working fine, but we
hit one challenge (it has happened only once so far): the scheduler launched a
task (SparkKubernetesOperator) that ran fine for a few minutes, after which the
worker was unable to send the heartbeat signal. The heartbeat was retried 5
times, the worker then received the message "Server error: task instance is
already running somewhere else", and a new PID was launched for the same
task_id. In the Airflow logs, this second attempt waited to get the resources
to launch the driver (we have set the property "startup_timeout_seconds" to
40 minutes), but after waiting the task was killed, and we are not sure why.

Could you please suggest anything from the summarized logs below?
(attached: 20260326.zip)
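For reference, this is roughly how the operator is wired up on our side (a
simplified sketch, not the actual DAG; the dag_id, task_id, and application
file below are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG(
    dag_id="segment_groupid_example",  # placeholder, not the real dag_id
    start_date=datetime(2026, 1, 1),
    schedule=None,
) as dag:
    load_track = SparkKubernetesOperator(
        task_id="load_track_example",               # placeholder task_id
        namespace="de-services",                    # from the worker hostname
        application_file="spark_application.yaml",  # placeholder SparkApplication spec
        # In recent provider versions this is inherited from
        # KubernetesPodOperator: how long the task waits for the driver pod
        # to start before failing. Set to 40 minutes.
        startup_timeout_seconds=40 * 60,
    )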

airflow-worker-1
-------------

2026-03-26 02:14 -- 173a1136-b645-4d67-a13a-a485a3704efc -- 
load_track_group_id_nerd_portals -- 019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:26.905924 [warning  ] Failed to send heartbeat. Will be 
retried [supervisor] failed_heartbeats=1 max_retries=3 
ti_id=UUID('019d277e-99e9-763d-b75b-d6cd75120881')
2026-03-26 02:47:57.527466 [warning  ] Server error                   
[airflow.sdk.api.client] detail={'detail': {'reason': 'not_running', 'message': 
'TI is no longer in the running state and task should terminate', 
'current_state': None}}
2026-03-26 02:47:57.527755 [error    ] Server indicated the task shouldn't be 
running anymore [supervisor] detail={'detail': {'reason': 'not_running', 
'message': 'TI is no longer in the running state and task should terminate', 
'current_state': None}} status_code=409 
ti_id=UUID('019d277e-99e9-763d-b75b-d6cd75120881')
2026-03-26 02:47:58.194844 [info     ] Task 
execute_workload[3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4] received 
[celery.worker.strategy]
2026-03-26 02:47:58.228603 [info     ] [3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4] 
Executing workload in Celery: token='eyJ***' 
ti=TaskInstance(id=UUID('019d277e-99e9-763d-b75b-d6cd75120881'), 
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals', 
dag_id='segment-groupid-1', run_id='scheduled__2026-03-26T00:15:00+00:00', 
try_number=2, map_index=-1, pool_slots=1, queue='default', priority_weight=13, 
executor_config=None, parent_context_carrier={}, context_carrier={}, 
queued_dttm=datetime.datetime(2026, 3, 26, 2, 14, 17, 330486, 
tzinfo=TzInfo(UTC))) dag_rel_path=PurePosixPath('segment_groupid_1.py') 
bundle_info=BundleInfo(name='dags-folder', version=None) 
log_path='dag_id=segment-groupid-1/run_id=scheduled__2026-03-26T00:15:00+00:00/task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals/attempt=2.log'
 type='ExecuteTask' [airflow.providers.celery.executors.celery_executor_utils]
2026-03-26 02:47:58.256886 [info     ] Secrets backends loaded for worker 
[supervisor] 

2026-03-26 02:48:02.577882 [info     ] Task 
execute_workload[173a1136-b645-4d67-a13a-a485a3704efc] succeeded in 
2025.1995898000896s: None [celery.app.trace]

2026-03-26 02:48:10.794292 [info     ] Task finished                  
[supervisor] duration=12.537848281674087 exit_code=0 final_state=failed
2026-03-26 02:48:10.810689 [info     ] Task 
execute_workload[3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4] succeeded in 
12.614215551875532s: None [celery.app.trace]
2026-03-26 02:49:55.420123 [info     ] Task finished                  
[supervisor] duration=36.28673866484314 exit_code=0 final_state=success

api_server
-------------
2026-03-26 02:45:47 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:45:47 [debug    ] Retrieved current task state   
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
current_pid=1193430 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:45:47 [debug    ] Heartbeat updated              state=running 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO:     172.26.145.209:60936 - "PUT 
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat 
HTTP/1.1" 204 No Content
2026-03-26 02:46:38 [debug    ] Retrieved current task state   
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
current_pid=1193430 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:46:38 [debug    ] Heartbeat updated              state=running 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO:     172.26.145.209:50958 - "PUT 
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat 
HTTP/1.1" 204 No Content

2026-03-26 02:46:58 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881

2026-03-26 02:47:13 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881

2026-03-26 02:47:21 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881

2026-03-26 02:47:27 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881

2026-03-26 02:47:55 [debug    ] Retrieved current task state   
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
current_pid=1193430 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881

2026-03-26 02:47:55 [debug    ] Heartbeat updated              state=running 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881



Task failed with the below error:
2026-03-26 02:47:57 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:57 [debug    ] Retrieved current task state   
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
current_pid=1193430 state=None ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:57 [warning  ] Task not in running state      
current_state=None ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO:     172.26.145.209:54322 - "PUT 
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat 
HTTP/1.1" 409 Conflict


New task launched for the same task instance id:
2026-03-26 02:47:58 [debug    ] Starting task instance run     
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1201129 ti_id=019d277e-99e9-763d-b75b-d6cd75120881 unixname=airflow


2026-03-26 02:47:58 [debug    ] Retrieved task instance details 
dag_id=segment-groupid-1 state=queued 
task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [info     ] Task started                   
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
previous_state=queued ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [info     ] Task instance state updated    rows_affected=1 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO:     172.26.145.209:54334 - "PATCH 
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/run HTTP/1.1" 
200 OK
2026-03-26 02:47:58 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1201129 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [debug    ] Retrieved current task state   
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
current_pid=1201129 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [debug    ] Heartbeat updated              state=running 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO:     172.26.145.209:54334 - "PUT 
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat 
HTTP/1.1" 204 No Content



2026-03-26 02:48:09 [debug    ] Processing heartbeat           
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
pid=1201129 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:48:09 [debug    ] Retrieved current task state   
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local 
current_pid=1201129 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:48:09 [debug    ] Heartbeat updated              state=running 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO:     172.26.145.209:35238 - "PUT 
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat 
HTTP/1.1" 204 No Content
INFO:     172.26.145.209:35238 - "GET /execution/connections/smtp_default 
HTTP/1.1" 200 OK
2026-03-26 02:48:10 [debug    ] Updating task instance state   new_state=failed 
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:48:10 [debug    ] Retrieved current task instance state 
max_tries=0 previous_state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881 
try_number=2
2026-03-26 02:48:10 [debug    ] Processing heartbeat           
hostname=airflow-worker-2.airflow-worker.de-services.svc.cluster.local 
pid=1128588 ti_id=019d278c-5124-76a1-895b-d904aa955a06
2026-03-26 02:48:10 [info     ] Task instance state updated    new_state=failed 
rows_affected=1 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO:     172.26.145.209:35238 - "PATCH 
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/state HTTP/1.1" 
204 No Content



scheduler
-------------
[2026-03-26T02:14:17.329+0000] {scheduler_job_runner.py:451} INFO - 2 tasks up 
for execution:
        <TaskInstance: 
segment-groupid-2.etl_load_to_ods_group_id_2.branching_etl_step 
scheduled__2026-03-26T00:45:00+00:00 [scheduled]>
        <TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:14:17.330+0000] {scheduler_job_runner.py:662} INFO - Setting the 
following tasks to queued state:
        <TaskInstance: 
segment-groupid-2.etl_load_to_ods_group_id_2.branching_etl_step 
scheduled__2026-03-26T00:45:00+00:00 [scheduled]>
        <TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:14:17.333+0000] {scheduler_job_runner.py:768} INFO - Trying to 
enqueue tasks: [<TaskInstance: 
segment-groupid-2.etl_load_to_ods_group_id_2.branching_etl_step 
scheduled__2026-03-26T00:45:00+00:00 [scheduled]>, <TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [scheduled]>] for executor: 
CeleryExecutor(parallelism=32)
[2026-03-26T02:14:17.462+0000] {scheduler_job_runner.py:811} INFO - Received 
executor event with state queued for task instance 
TaskInstanceKey(dag_id='segment-groupid-1', 
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals', 
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=1, map_index=-1)

[2026-03-26T02:47:55.913+0000] {scheduler_job_runner.py:2210} INFO - Reset the 
following 2 orphaned TaskInstances:
        <TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [running]>
        <TaskInstance: Document_pipeline_kube.CORE.document_parser 
scheduled__2026-03-26T00:30:00+00:00 [running]>
        
[2026-03-26T02:47:58.140+0000] {scheduler_job_runner.py:451} INFO - 2 tasks up 
for execution:
        <TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:47:58.141+0000] {scheduler_job_runner.py:662} INFO - Setting the 
following tasks to queued state:
        <TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:47:58.147+0000] {scheduler_job_runner.py:768} INFO - Trying to 
enqueue tasks: [<TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [scheduled]>, <TaskInstance: 
Document_pipeline_kube.CORE.document_parser 
scheduled__2026-03-26T00:30:00+00:00 [scheduled]>] for executor: 
CeleryExecutor(parallelism=32)
[2026-03-26T02:47:58.322+0000] {scheduler_job_runner.py:811} INFO - Received 
executor event with state queued for task instance 
TaskInstanceKey(dag_id='segment-groupid-1', 
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals', 
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=2, map_index=-1)

[2026-03-26T02:47:58.336+0000] {scheduler_job_runner.py:843} INFO - Setting 
external_id for <TaskInstance: 
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
 scheduled__2026-03-26T00:15:00+00:00 [running]> to 
3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4
[2026-03-26T02:48:04.417+0000] {scheduler_job_runner.py:811} INFO - Received 
executor event with state success for task instance 
TaskInstanceKey(dag_id='segment-groupid-1', 
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals', 
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=1, map_index=-1)
[2026-03-26T02:48:04.429+0000] {scheduler_job_runner.py:853} INFO - 
TaskInstance Finished: dag_id=segment-groupid-1, 
task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals, 
run_id=scheduled__2026-03-26T00:15:00+00:00, map_index=-1, 
run_start_date=2026-03-26 02:47:58.263013+00:00, run_end_date=None, 
run_duration=None, state=running, executor=CeleryExecutor(parallelism=32), 
executor_state=success, try_number=1, max_tries=0, pool=default_pool, 
queue=default, priority_weight=13, operator=SparkKubernetesOperator, 
queued_dttm=2026-03-26 02:47:58.142052+00:00, scheduled_dttm=2026-03-26 
02:47:58.060945+00:00,queued_by_job_id=2414479, pid=1201129
[2026-03-26T02:48:12.148+0000] {scheduler_job_runner.py:811} INFO - Received 
executor event with state success for task instance 
TaskInstanceKey(dag_id='segment-groupid-1', 
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals', 
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=2, map_index=-1)
[2026-03-26T02:48:12.159+0000] {scheduler_job_runner.py:853} INFO - 
TaskInstance Finished: dag_id=segment-groupid-1, 
task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals, 
run_id=scheduled__2026-03-26T00:15:00+00:00, map_index=-1, 
run_start_date=2026-03-26 02:47:58.263013+00:00, run_end_date=2026-03-26 
02:48:10.755668+00:00, run_duration=12.4927, state=failed, 
executor=CeleryExecutor(parallelism=32), executor_state=success, try_number=2, 
max_tries=0, pool=default_pool, queue=default, priority_weight=13, 
operator=SparkKubernetesOperator, queued_dttm=2026-03-26 02:47:58.142052+00:00, 
scheduled_dttm=2026-03-26 02:47:58.060945+00:00,queued_by_job_id=2414479, 
pid=1201129



GitHub link: 
https://github.com/apache/airflow/discussions/63298#discussioncomment-16461428
