GitHub user rcrchawla added a comment to the discussion: Airflow task failed
but spark kube app is running
Hi @shaealh
As discussed, we lowered the concurrency and it has been working fine, but we
hit one challenge (so far only once): the scheduler launched a task
(SparkKubernetesOperator) which ran fine for a few minutes, after which the
worker was unable to send the heartbeat. It retried 5 times, the worker got
the message "Server error: task instance is already running somewhere else",
and a new PID was launched for the same task_id. In the Airflow logs we can
see this task waiting to get resources to launch the driver. We have set the
property "startup_timeout_seconds" to 40 minutes, but according to the logs
the task was killed while it was still waiting, and we are not sure why it
was killed.
Could you please suggest anything based on the summarized logs below?
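For reference, a minimal sketch of the kind of DAG/operator setup described above. This is an illustrative reconstruction, not our actual DAG file: the schedule, namespace, and `application_file` value are assumptions; the dag_id, task_id, and the 40-minute `startup_timeout_seconds` (how long the operator waits for the driver pod to start before failing the task) come from the description and logs.

```python
# Hypothetical sketch only -- schedule, namespace, and application_file
# are placeholders; dag_id/task_id/startup_timeout_seconds match the post.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG(
    dag_id="segment-groupid-1",
    start_date=datetime(2026, 1, 1),
    schedule="15 0 * * *",   # assumed; run_id suggests a 00:15 schedule
    catchup=False,
) as dag:
    load_track = SparkKubernetesOperator(
        task_id="load_track_group_id_nerd_portals",
        namespace="de-services",            # namespace seen in worker hostnames
        application_file="spark_app.yaml",  # assumed SparkApplication spec
        # 40 minutes, as set in our environment: time to wait for the
        # Spark driver pod to start before the task is considered failed.
        startup_timeout_seconds=2400,
    )
```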
airflow-worker-1
2026-03-26 02:14 -- 173a1136-b645-4d67-a13a-a485a3704efc --
load_track_group_id_nerd_portals -- 019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:26.905924 [warning ] Failed to send heartbeat. Will be
retried [supervisor] failed_heartbeats=1 max_retries=3
ti_id=UUID('019d277e-99e9-763d-b75b-d6cd75120881')
2026-03-26 02:47:57.527466 [warning ] Server error
[airflow.sdk.api.client] detail={'detail': {'reason': 'not_running', 'message':
'TI is no longer in the running state and task should terminate',
'current_state': None}}
2026-03-26 02:47:57.527755 [error ] Server indicated the task shouldn't be
running anymore [supervisor] detail={'detail': {'reason': 'not_running',
'message': 'TI is no longer in the running state and task should terminate',
'current_state': None}} status_code=409
ti_id=UUID('019d277e-99e9-763d-b75b-d6cd75120881')
2026-03-26 02:47:58.194844 [info ] Task
execute_workload[3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4] received
[celery.worker.strategy]
2026-03-26 02:47:58.228603 [info ] [3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4]
Executing workload in Celery: token='eyJ***'
ti=TaskInstance(id=UUID('019d277e-99e9-763d-b75b-d6cd75120881'),
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals',
dag_id='segment-groupid-1', run_id='scheduled__2026-03-26T00:15:00+00:00',
try_number=2, map_index=-1, pool_slots=1, queue='default', priority_weight=13,
executor_config=None, parent_context_carrier={}, context_carrier={},
queued_dttm=datetime.datetime(2026, 3, 26, 2, 14, 17, 330486,
tzinfo=TzInfo(UTC))) dag_rel_path=PurePosixPath('segment_groupid_1.py')
bundle_info=BundleInfo(name='dags-folder', version=None)
log_path='dag_id=segment-groupid-1/run_id=scheduled__2026-03-26T00:15:00+00:00/task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals/attempt=2.log'
type='ExecuteTask' [airflow.providers.celery.executors.celery_executor_utils]
2026-03-26 02:47:58.256886 [info ] Secrets backends loaded for worker
[supervisor]
2026-03-26 02:48:02.577882 [info ] Task
execute_workload[173a1136-b645-4d67-a13a-a485a3704efc] succeeded in
2025.1995898000896s: None [celery.app.trace]
2026-03-26 02:48:10.794292 [info ] Task finished
[supervisor] duration=12.537848281674087 exit_code=0 final_state=failed
2026-03-26 02:48:10.810689 [info ] Task
execute_workload[3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4] succeeded in
12.614215551875532s: None [celery.app.trace]
2026-03-26 02:49:55.420123 [info ] Task finished
[supervisor] duration=36.28673866484314 exit_code=0 final_state=success
--------------------------------------------------------------------------------------------------------
api_server
---------------------------------------------
2026-03-26 02:45:47 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:45:47 [debug ] Retrieved current task state
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
current_pid=1193430 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:45:47 [debug ] Heartbeat updated state=running
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO: 172.26.145.209:60936 - "PUT
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat
HTTP/1.1" 204 No Content
2026-03-26 02:46:38 [debug ] Retrieved current task state
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
current_pid=1193430 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:46:38 [debug ] Heartbeat updated state=running
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO: 172.26.145.209:50958 - "PUT
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat
HTTP/1.1" 204 No Content
2026-03-26 02:46:58 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:13 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:21 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:27 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:55 [debug ] Retrieved current task state
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
current_pid=1193430 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:55 [debug ] Heartbeat updated state=running
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
Task failed with the below error
2026-03-26 02:47:57 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1193430 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:57 [debug ] Retrieved current task state
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
current_pid=1193430 state=None ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:57 [warning ] Task not in running state
current_state=None ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO: 172.26.145.209:54322 - "PUT
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat
HTTP/1.1" 409 Conflict
New task launched for the same task_id
2026-03-26 02:47:58 [debug ] Starting task instance run
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1201129 ti_id=019d277e-99e9-763d-b75b-d6cd75120881 unixname=airflow
2026-03-26 02:47:58 [debug ] Starting task instance run
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1201129 ti_id=019d277e-99e9-763d-b75b-d6cd75120881 unixname=airflow
2026-03-26 02:47:58 [debug ] Retrieved task instance details
dag_id=segment-groupid-1 state=queued
task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [info ] Task started
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
previous_state=queued ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [info ] Task instance state updated rows_affected=1
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO: 172.26.145.209:54334 - "PATCH
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/run HTTP/1.1"
200 OK
2026-03-26 02:47:58 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1201129 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [debug ] Retrieved current task state
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
current_pid=1201129 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:47:58 [debug ] Heartbeat updated state=running
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO: 172.26.145.209:54334 - "PUT
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat
HTTP/1.1" 204 No Content
2026-03-26 02:48:09 [debug ] Processing heartbeat
hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
pid=1201129 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:48:09 [debug ] Retrieved current task state
current_hostname=airflow-worker-1.airflow-worker.de-services.svc.cluster.local
current_pid=1201129 state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:48:09 [debug ] Heartbeat updated state=running
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO: 172.26.145.209:35238 - "PUT
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/heartbeat
HTTP/1.1" 204 No Content
INFO: 172.26.145.209:35238 - "GET /execution/connections/smtp_default
HTTP/1.1" 200 OK
2026-03-26 02:48:10 [debug ] Updating task instance state new_state=failed
ti_id=019d277e-99e9-763d-b75b-d6cd75120881
2026-03-26 02:48:10 [debug ] Retrieved current task instance state
max_tries=0 previous_state=running ti_id=019d277e-99e9-763d-b75b-d6cd75120881
try_number=2
2026-03-26 02:48:10 [debug ] Processing heartbeat
hostname=airflow-worker-2.airflow-worker.de-services.svc.cluster.local
pid=1128588 ti_id=019d278c-5124-76a1-895b-d904aa955a06
2026-03-26 02:48:10 [info ] Task instance state updated new_state=failed
rows_affected=1 ti_id=019d277e-99e9-763d-b75b-d6cd75120881
INFO: 172.26.145.209:35238 - "PATCH
/execution/task-instances/019d277e-99e9-763d-b75b-d6cd75120881/state HTTP/1.1"
204 No Content
scheduler
-------------
[2026-03-26T02:14:17.329+0000] {scheduler_job_runner.py:451} INFO - 2 tasks up
for execution:
<TaskInstance:
segment-groupid-2.etl_load_to_ods_group_id_2.branching_etl_step
scheduled__2026-03-26T00:45:00+00:00 [scheduled]>
<TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:14:17.330+0000] {scheduler_job_runner.py:662} INFO - Setting the
following tasks to queued state:
<TaskInstance:
segment-groupid-2.etl_load_to_ods_group_id_2.branching_etl_step
scheduled__2026-03-26T00:45:00+00:00 [scheduled]>
<TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:14:17.333+0000] {scheduler_job_runner.py:768} INFO - Trying to
enqueue tasks: [<TaskInstance:
segment-groupid-2.etl_load_to_ods_group_id_2.branching_etl_step
scheduled__2026-03-26T00:45:00+00:00 [scheduled]>, <TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [scheduled]>] for executor:
CeleryExecutor(parallelism=32)
[2026-03-26T02:14:17.462+0000] {scheduler_job_runner.py:811} INFO - Received
executor event with state queued for task instance
TaskInstanceKey(dag_id='segment-groupid-1',
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals',
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=1, map_index=-1)
[2026-03-26T02:47:55.913+0000] {scheduler_job_runner.py:2210} INFO - Reset the
following 2 orphaned TaskInstances:
<TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [running]>
<TaskInstance: Document_pipeline_kube.CORE.document_parser
scheduled__2026-03-26T00:30:00+00:00 [running]>
[2026-03-26T02:47:58.140+0000] {scheduler_job_runner.py:451} INFO - 2 tasks up
for execution:
<TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:47:58.141+0000] {scheduler_job_runner.py:662} INFO - Setting the
following tasks to queued state:
<TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [scheduled]>
[2026-03-26T02:47:58.147+0000] {scheduler_job_runner.py:768} INFO - Trying to
enqueue tasks: [<TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [scheduled]>, <TaskInstance:
Document_pipeline_kube.CORE.document_parser
scheduled__2026-03-26T00:30:00+00:00 [scheduled]>] for executor:
CeleryExecutor(parallelism=32)
[2026-03-26T02:47:58.322+0000] {scheduler_job_runner.py:811} INFO - Received
executor event with state queued for task instance
TaskInstanceKey(dag_id='segment-groupid-1',
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals',
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=2, map_index=-1)
[2026-03-26T02:47:58.336+0000] {scheduler_job_runner.py:843} INFO - Setting
external_id for <TaskInstance:
segment-groupid-1.ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals
scheduled__2026-03-26T00:15:00+00:00 [running]> to
3fb3128a-86c9-43bc-a1fb-3b0a6dcbcec4
[2026-03-26T02:48:04.417+0000] {scheduler_job_runner.py:811} INFO - Received
executor event with state success for task instance
TaskInstanceKey(dag_id='segment-groupid-1',
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals',
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=1, map_index=-1)
[2026-03-26T02:48:04.429+0000] {scheduler_job_runner.py:853} INFO -
TaskInstance Finished: dag_id=segment-groupid-1,
task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals,
run_id=scheduled__2026-03-26T00:15:00+00:00, map_index=-1,
run_start_date=2026-03-26 02:47:58.263013+00:00, run_end_date=None,
run_duration=None, state=running, executor=CeleryExecutor(parallelism=32),
executor_state=success, try_number=1, max_tries=0, pool=default_pool,
queue=default, priority_weight=13, operator=SparkKubernetesOperator,
queued_dttm=2026-03-26 02:47:58.142052+00:00, scheduled_dttm=2026-03-26
02:47:58.060945+00:00,queued_by_job_id=2414479, pid=1201129
[2026-03-26T02:48:12.148+0000] {scheduler_job_runner.py:811} INFO - Received
executor event with state success for task instance
TaskInstanceKey(dag_id='segment-groupid-1',
task_id='ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals',
run_id='scheduled__2026-03-26T00:15:00+00:00', try_number=2, map_index=-1)
[2026-03-26T02:48:12.159+0000] {scheduler_job_runner.py:853} INFO -
TaskInstance Finished: dag_id=segment-groupid-1,
task_id=ODS_Load_group_id_nerd_portals.load_track_group_id_nerd_portals,
run_id=scheduled__2026-03-26T00:15:00+00:00, map_index=-1,
run_start_date=2026-03-26 02:47:58.263013+00:00, run_end_date=2026-03-26
02:48:10.755668+00:00, run_duration=12.4927, state=failed,
executor=CeleryExecutor(parallelism=32), executor_state=success, try_number=2,
max_tries=0, pool=default_pool, queue=default, priority_weight=13,
operator=SparkKubernetesOperator, queued_dttm=2026-03-26 02:47:58.142052+00:00,
scheduled_dttm=2026-03-26 02:47:58.060945+00:00,queued_by_job_id=2414479,
pid=1201129
GitHub link:
https://github.com/apache/airflow/discussions/63298#discussioncomment-16461428
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]