[
https://issues.apache.org/jira/browse/AIRFLOW-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157168#comment-17157168
]
Kiruthiga commented on AIRFLOW-6014:
------------------------------------
I am facing the same issue.
The task pods that are preempted by Kubernetes to accommodate critical system
pods are marked as "queued" or "failed" in Airflow. I am concentrating on the
queued tasks, as I am not sure why a task fails on preemption.
In my case, the Kubernetes scheduler marks the task (preempted pod) for the
"up_for_reschedule" state, but this is not reflected in the Airflow
database/webserver UI.
Attaching the screenshots for reference.
*Kubernetes Scheduler Log*
!image-2020-07-14-11-27-21-277.png!
*Airflow Webserver* - the task *sleep_for_1* is still in the queued state
(the expected state is "up_for_reschedule")
!image-2020-07-14-11-29-14-334.png!
I have started debugging the Airflow code. The log mentioned above (screenshot
1, from the Kubernetes scheduler) comes from the airflow/jobs/*scheduler_job*.py
file, method *_process_executor_events*. I suspect the state
*State.UP_FOR_RESCHEDULE* is not handled in this method.
Please correct me if my understanding is wrong, and help me fix this issue.
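To illustrate the suspicion above, here is a minimal, hypothetical sketch of the kind of state handling that appears to be missing. The function name and the plain-string states are illustrative only and are not Airflow's actual code; the point is only that an executor event whose state has no matching branch falls through and leaves the task queued:

```python
# Hypothetical, simplified sketch -- NOT Airflow's actual implementation.
# It models how _process_executor_events might map an executor-reported
# state onto the task instance's next state.

def handle_executor_event(current_task_state: str, executor_state: str) -> str:
    """Return the task's next state given an executor event.

    If only "success" and "failed" events are handled, a preempted pod
    reported as "up_for_reschedule" falls through the branches below and
    the task keeps its current ("queued") state forever -- matching the
    behaviour described in this comment.
    """
    if executor_state == "success":
        return "success"
    if executor_state == "failed":
        return "failed"
    if executor_state == "up_for_reschedule":
        # The suspected missing branch: without it, execution reaches the
        # fallthrough below and the task stays queued.
        return "up_for_reschedule"
    # Unhandled event: the task keeps whatever state it already had.
    return current_task_state
```

With the "up_for_reschedule" branch removed, `handle_executor_event("queued", "up_for_reschedule")` would return `"queued"`, which is exactly the stuck state shown in the webserver screenshot.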
> Kubernetes executor - handle preempted deleted pods - queued tasks
> ------------------------------------------------------------------
>
> Key: AIRFLOW-6014
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6014
> Project: Apache Airflow
> Issue Type: Improvement
> Components: executor-kubernetes
> Affects Versions: 1.10.6
> Reporter: afusr
> Assignee: Daniel Imberman
> Priority: Minor
> Fix For: 1.10.10
>
> Attachments: image-2020-07-14-11-27-21-277.png,
> image-2020-07-14-11-29-14-334.png
>
>
> We have encountered an issue whereby, when using the Kubernetes executor with
> autoscaling, Airflow pods are preempted and Airflow never attempts to rerun
> them.
> This is partly a result of having the following set on the pod spec:
> restartPolicy: Never
> This makes sense: if a pod fails while running a task, we don't want
> Kubernetes to retry it, as retries should be controlled by Airflow.
> What we believe happens is that when a new node is added by autoscaling,
> Kubernetes schedules a number of Airflow pods onto the new node, as well as
> any pods required by kube-system/DaemonSets. As these are higher priority,
> the Airflow pods are preempted and deleted. You see messages such as:
>
> Preempted by kube-system/ip-masq-agent-xz77q on node
> gke-some--airflow-00000000-node-1ltl
>
> Within the Kubernetes executor, these pods end up with a status of Pending,
> and a deleted event is received but not handled.
> The end result is that tasks remain in a queued state forever.
>
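For context, the pod-spec setting referenced in the description looks roughly like the following. This is a minimal, illustrative fragment, not a spec taken from this issue; the pod name is a placeholder, and the image tag is assumed from the "Affects Versions" field above:

```yaml
# Minimal illustrative Airflow worker pod fragment (placeholder names).
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-example   # placeholder name
spec:
  restartPolicy: Never           # Airflow, not Kubernetes, should decide retries
  containers:
    - name: base
      image: apache/airflow:1.10.6   # assumed from "Affects Versions" above
```

With restartPolicy: Never, a preempted (deleted) pod is simply gone; unless the executor handles the deletion event, the corresponding task stays queued.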
--
This message was sent by Atlassian Jira
(v8.3.4#803005)