rubanolha opened a new issue, #37041:
URL: https://github.com/apache/airflow/issues/37041

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.7.3
   
   ### What happened?
   
   Note: we use the KubernetesExecutor, and the task has retries=1.
   Several times, a task was terminated externally at the beginning of execution, marked as failed, and not retried, despite having retries set to 1. The UI displays only one attempt, but on further investigation I found discrepancies between the logs, the Airflow database, and the metrics.
   
   For this single attempt there are two records in the task_fail table and two in the airflow.scheduler.tasks.killed_externally metric.
   public.task_fail:
   id,task_id,dag_id,start_date,end_date,duration,map_index,run_id
   21007,xxx_task_id,,2024-01-26 09:08:19.937939,,-1,scheduled__2024-01-25T08:30:00+00:00
   21008,xxx_task_id,,2024-01-26 09:08:54.288874,,-1,scheduled__2024-01-25T08:30:00+00:00
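
   The two task_fail rows above can be checked mechanically. The following is a minimal sketch in Python, using only the rows as pasted in this report. The pasted rows carry seven fields against an eight-column header, so the sketch reads fields by position; treating the last field as run_id and the fourth as start_date is an assumption based on their values.

```python
import csv
from collections import Counter
from datetime import datetime
from io import StringIO

# Rows copied from the report. They have seven fields while the header
# lists eight columns, so fields are read by position, not by name.
ROWS = """\
21007,xxx_task_id,,2024-01-26 09:08:19.937939,,-1,scheduled__2024-01-25T08:30:00+00:00
21008,xxx_task_id,,2024-01-26 09:08:54.288874,,-1,scheduled__2024-01-25T08:30:00+00:00
"""

records = list(csv.reader(StringIO(ROWS)))

# Count failure rows per (task_id, run_id); run_id is assumed to be the last field.
failures = Counter((row[1], row[-1]) for row in records)

# start_date is assumed to be the fourth field of each pasted row.
starts = [datetime.strptime(row[3], "%Y-%m-%d %H:%M:%S.%f") for row in records]
gap_seconds = (starts[1] - starts[0]).total_seconds()

for (task_id, run_id), count in failures.items():
    print(f"{task_id} / {run_id}: {count} task_fail rows")
print(f"seconds between the two rows: {gap_seconds:.6f}")
```

   Both rows belong to the same task and the same run_id, which is what makes the single attempt shown in the UI look inconsistent.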
   
   Surprisingly, there are no records in the airflow.ti.finish metric.
   The task was submitted once to the Kubernetes pod, as confirmed by both 
Airflow logs and Kubernetes logs. However, the Airflow logs contain the message 
"Was the task killed externally?" appearing twice.
   
   The task also has an on_failure_callback. Its Slack message was received, which means the on_failure_callback was triggered.
   
   If the task is terminated after it has started writing logs, the logs contain "Task received SIGTERM signal" and "Marking task as UP_FOR_RETRY." In that case the termination is not tracked by the airflow.scheduler.tasks.killed_externally metric.
   
   Airflow is deployed on Kubernetes, with a single scheduler pod using default 
configurations. There are four processes still running in the scheduler 
identified as "python /home/airflow/.local/bin/airflow scheduler -n -1."
   
   Logs: [logs.csv](https://github.com/apache/airflow/files/14071250/logs.csv)
   
   
   ### What you think should happen instead?
   
   1. If a task is externally terminated, it should automatically retry based 
on the configured retry settings.
   2. Even if a task is killed with SIGTERM, it should still be tracked by the airflow.scheduler.tasks.killed_externally metric.
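
   The first expectation can be written down as a small rule. This is a minimal sketch in Python, assuming a task may run 1 + retries times in total. `Attempt` and `state_after_failure` are hypothetical names used only for illustration; they are not Airflow's API and do not reproduce its implementation.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Hypothetical stand-in for a task instance; not an Airflow class."""

    retries: int         # configured retries (1 in this report)
    try_number: int = 1  # 1-based number of the attempt that just failed


def state_after_failure(attempt: Attempt) -> str:
    """Return the state this report expects after a failed attempt."""
    # Assumed rule: the task may run 1 + retries times in total, so a
    # failure is retried while try_number has not exceeded retries.
    if attempt.try_number <= attempt.retries:
        return "up_for_retry"
    return "failed"


# With retries=1, the first externally killed attempt should be retried;
# only a failure of the second attempt should be final.
print(state_after_failure(Attempt(retries=1, try_number=1)))
print(state_after_failure(Attempt(retries=1, try_number=2)))
```

   Under this rule, the behavior described above (one attempt, final state failed, retries=1) is the case that should not occur.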
   
   ### How to reproduce
   
   1. Trigger a task run (reproducible both manually through the UI and on a schedule).
   2. Find the pod that was created for this task with `kubectl get pods --watch`.
   3. Immediately afterwards, kill the pod with `kubectl delete pod pod_name` (it must be killed before the task starts to print any logs).
   
   Worker pod status from `kubectl get pods --watch`:
   xxx_pod_name   0/1   ContainerCreating   0   0s
   xxx_pod_name   1/1   Running             0   1s
   xxx_pod_name   1/1   Terminating         0   6s
   
   The pod template must not configure a graceful termination period for the pod (terminationGracePeriodSeconds=0).
   
   
   ### Operating System
   
   PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes = "^7.13.0"
   airflow:2.7.3-python3.11
   k8s cluster version 1.28
   https://airflow-helm.github.io/charts v8.8.0
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Deployed with helm https://airflow-helm.github.io/charts v8.8.0
   
   ### Anything else?
   
   The logs are attached as a file in the description above.
   The task must be killed at the very beginning, before it starts to write any logs.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

