[ 
https://issues.apache.org/jira/browse/SPARK-56238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-56238:
----------------------------------
    Description: 
Running the Spark Shell against a Kubernetes cluster creates executor pods whose 
{{spark-app-selector}} label value differs from the driver's app id. This causes 
the driver to kill the executors after some time.

Reproduction:

Create a minikube cluster:
{code:java}
minikube start
kubectl create serviceaccount spark-sa
kubectl create clusterrolebinding spark-role --clusterrole=edit 
--serviceaccount=default:spark-sa --namespace=default
kubectl cluster-info
{code}
This outputs something like:
{code:java}
Kubernetes control plane is running at https://192.168.49.2:8443
{code}
Store this URL in a shell variable:
{code:java}
k8s_url="https://192.168.49.2:8443"
{code}
Build the Spark binaries and Docker image, then load the image into minikube:
{code:java}
./dev/make-distribution.sh -Pkubernetes -Phadoop-cloud && (cd dist && 
SPARK_HOME=$(pwd) ./bin/docker-image-tool.sh -t "latest" build) && docker save 
spark:latest -o spark.tar && minikube image load spark.tar
{code}
Run Spark Shell:
{code:java}
echo "spark.range(10).mapPartitions { it => Thread.sleep(60000); it }.collect" 
| ./bin/spark-shell     --master k8s://$k8s_url     --conf 
spark.kubernetes.container.image=spark:latest     --conf 
spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa     --conf 
spark.kubernetes.executor.missingPodDetectDelta=1000
{code}
The driver says:
{code:java}
Spark context available as 'sc' (master = k8s://https://192.168.49.2:8443, app 
id = spark-c0f802279d8146c29b1ef3467b467590).
{code}
The executor pods say:
{code:java}
kubectl describe "pod/$(kubectl get pods | grep exec | grep Running | head -n 1 
| cut -d " " -f 1)" | grep selec
spark-app-selector=spark-6c8d69be4f00410b884fd6e6417b872a
{code}
This mismatch causes:
{code:java}
26/03/26 11:40:29 ERROR dispatcher-CoarseGrainedScheduler 
org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 1 on 10.244.0.55: 
The executor with ID 1 (registered at 1774521602640 ms) was not found in the 
cluster at the polling time (1774521629051 ms) which is after the accepted 
detect delta time (1000 ms) configured by 
`spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been 
deleted but the driver missed the deletion event. Marking this executor as 
failed.
{code}
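For a side-by-side check, the two ids can be extracted and compared with a small shell sketch. The sample lines are copied verbatim from the output above; against a live cluster they would come from the driver log and the executor pod's labels instead:

```shell
# Sample lines copied from the driver banner and the executor pod label above.
driver_line="Spark context available as 'sc' (master = k8s://https://192.168.49.2:8443, app id = spark-c0f802279d8146c29b1ef3467b467590)."
executor_label="spark-app-selector=spark-6c8d69be4f00410b884fd6e6417b872a"

# Pull the app id out of each line.
driver_app_id=$(printf '%s\n' "$driver_line" | sed -n 's/.*app id = \([^)]*\)).*/\1/p')
executor_app_id=${executor_label#spark-app-selector=}

echo "driver:   $driver_app_id"
echo "executor: $executor_app_id"
if [ "$driver_app_id" != "$executor_app_id" ]; then
  echo "MISMATCH: executor pods carry a different spark-app-selector than the driver"
fi
```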

This looks related: SPARK-25922


> Spark app id of executor kubernetes pods differ from Spark driver
> -----------------------------------------------------------------
>
>                 Key: SPARK-56238
>                 URL: https://issues.apache.org/jira/browse/SPARK-56238
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 4.2.0
>            Reporter: Enrico Minack
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
