Enrico Minack created SPARK-56238:
-------------------------------------

             Summary: Spark app id of executor Kubernetes pods differs from the Spark driver's
                 Key: SPARK-56238
                 URL: https://issues.apache.org/jira/browse/SPARK-56238
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 4.2.0
            Reporter: Enrico Minack


Running the Spark Shell in a Kubernetes cluster creates executor pods whose 
{{spark-app-selector}} label differs from the driver's app id. This causes the 
driver to kill the executors after some time.

Reproduction:

Create a minikube cluster:
{code:bash}
minikube start
kubectl create serviceaccount spark-sa
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark-sa --namespace=default
kubectl cluster-info
{code}
This outputs something like:
{code}
Kubernetes control plane is running at https://192.168.49.2:8443
{code}
Store this URL in a shell variable:
{code:bash}
k8s_url="https://192.168.49.2:8443"
{code}
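Alternatively, the control-plane URL can be read from the active kubeconfig instead of copied by hand; a minimal sketch, assuming kubectl's current context points at the minikube cluster:

```shell
# Read the API server URL of the current context from the kubeconfig
# (assumes kubectl's current context is the minikube cluster created above).
k8s_url=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
echo "$k8s_url"
```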
Build Spark binaries and Docker image, load image into minikube:
{code:bash}
./dev/make-distribution.sh -Pkubernetes -Phadoop-cloud &&
  (cd dist && SPARK_HOME=$(pwd) ./bin/docker-image-tool.sh -t "latest" build) &&
  docker save spark:latest -o spark.tar &&
  minikube image load spark.tar
{code}
Run Spark Shell:
{code:bash}
echo "spark.range(10).mapPartitions { it => Thread.sleep(60000); it }.collect" |
  ./bin/spark-shell \
    --master k8s://$k8s_url \
    --conf spark.kubernetes.container.image=spark:latest \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
    --conf spark.kubernetes.executor.missingPodDetectDelta=1000
{code}
The driver says:
{code}
Spark context available as 'sc' (master = k8s://https://192.168.49.2:8443, app id = spark-c0f802279d8146c29b1ef3467b467590).
{code}
The executor pods, however, carry a different selector value:
{code:bash}
kubectl describe "pod/$(kubectl get pods | grep exec | grep Running | head -n 1 | cut -d " " -f 1)" | grep selec
spark-app-selector=spark-6c8d69be4f00410b884fd6e6417b872a
{code}
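The mismatch can also be confirmed by comparing the label on the driver pod against an executor pod directly; a hedged sketch that relies on the standard {{spark-role}} pod labels Spark on Kubernetes applies:

```shell
# Compare the spark-app-selector label of the driver pod with that of the
# first executor pod (relies on the spark-role labels Spark sets on its pods).
driver_id=$(kubectl get pods -l spark-role=driver \
  -o jsonpath='{.items[0].metadata.labels.spark-app-selector}')
exec_id=$(kubectl get pods -l spark-role=executor \
  -o jsonpath='{.items[0].metadata.labels.spark-app-selector}')
if [ "$driver_id" != "$exec_id" ]; then
  echo "mismatch: driver=$driver_id executor=$exec_id"
fi
```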
This mismatch causes the driver to mark the executors as failed:
{code}
26/03/26 11:40:29 ERROR dispatcher-CoarseGrainedScheduler org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 1 on 10.244.0.55: The executor with ID 1 (registered at 1774521602640 ms) was not found in the cluster at the polling time (1774521629051 ms) which is after the accepted detect delta time (1000 ms) configured by `spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been deleted but the driver missed the deletion event. Marking this executor as failed.
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
