Enrico Minack created SPARK-56238:
-------------------------------------
Summary: Spark app id of executor kubernetes pods differs from Spark driver
Key: SPARK-56238
URL: https://issues.apache.org/jira/browse/SPARK-56238
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 4.2.0
Reporter: Enrico Minack
Running the Spark Shell in a Kubernetes cluster creates executor pods whose
spark-app-selector label value differs from the driver's app id. This causes the
driver to kill the executors after some time.
Reproduction:
Create a minikube cluster:
{code:java}
minikube start
kubectl create serviceaccount spark-sa
kubectl create clusterrolebinding spark-role --clusterrole=edit \
  --serviceaccount=default:spark-sa --namespace=default
kubectl cluster-info
{code}
This outputs something like:
{code:java}
Kubernetes control plane is running at https://192.168.49.2:8443
{code}
Store this URL in a variable:
{code:java}
k8s_url="https://192.168.49.2:8443"
{code}
Build Spark binaries and Docker image, load image into minikube:
{code:java}
./dev/make-distribution.sh -Pkubernetes -Phadoop-cloud &&
  (cd dist && SPARK_HOME=$(pwd) ./bin/docker-image-tool.sh -t "latest" build) &&
  docker save spark:latest -o spark.tar &&
  minikube image load spark.tar
{code}
Run Spark Shell:
{code:java}
echo "spark.range(10).mapPartitions { it => Thread.sleep(60000); it }.collect" | \
  ./bin/spark-shell --master k8s://$k8s_url \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.kubernetes.executor.missingPodDetectDelta=1000
{code}
The driver says:
{code:java}
Spark context available as 'sc' (master = k8s://https://192.168.49.2:8443, app id = spark-c0f802279d8146c29b1ef3467b467590).
{code}
The executor pods say:
{code:java}
kubectl describe "pod/$(kubectl get pods | grep exec | grep Running | head -n 1 | cut -d " " -f 1)" | grep selec
spark-app-selector=spark-6c8d69be4f00410b884fd6e6417b872a
{code}
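For a side-by-side check, the driver pod carries the app id as its own spark-app-selector label. The commands below assume the standard spark-role and spark-app-selector labels that Spark on Kubernetes puts on driver and executor pods (the exact jsonpath is my own):

```shell
# List every Spark pod together with its role and spark-app-selector label
kubectl get pods -L spark-role,spark-app-selector

# Or pull just the driver's label to compare against the executors' value above
kubectl get pods -l spark-role=driver \
  -o jsonpath='{.items[0].metadata.labels.spark-app-selector}{"\n"}'
```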
This mismatch causes:
{code:java}
26/03/26 11:40:29 ERROR dispatcher-CoarseGrainedScheduler
org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 1 on 10.244.0.55:
The executor with ID 1 (registered at 1774521602640 ms) was not found in the
cluster at the polling time (1774521629051 ms) which is after the accepted
detect delta time (1000 ms) configured by
`spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been
deleted but the driver missed the deletion event. Marking this executor as
failed.
{code}
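The timestamps printed in the error make the arithmetic easy to verify (values copied from the log above; this snippet only reproduces that subtraction):

```shell
# Timestamps copied from the error message above (milliseconds)
registered_ms=1774521602640
polling_ms=1774521629051

# Gap between executor registration and the poll that missed the pod:
# roughly 26 seconds, far beyond the configured 1000 ms detect delta
echo $(( polling_ms - registered_ms ))
```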
--
This message was sent by Atlassian Jira
(v8.20.10#820010)