This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new b77caf776368 [SPARK-48651][DOC] Configuring different JDK for Spark on YARN
b77caf776368 is described below

commit b77caf776368154096442965b4a885c4a702d27f
Author: Cheng Pan <[email protected]>
AuthorDate: Wed Jun 19 10:41:06 2024 +0800

    [SPARK-48651][DOC] Configuring different JDK for Spark on YARN
    
    ### What changes were proposed in this pull request?
    
    This PR updates the Spark on YARN docs to guide users in configuring a different JDK for Spark applications.
    
    ### Why are the changes needed?
    
    As of today, the latest Apache Hadoop release (3.4.0) does not yet support Java 17, while Spark 4.0.0 requires at least Java 17, so users who want to use Spark on YARN must configure a different JDK for the Spark applications that run on YARN.
    
    This was also asked on the mailing list:
    https://lists.apache.org/thread/ply807h0hht1h8o7x7g1s3j51mnot5dr
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it changes the user docs.
    
    ### How was this patch tested?
    
    I verified the command in a YARN cluster.
    
    The following command submits a Spark application with the distributed JDK 21:
    ```
    JAVA_HOME=/opt/openjdk-21 spark-submit \
      --master=yarn \
      --deploy-mode=cluster \
      --archives ./openjdk-21.tar.gz \
      --conf spark.yarn.appMasterEnv.JAVA_HOME=./openjdk-21.tar.gz/openjdk-21 \
      --conf spark.executorEnv.JAVA_HOME=./openjdk-21.tar.gz/openjdk-21 \
      --class org.apache.spark.examples.SparkPi \
      spark-examples*.jar 1
    ```
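
    For reference, the `JAVA_HOME` values above rely on the archive extracting into a top-level `openjdk-21/` directory. A minimal sketch of preparing and checking such a tarball (the `/opt/openjdk-21` install path is an assumption for illustration):

    ```
    # Package a local JDK 21 install so it extracts as openjdk-21/...
    tar -czf openjdk-21.tar.gz -C /opt openjdk-21
    # Sanity-check the layout that the JAVA_HOME settings point into
    tar -tzf openjdk-21.tar.gz | head
    ```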
    
    <img width="1678" alt="image" 
src="https://github.com/apache/spark/assets/26535726/363423a9-bbdf-460d-b6e4-72ab5d6a2e53";>
    
    <img width="1313" alt="image" 
src="https://github.com/apache/spark/assets/26535726/dd8dc1b1-bbe4-41cd-9e19-c8ed68b09f82";>
    
    <img width="1399" alt="image" 
src="https://github.com/apache/spark/assets/26535726/5bbebde6-dfbd-437f-8a44-7c23170911ac";>
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #47010 from pan3793/SPARK-48651.
    
    Lead-authored-by: Cheng Pan <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 docs/running-on-yarn.md | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index aab8ee60a256..700ddefabea4 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -33,6 +33,9 @@ Please see [Spark Security](security.html) and the specific security sections in
 
 # Launching Spark on YARN
 
+As of 3.4.0, Apache Hadoop does not support Java 17, while Apache Spark requires at least Java 17 since 4.0.0, so a different JDK should be configured for Spark applications.
+Please refer to [Configuring different JDKs for Spark Applications](#configuring-different-jdks-for-spark-applications) for details.
+
 Ensure that `HADOOP_CONF_DIR` or `YARN_CONF_DIR` points to the directory which contains the (client side) configuration files for the Hadoop cluster.
 These configs are used to write to HDFS and connect to the YARN ResourceManager. The
 configuration contained in this directory will be distributed to the YARN cluster so that all
@@ -1032,3 +1035,49 @@ and one should be configured with:
   spark.shuffle.service.name = spark_shuffle_y
   spark.shuffle.service.port = <other value>
 ```
+
+# Configuring different JDKs for Spark Applications
+
+In some cases, it may be desirable to run Spark applications with a different JDK than the one used by the YARN NodeManager.
+This can be achieved by setting the `JAVA_HOME` environment variable for the YARN containers and for the `spark-submit`
+process.
+
+Note that Spark assumes all JVM processes running in one application use the same JDK version; otherwise,
+you may encounter serialization issues between the different JDK versions.
+
+To configure a Spark application to use a JDK that has been pre-installed on all nodes at `/opt/openjdk-17`:
+
+    $ export JAVA_HOME=/opt/openjdk-17
+    $ ./bin/spark-submit --class path.to.your.Class \
+        --master yarn \
+        --conf spark.yarn.appMasterEnv.JAVA_HOME=/opt/openjdk-17 \
+        --conf spark.executorEnv.JAVA_HOME=/opt/openjdk-17 \
+        <app jar> [app options]
+
+Alternatively, the user may want to avoid installing a different JDK on the YARN cluster nodes; in that case,
+it is also possible to distribute the JDK using YARN's Distributed Cache. For example, to run a Spark
+application with Java 21, prepare a JDK 21 tarball `openjdk-21.tar.gz` and untar it to `/opt` on the local node,
+then submit the Spark application:
+
+    $ export JAVA_HOME=/opt/openjdk-21
+    $ ./bin/spark-submit --class path.to.your.Class \
+        --master yarn \
+        --archives path/to/openjdk-21.tar.gz \
+        --conf spark.yarn.appMasterEnv.JAVA_HOME=./openjdk-21.tar.gz/openjdk-21 \
+        --conf spark.executorEnv.JAVA_HOME=./openjdk-21.tar.gz/openjdk-21 \
+        <app jar> [app options]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
