Hello, I have a 3-node Hadoop cluster - one master & 2 slaves. I want to integrate Spark with this Hadoop setup so that Spark uses YARN for job scheduling & execution.
Hadoop version : 2.7.3
Spark version : 2.1.0

I have read various documentation & blog posts, and my understanding so far is:

1. I need to install Spark (download the tar.gz file & extract it) on all 3 nodes.

2. On the master node, update spark-env.sh as follows:

   SPARK_DAEMON_JAVA_OPTS=-Dspark.driver.port=53411
   HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.3/etc/hadoop
   SPARK_MASTER_HOST=192.168.10.44

3. On the master node, update the slaves file to list the IPs of the 2 slave nodes.

4. On the master node, update spark-defaults.conf as follows:

   spark.master       spark://192.168.10.44:7077
   spark.serializer   org.apache.spark.serializer.KryoSerializer

5. Repeat steps 2 - 4 on the slave nodes as well.

6. HDFS & YARN services are already running.

7. Directly use spark-submit to submit jobs to YARN; the command to use is:

   $ ./spark-submit --master yarn --deploy-mode cluster \
       --conf spark.eventLog.enabled=true \
       --conf spark.eventLog.dir=file:///tmp/spark-events \
       --class org.sparkexample.WordCountTask \
       /home/hadoop/first-example-1.0-SNAPSHOT.jar a /user/hadoop

   (A rough sketch of the WordCountTask class is included at the end of this post for reference.)

Please let me know whether this is correct. Am I missing something?

Thanks,
Bhushan Pathak
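For reference, the WordCountTask class I am submitting looks roughly like the sketch below. This is only a minimal Java word count - the actual class inside first-example-1.0-SNAPSHOT.jar may handle its arguments differently; here I simply treat the first argument as the input path.

package org.sparkexample;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountTask {
    public static void main(String[] args) {
        // Input path comes from the arguments passed after the jar in spark-submit
        String inputPath = args[0];

        // No setMaster() here - the master is supplied by spark-submit (--master yarn)
        SparkConf conf = new SparkConf().setAppName("WordCountTask");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(inputPath);

            // Classic word count: split lines into words, emit (word, 1), sum per word
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);

            // Print results on the driver; a real job would more likely write to HDFS
            for (Tuple2<String, Integer> t : counts.collect()) {
                System.out.println(t._1() + ": " + t._2());
            }
        }
    }
}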

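After submitting, I plan to confirm the job actually went through YARN (rather than Spark standalone) with the standard YARN commands, which I assume apply unchanged to my setup:

   $ yarn application -list
   $ yarn logs -applicationId <application id>

and by checking the ResourceManager web UI, which by default runs on port 8088 of the master node.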