support-cloud opened a new issue, #9824:
URL: https://github.com/apache/iceberg/issues/9824

   ### Query engine
   
   Spark job submitted to YARN from a remote Jupyter notebook pod
   
   Spark version 3.4.1
   Iceberg version 1.4.3
   Hive version 3.1.0
   
   
   ### Question
   
   Hi, we are trying to read Iceberg Hive tables with Apache Spark from a Jupyter notebook pod running on Kubernetes.
   
   Spark is configured on YARN externally, and when we try to read the Iceberg Hive tables the job shows Failed when viewed in the YARN application logs.
   
   The Spark code we tried is as follows:
   
   import os
   import pyspark 
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import udf
   from pyspark.sql.types import FloatType,IntegerType,StructType,StructField
   from pyspark.sql import functions as f
   from pyspark.sql import Window
   
   
   # Session configuration
   spark = (
       SparkSession.builder
           .master("yarn")
           .appName("iceberg_test")
           # Resolve the Iceberg Spark runtime from Maven
           .config("spark.jars.packages",
                   "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
           # Extra jars; the comma-separated list must not contain stray spaces
           .config("spark.jars",
                   "/usr/hdp/3.1.4.0-315/spark3/jars/iceberg-spark-runtime-3.4_2.12-1.4.3.jar,"
                   "/usr/hdp/3.1.4.0-315/hive/lib/iceberg-hive-runtime-1.4.3.jar,"
                   "/usr/hdp/3.1.4.0-315/spark3/jars/hive-serde-2.3.9.jar")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.iceberg.spark.SparkSessionCatalog")
           .config("spark.sql.catalog.spark_catalog.type", "hive")
           .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.local.type", "hadoop")
           # $PWD is not expanded inside a Python string, so build the path explicitly
           .config("spark.sql.catalog.local.warehouse", os.path.join(os.getcwd(), "warehouse"))
           .config("iceberg.hive.engine.enabled", "true")
           .enableHiveSupport()
           .getOrCreate()
   )
   
   test_df = spark.sql("select * from icebergdb.default")
   
   test_df.show()
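   
   For reference, a couple of sanity checks we can run from the same session before the failing query (a sketch; the database name `icebergdb` and the table name `default` are taken from the query above):
   
   ```python
   # List the databases visible through the Hive-backed session catalog
   spark.sql("SHOW DATABASES").show()
   
   # List the tables in the database where the Iceberg table is expected to live
   spark.sql("SHOW TABLES IN icebergdb").show()
   
   # Read through the explicitly named session catalog to rule out catalog-resolution issues
   spark.sql("SELECT * FROM spark_catalog.icebergdb.default LIMIT 5").show()
   ```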
   
   The Spark job is submitted with master set to "yarn", so I have kept the relevant Iceberg runtime jars on the YARN cluster, and in the code I reference the jars located on the remote YARN cluster rather than on the Jupyter pod.
   
   Is this approach correct, or am I missing any points that need to be taken into consideration?
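   
   For comparison, here is a minimal variant we could also try that avoids referencing cluster-local jar paths altogether (a sketch only, assuming the notebook pod can reach Maven Central so that `spark.jars.packages` resolves the Iceberg runtime on the client and ships it to the YARN containers; without internet access, `spark.jars` could instead point at an `hdfs://` path or a `local:///` path that exists on every YARN node):
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = (
       SparkSession.builder
           .master("yarn")
           .appName("iceberg_test")
           # Iceberg runtime resolved from Maven and distributed to the YARN containers
           .config("spark.jars.packages",
                   "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
           # Iceberg-aware session catalog backed by the Hive metastore
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.iceberg.spark.SparkSessionCatalog")
           .config("spark.sql.catalog.spark_catalog.type", "hive")
           .enableHiveSupport()
           .getOrCreate()
   )
   
   spark.sql("select * from icebergdb.default").show()
   ```
   
   Our understanding from the Spark docs is that in client mode the plain filesystem paths given in `spark.jars` are resolved on the driver (the notebook pod) first and then distributed to the executors, which is why we are unsure whether pointing at paths that exist only on the cluster works.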
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

