brunolnetto opened a new issue, #13460: URL: https://github.com/apache/iceberg/issues/13460
### Query engine

Spark

```
> pip show pyspark
Name: pyspark
Version: 4.0.0
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: d...@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /home/pingu/.local/lib/python3.12/site-packages
Requires: py4j
Required-by:
```

### Question

I am learning Spark through Zach Wilson's [free bootcamp](https://github.com/brunolnetto/data-engineer-handbook/tree/main/bootcamp/materials/3-spark-fundamentals), which covers basic Spark setup, operations, and features. For the third week's homework, we must bucketize a DataFrame and perform operations on it. However, I am stuck on the step that creates the bucketed tables used for performant join operations: every `saveAsTable` call fails with the error below. The full code follows after the error. What should I do to make this run?

```
Py4JJavaError: An error occurred while calling o595.saveAsTable.
: org.apache.spark.SparkException: Plugin class for catalog 'spark_catalog' does not implement CatalogPlugin: org.apache.spark.sql.hive.HiveSessionCatalog.
```

```python
from pyspark.sql import SparkSession
from pyspark import SparkConf
from pyspark.sql.functions import expr, col

conf = SparkConf()
conf.set("spark.driver.host", "spark-iceberg")  # Or the real IP
conf.set("spark.driver.bindAddress", "0.0.0.0")
conf.set("spark.driver.port", "4040")
conf.set("spark.blockManager.port", "4041")

spark = SparkSession.builder \
    .appName("Jupyter") \
    .master("local[*]") \
    .config("spark.sql.warehouse.dir", "/home/iceberg/warehouse") \
    .config(conf=conf) \
    .getOrCreate()

# Load data
match_details = spark.read.option("header", "true") \
    .csv("/home/iceberg/data/match_details.csv")
matches = spark.read.option("header", "true") \
    .csv("/home/iceberg/data/matches.csv")
medals = spark.read.option("header", "true") \
    .csv("/home/iceberg/data/medals.csv")
medals_matches_players = spark.read.option("header", "true") \
    .csv("/home/iceberg/data/medals_matches_players.csv")

# 1. Write as bucketed tables
match_details.write \
    .format("parquet") \
    .bucketBy(16, "match_id") \
    .sortBy("match_id") \
    .saveAsTable("bucketed_match_details")

matches.write \
    .format("parquet") \
    .bucketBy(16, "match_id") \
    .sortBy("match_id") \
    .saveAsTable("bucketed_matches")

medals_matches_players.write \
    .format("parquet") \
    .bucketBy(16, "match_id") \
    .sortBy("match_id") \
    .saveAsTable("bucketed_medals_matches_players")

# 2. Read bucketed tables
bucketed_match_details = spark.table("bucketed_match_details")
bucketed_matches = spark.table("bucketed_matches")
bucketed_medals_matches_players = spark.table("bucketed_medals_matches_players")

# 3. Join on the bucketing key
joined_df = bucketed_match_details \
    .join(bucketed_matches, on="match_id") \
    .join(bucketed_medals_matches_players, on="match_id")
```
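For context, my understanding from the Iceberg quickstart is that the `spark-iceberg` Docker image overrides the built-in `spark_catalog` with Iceberg's `SparkSessionCatalog`, while the error suggests the override is resolving to `HiveSessionCatalog`, which is not a `CatalogPlugin`. Below is a minimal sketch of the session-catalog configuration I would expect, with the property keys and class names taken from the Iceberg Spark documentation; whether the Iceberg runtime bundled with this image actually supports Spark 4.0 is an assumption on my part:

```python
from pyspark.sql import SparkSession

# Sketch only, not a verified fix. It assumes the iceberg-spark-runtime jar
# matching this Spark version is already on the classpath (as in the
# spark-iceberg quickstart image).
spark = SparkSession.builder \
    .appName("Jupyter") \
    .master("local[*]") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hive") \
    .config("spark.sql.warehouse.dir", "/home/iceberg/warehouse") \
    .getOrCreate()
```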
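Once the tables can be created, I plan to verify that bucketing actually removes the shuffle from the join. This is just my own sketch, using the table and column names from the code above:

```python
# Bucketed reads are controlled by this flag (true by default).
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

joined_df = spark.table("bucketed_match_details") \
    .join(spark.table("bucketed_matches"), on="match_id") \
    .join(spark.table("bucketed_medals_matches_players"), on="match_id")

# Expect a SortMergeJoin with no Exchange (shuffle) on the bucketed inputs.
joined_df.explain()
```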