brunolnetto opened a new issue, #13460:
URL: https://github.com/apache/iceberg/issues/13460

   ### Query engine
   
   Spark
   
   ```
   > pip show pyspark
   Name: pyspark
   Version: 4.0.0
   Summary: Apache Spark Python API
   Home-page: https://github.com/apache/spark/tree/master/python
   Author: Spark Developers
   Author-email: d...@spark.apache.org
   License: http://www.apache.org/licenses/LICENSE-2.0
   Location: /home/pingu/.local/lib/python3.12/site-packages
   Requires: py4j
   Required-by:
   ```
   
   ### Question
   
   I am learning Spark through Zach Wilson's [free bootcamp](https://github.com/brunolnetto/data-engineer-handbook/tree/main/bootcamp/materials/3-spark-fundamentals), which covers basic Spark setup, operations, and features. For the week-3 homework, we must bucketize a DataFrame and perform operations on it, but I am stuck on the step that creates the bucketed tables for performant joins. The error and my code are below. What should I do to make this run?
   
   ```
   Py4JJavaError: An error occurred while calling o595.saveAsTable.
   : org.apache.spark.SparkException: Plugin class for catalog 'spark_catalog' does not implement CatalogPlugin: org.apache.spark.sql.hive.HiveSessionCatalog.
   ```
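
   For context, from the Iceberg Spark configuration docs my understanding is that the built-in `spark_catalog` can be replaced with Iceberg's session catalog via properties like the following (I am not sure whether the quickstart image I am using sets these, so this is just my reading of the docs, not something I have confirmed fixes the error):
   
   ```
   spark.sql.extensions                  org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
   spark.sql.catalog.spark_catalog       org.apache.iceberg.spark.SparkSessionCatalog
   spark.sql.catalog.spark_catalog.type  hive
   ```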
   
   ```
   from pyspark.sql import SparkSession
   from pyspark import SparkConf
   from pyspark.sql.functions import expr, col
   
   conf = SparkConf()
    conf.set("spark.driver.host", "spark-iceberg")  # Or the real IP
   conf.set("spark.driver.bindAddress", "0.0.0.0")
   conf.set("spark.driver.port", "4040")
   conf.set("spark.blockManager.port", "4041")
    
   spark = SparkSession.builder \
       .appName("Jupyter") \
       .master("local[*]") \
       .config("spark.sql.warehouse.dir", "/home/iceberg/warehouse") \
       .getOrCreate()
   
   # Load data
   match_details = spark.read.option("header", "true") \
       .csv("/home/iceberg/data/match_details.csv")
   matches = spark.read.option("header","true") \
       .csv("/home/iceberg/data/matches.csv")
   medals = spark.read.option("header","true") \
       .csv("/home/iceberg/data/medals.csv")
   medals_matches_players = spark.read.option("header","true") \
       .csv("/home/iceberg/data/medals_matches_players.csv")
   
   # 1. Write as bucketed tables
   match_details.write \
       .format('parquet') \
       .bucketBy(16, "match_id") \
       .sortBy("match_id") \
       .saveAsTable("bucketed_match_details")
   
   matches.write \
       .format('parquet') \
       .bucketBy(16, "match_id") \
       .sortBy("match_id") \
       .saveAsTable("bucketed_matches")
   
   medals_matches_players.write \
       .format('parquet') \
       .bucketBy(16, "match_id") \
       .sortBy("match_id") \
       .saveAsTable("bucketed_medals_matches_players")
   
   # 2. Read bucketed tables
   bucketed_match_details = spark.table("bucketed_match_details")
   bucketed_matches = spark.table("bucketed_matches")
    bucketed_medals_matches_players = spark.table("bucketed_medals_matches_players")
   
   # 3. Join
   joined_df = bucketed_match_details \
       .join(bucketed_matches, on="match_id") \
       .join(bucketed_medals_matches_players, on="match_id")
   ``` 
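
   For what it's worth, my understanding of why bucketing helps the join is sketched below in plain Python. This uses Python's built-in `hash` purely as an illustration; Spark's `bucketBy` actually hashes with Murmur3, so the bucket assignments themselves would differ:

```python
# Simplified sketch (my understanding) of the idea behind bucketed joins:
# both tables are hash-partitioned on the join key into the same number of
# buckets, so matching keys land in the same bucket on both sides and the
# join can run bucket-by-bucket without a shuffle.
# NOTE: Python's built-in hash() is only a stand-in for Spark's Murmur3.

NUM_BUCKETS = 16

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Assign a key to a bucket by hashing it modulo the bucket count."""
    return hash(key) % num_buckets

# match_id values from two tables bucketed the same way:
left_ids = {"m1", "m2", "m3"}
right_ids = {"m2", "m3", "m4"}

# Keys present on both sides always map to the same bucket on both sides.
for match_id in left_ids & right_ids:
    assert bucket_for(match_id) == bucket_for(match_id)
```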


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

