mjf-89 commented on issue #5977: URL: https://github.com/apache/iceberg/issues/5977#issuecomment-1413328965
Hi, I don't have a deep understanding of PySpark internals, but I think you can write to a bucket-partitioned Iceberg table with the following approach:

```python
from pyspark.sql import functions as F

# Register the Iceberg UDF for the partition transform bucket(32, string_column)
jvm = spark.sparkContext._jvm
jvm.org.apache.iceberg.spark.IcebergSpark.registerBucketUDF(
    spark._jsparkSession,
    "iceberg_bucket_str_32",
    jvm.org.apache.spark.sql.types.DataTypes.StringType,
    32,
)

# Sort within tasks using the registered UDF, then write to the partitioned table
df.sortWithinPartitions(F.expr("iceberg_bucket_str_32(string_column)")) \
    .writeTo("iceberg_table") \
    .using("iceberg") \
    .partitionedBy(F.bucket(32, "string_column")) \
    .createOrReplace()
```

Another option, as far as I understand, is to skip the sort and use the fanout writer instead, which keeps an open file per partition and so does not require clustered input:

```python
df.writeTo("iceberg_table") \
    .using("iceberg") \
    .option("fanout-enabled", "true") \
    .partitionedBy(F.bucket(32, "string_column")) \
    .createOrReplace()
```
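For context on why sorting by the registered UDF clusters rows correctly: the Iceberg spec defines `bucket(N, col)` as `(murmur3_x86_32(utf8_bytes) & Integer.MAX_VALUE) % N`, i.e. a 32-bit Murmur3 hash of the value's UTF-8 bytes, masked to non-negative, modulo the bucket count. Below is a minimal pure-Python sketch of that computation (no Spark needed). The function names `murmur3_x86_32` and `bucket` are my own for illustration, not Iceberg API:

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 x86 32-bit, returned as a signed 32-bit int (like Java)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # Body: 4-byte little-endian blocks
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: remaining 1-3 bytes
    tail = data[n - n % 4:]
    k = 0
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization (fmix)
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    # Reinterpret as Java's signed 32-bit int
    return h - (1 << 32) if h >= (1 << 31) else h

def bucket(value: str, num_buckets: int) -> int:
    """Iceberg bucket transform for strings: hash UTF-8 bytes, mask sign bit, mod N."""
    return (murmur3_x86_32(value.encode("utf-8")) & 0x7FFFFFFF) % num_buckets
```

So `iceberg_bucket_str_32(string_column)` evaluates something like `bucket(value, 32)` for each row, which is why sorting on it groups rows by their target partition before the write. (The spec's Appendix B lists reference hash values if you want to validate an implementation against the real transform.)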