mjf-89 commented on issue #5977: URL: https://github.com/apache/iceberg/issues/5977#issuecomment-1413328965
Hi, I don't have a deep understanding of PySpark internals, but I think you can write to a bucket-partitioned Iceberg table with the following approach:

```python
from pyspark.sql import functions as F

# Register the Iceberg UDF for the partition transform bucket(32, string_column)
jvm = spark.sparkContext._jvm
jvm.org.apache.iceberg.spark.IcebergSpark.registerBucketUDF(
    spark._jsparkSession,
    "iceberg_bucket_str_32",
    jvm.org.apache.spark.sql.types.DataTypes.StringType,
    32,
)

# Sort within tasks using the registered UDF, then write to the partitioned table
df.sortWithinPartitions(F.expr("iceberg_bucket_str_32(string_column)")) \
    .writeTo("iceberg_table") \
    .using("iceberg") \
    .partitionedBy(F.bucket(32, "string_column")) \
    .createOrReplace()
```

Another option, as far as I understand, is to skip the sort and use the fanout writer instead, which keeps an open file per partition and so does not require clustered input:

```python
df.writeTo("iceberg_table") \
    .using("iceberg") \
    .option("fanout-enabled", "true") \
    .partitionedBy(F.bucket(32, "string_column")) \
    .createOrReplace()
```
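For context on why sorting by the registered UDF clusters rows correctly: the Iceberg spec defines `bucket(N, col)` as `(murmur3_x86_32(utf8_bytes) & Integer.MAX_VALUE) % N`, i.e. a 32-bit Murmur3 hash of the value's UTF-8 bytes, masked to non-negative, modulo the bucket count. Below is a minimal pure-Python sketch of that computation (no Spark needed). The function names `murmur3_x86_32` and `bucket` are my own for illustration, not Iceberg API:

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 x86 32-bit, returned as a signed 32-bit int (like Java)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # Body: 4-byte little-endian blocks
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: remaining 1-3 bytes
    tail = data[n - n % 4:]
    k = 0
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization (fmix)
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    # Reinterpret as Java's signed 32-bit int
    return h - (1 << 32) if h >= (1 << 31) else h

def bucket(value: str, num_buckets: int) -> int:
    """Iceberg bucket transform for strings: hash UTF-8 bytes, mask sign bit, mod N."""
    return (murmur3_x86_32(value.encode("utf-8")) & 0x7FFFFFFF) % num_buckets
```

So `iceberg_bucket_str_32(string_column)` evaluates something like `bucket(value, 32)` for each row, which is why sorting on it groups rows by their target partition before the write. (The spec's Appendix B lists reference hash values if you want to validate an implementation against the real transform.)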