johnnywalker opened a new issue, #10051:
URL: https://github.com/apache/iceberg/issues/10051

   ### Feature Request / Improvement
   
   While working with Iceberg on a local Spark cluster, I repeatedly encountered heap size errors when using CTAS/RTAS with `PARTITIONED BY`. These errors baffled me for a bit until I understood that Adaptive Query Execution (AQE) was drastically reducing the shuffle partition count, so each task processed far more data and exhausted executor memory. I pored over the documentation, but I only found the root cause after digging into the source code: Iceberg calculates and supplies an advisory partition size to Spark, and Spark prefers this value over the configured default.
   
   [Current documentation](https://iceberg.apache.org/docs/1.5.0/spark-writes/#controlling-file-sizes) explains how Spark AQE will coalesce and split partitions according to the advisory partition size configured by `spark.sql.adaptive.advisoryPartitionSizeInBytes`. However, the documentation does not mention [Iceberg's advisory partition size configuration](https://github.com/apache/iceberg/blob/81b62c78e0c230516090becda7d6040ee03e6a91/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java#L688), nor does it mention that Iceberg's value [overrides the Spark configuration](https://github.com/apache/spark/blob/8bcbf7701388a2da06369ae9317d7707624edba0/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala#L129). By default, Iceberg raises the advisory partition size from Spark's 64 MB to 384 MB, which exhausted the 4 GB executor heap in my local cluster.
   
   As an example, I've added the following to my `spark-defaults.conf` to set the session configuration:
   
   ```
   # reduce Iceberg's default advisory partition size (384m) to prevent heap exhaustion
   spark.sql.iceberg.advisory-partition-size=67108864
   ```
   
   Alternatively, I've had success setting the table property instead:
   
   ```sql
   CREATE TABLE db.table
   USING iceberg
   PARTITIONED BY (days(trandate))
   TBLPROPERTIES ('write.spark.advisory-partition-size-bytes'='33554432')
   AS
   SELECT *
     FROM landing.table;
   ```
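   
   For a table that already exists (for example, before an RTAS), the same property can presumably be applied after the fact with `ALTER TABLE` (an untested sketch using the same 32 MB value as above):
   
   ```sql
   -- set the write property on an existing Iceberg table so that
   -- subsequent writes use a 32 MB advisory partition size
   ALTER TABLE db.table
   SET TBLPROPERTIES ('write.spark.advisory-partition-size-bytes'='33554432');
   ```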
   
   ### Query engine
   
   Spark

