ammarchalifah opened a new issue, #14705: URL: https://github.com/apache/iceberg/issues/14705
### Query engine

Spark on EMR (PySpark), Spark 3.5.7, EMR 7.8

### Question

I'm having difficulty optimizing a pipeline that writes into an Iceberg table in full partition overwrite mode:

```
INSERT OVERWRITE catalog.db.table PARTITION (part_key)
SELECT * FROM df
```

In the Spark plan, the input `df` is not partitioned/distributed by `part_key` before the `INSERT` is executed, so Spark inserts an additional `Exchange` on `part_key` (with the `REPARTITION_BY_COL` attribute) just before the write step. At our data volume, this `Exchange` kills the job's performance. The target table has a minimum byte size and a hash-writing strategy configured via `TBLPROPERTIES`.

My goal is to achieve two objectives: (a) perform a proper insert overwrite at the partition level, fully overwriting the affected partitions, and (b) avoid the extra shuffle at all costs. I don't mind having small files or unsorted data in the output; we have a separate compaction job running.

Are these objectives achievable?
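For context, here is a minimal sketch of the kind of table configuration described above. The exact properties on the table are not shown in this issue; `write.distribution-mode` and `write.target-file-size-bytes` are assumed here as the standard Iceberg property names for the hash-writing strategy and the byte-size setting, and the values are illustrative only:

```
-- Hypothetical TBLPROPERTIES matching the described setup (assumed names/values)
ALTER TABLE catalog.db.table SET TBLPROPERTIES (
  'write.distribution-mode'      = 'hash',      -- hash write distribution, the usual cause of the REPARTITION_BY_COL Exchange
  'write.target-file-size-bytes' = '536870912'  -- example target file size (512 MB)
);
```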
