ammarchalifah opened a new issue, #14705: URL: https://github.com/apache/iceberg/issues/14705
### Query engine

Spark on EMR (PySpark), Spark 3.5.7, EMR 7.8

### Question

I'm having difficulty optimizing a pipeline that writes into an Iceberg table in full partition overwrite mode:

```
INSERT OVERWRITE catalog.db.table PARTITION (part_key)
SELECT * FROM df
```

In the Spark plan, the input `df` is not partitioned/distributed by `part_key` before the `INSERT` is executed, so Spark inserts an additional `Exchange` on `part_key` (with the `REPARTITION_BY_COL` attribute) just before the write step. At our data volume, this `Exchange` kills the job's performance. The target table has a minimum byte size and a hash-writing strategy configured via `TBLPROPERTIES`.

My goal is to achieve two objectives: (a) perform a proper insert overwrite at the partition level, fully overwriting the affected partitions, and (b) avoid the extra shuffle at all costs. I don't mind having small files or unsorted data in the output; we have a separate compaction job running.

Are these objectives achievable?
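For context, here is a minimal sketch of the kind of table configuration described above. The exact properties on the table are not shown in this issue; `write.distribution-mode` and `write.target-file-size-bytes` are assumed here as the standard Iceberg property names for the hash-writing strategy and the byte-size setting, and the values are illustrative only:

```
-- Hypothetical TBLPROPERTIES matching the described setup (assumed names/values)
ALTER TABLE catalog.db.table SET TBLPROPERTIES (
  'write.distribution-mode'      = 'hash',      -- hash write distribution, the usual cause of the REPARTITION_BY_COL Exchange
  'write.target-file-size-bytes' = '536870912'  -- example target file size (512 MB)
);
```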
