[I] How to avoid performing partition key sorting when inserting into a partitioned Iceberg table? [iceberg]

via GitHub Thu, 18 Apr 2024 23:01:13 -0700


snfddl opened a new issue, #10181:
URL: https://github.com/apache/iceberg/issues/10181


   ### Query engine
   
   Spark 3.2
   
   ### Question
   
   1. Create a partitioned table
   ```
   create table temp.partition_table
   (
       dt string
      ,contents string
   )
   partitioned by spec (dt)
   stored as iceberg;
   ```
   
   2. Insert data for a single partition key value into the partitioned table
   
   ```
   insert into temp.partition_table
   select dt
            ,text as contents
     from temp.dataset
    where dt = '20240418'
   ```
    
   3. Physical plan
   ```
   AppendData org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$3485/49797107@3d2cac62, IcebergWrite(table=spark_catalog.temp.partition_table, format=PARQUET)
   +- AdaptiveSparkPlan isFinalPlan=false
      +- Sort [dt#107 ASC NULLS FIRST], false, 0   <<<<<<<<<
         +- Exchange hashpartitioning(dt#107, 200), REPARTITION_BY_NUM, [plan_id=175]
            +- Project [dt#107, ansi_cast(contents#106 as string) AS ctnt#110]
               +- FileScan parquet temp.dataset
   
   ```
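
   For anyone who wants to reproduce the plan above, a minimal sketch: prefixing the insert from step 2 with Spark SQL's `EXPLAIN` prints the physical plan without running the write. The table and column names are the ones used in this issue; the exact plan text may differ by Spark and Iceberg version.

   ```sql
   -- Show the physical plan for the insert without executing it.
   -- The Sort node on dt, if present, appears directly under the write node.
   EXPLAIN
   insert into temp.partition_table
   select dt
         ,text as contents
     from temp.dataset
    where dt = '20240418';
   ```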
   
   
   Here the partition key column `dt` has only one value, so sorting by the 
partition key right before the write seems unnecessary. However, a sort on the 
partition key appears to be added whenever data is inserted into a partitioned 
Iceberg table.
   Is there any way to avoid this?
   
   In Impala, the `/*+noclustered*/` hint avoids this unnecessary sort for the 
same kind of insert into a regular Parquet table, so I expected an equivalent 
option to exist here, but I couldn't find one.
   
   
   I also tried a static partition insert, but the plan was the same:
   (`insert into temp.partition_table partition(dt='20240406') select contents 
from temp.dataset where dt='20240406'`)
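
   One avenue that may be worth checking, offered only as a sketch: Iceberg documents the table properties `write.distribution-mode` and `write.spark.fanout.enabled`, which influence how Spark prepares data before a write. Whether setting them removes the Sort node on Spark 3.2 is an assumption, not something this issue confirms, and it may depend on the Iceberg version, so the physical plan should be inspected again afterwards. The fanout writer keeps a file open per partition and so uses more memory.

   ```sql
   -- Assumption: these settings may change the distribution and ordering Spark applies
   -- before the Iceberg write; verify the effect against the physical plan.
   ALTER TABLE temp.partition_table SET TBLPROPERTIES (
       'write.distribution-mode' = 'none',      -- do not shuffle rows by partition key
       'write.spark.fanout.enabled' = 'true'    -- writer that does not require clustered input
   );
   ```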
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


