[I] [Consult] planTask takes a lot of time, consult for how to accelerate this [iceberg]

via GitHub Fri, 18 Apr 2025 19:02:14 -0700


littleDrew opened a new issue, #12845:
URL: https://github.com/apache/iceberg/issues/12845


   ### Query engine
   
   #### I write and read an Iceberg table with Spark, mainly performing the 
following operations
   - Insert data with a MERGE INTO statement, with `write.merge.mode='merge-on-read'`. 
This operation generates both data files and delete files; the updated row count 
is almost 20,000,000.
   - Select data with `select * from table`. This takes **more than 420 
seconds** in planInputPartitions (most of it spent in planTask). I added logging 
to the planInputPartitions method; the logged output is shown below, and the 
corresponding code is at: 
https://github.com/apache/iceberg/blob/0.13.x/spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L149
   ```
   2025-04-18 14:32:44, 865 |INFO | [Parallel-Tasks-Thread-1] Time cost for 
planInputPartitions is 421444(ms).| 
org.apache.iceberg.spark.source.SparkBatchScan.planInputPartitions(SparkBatchScan.java:167
   ```
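
For reference, the timing log above can be produced by something like the following minimal sketch. It is not the actual SparkBatchScan code: `TimedPlanning`, `timed`, and the stand-in workload are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.function.Supplier;

public class TimedPlanning {
    // Runs the supplied step and logs its elapsed wall-clock time in milliseconds.
    // Hypothetical helper; the real method was instrumented inline.
    static <T> T timed(String label, Supplier<T> step) {
        long start = System.nanoTime();
        T result = step.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Time cost for " + label + " is " + elapsedMs + "(ms).");
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the real planning call; returns a fixed list of partitions.
        List<String> partitions = timed("planInputPartitions", () -> List.of("p0", "p1"));
        System.out.println(partitions.size() + " partitions");
    }
}
```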
   
   #### I would like advice on how to accelerate the planTask/planInputPartitions 
part in the driver
   - For example, how can I use multiple threads to plan tasks in parallel? 
This article mentions it: https://zhuanlan.zhihu.com/p/578466765
   ```
   
   ⑤ Multi-threaded Plan Task; concurrent or distributed file deletion
   
   In early versions of Iceberg, plan task was single-threaded. When a table is 
very large and has a very large number of files, performance drops sharply. 
File deletion had the same problem, so we changed both to concurrent or 
distributed implementations.
   ```
   - Or are there other ways to accelerate this part?
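
To make the multi-threaded idea concrete, here is a rough sketch of planning per-manifest work on a thread pool and merging the results. It is plain Java and does not use Iceberg's API: `ParallelPlanSketch`, `planManifest`, and `planAll` are hypothetical names, and the per-manifest work is a stand-in for reading a manifest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPlanSketch {
    // Hypothetical stand-in for reading one manifest and producing scan tasks.
    static List<String> planManifest(String manifest) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            tasks.add(manifest + "-task-" + i);
        }
        return tasks;
    }

    // Plans each manifest on a worker thread and merges results in input order.
    static List<String> planAll(List<String> manifests, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String m : manifests) {
                futures.add(pool.submit(() -> planManifest(m)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get()); // blocks until that manifest is planned
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> tasks = planAll(List.of("m0", "m1", "m2", "m3"), 4);
        System.out.println(tasks.size() + " tasks planned");
    }
}
```

Whether something like this helps in my case depends on whether the time is spent reading manifests or matching delete files to data files, which I have not yet measured separately.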
   
   ### Question
   
   How can the planInputPartitions/planTask part be accelerated?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

