Re: [I] Spark: Doing a Coalesce and foreachpartitions in spark directly on an iceberg table is leaking memory heavy iterators [iceberg]


jkolash commented on issue #13297:
URL: https://github.com/apache/iceberg/issues/13297#issuecomment-2977939838


   I believe this happens because each partition would normally be read by its 
own task, but coalesce combines multiple partitions into one and hides the 
original task boundaries. The combined task cannot "complete" until every 
partition has been read, so the task-completion callbacks are held much longer.
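   
   As a rough illustration of that mechanism, here is a toy model in plain 
Python. It is not Spark or Iceberg code: `ToyTaskContext`, `run_task`, and the 
"reader" bookkeeping are hypothetical names, and it only assumes that each 
reader registers a cleanup callback that runs when the enclosing task ends.

```python
class ToyTaskContext:
    """Hypothetical stand-in for a task context: callbacks run only at task end."""

    def __init__(self):
        self._listeners = []

    def add_completion_listener(self, fn):
        self._listeners.append(fn)

    def complete(self):
        # Run the registered callbacks in reverse registration order, then drop them.
        for fn in reversed(self._listeners):
            fn()
        self._listeners.clear()


def run_task(num_partitions):
    """Read `num_partitions` input partitions inside ONE task.

    Returns (peak number of readers still held, readers held after task end).
    """
    ctx = ToyTaskContext()
    live_readers = set()
    peak = 0
    for pid in range(num_partitions):
        live_readers.add(pid)  # "open" a reader for this input partition
        # Assumption: cleanup is tied to task completion, not to reader exhaustion.
        ctx.add_completion_listener(lambda pid=pid: live_readers.discard(pid))
        # ... rows of this partition would be consumed here ...
        peak = max(peak, len(live_readers))
    ctx.complete()
    return peak, len(live_readers)


# Without coalesce: one input partition per task, so at most one reader is held.
print(run_task(1))  # (1, 0)
# With coalesce(1) over 8 input partitions: all 8 readers are held until task end.
print(run_task(8))  # (8, 0)
```

   In this sketch the number of retained readers grows with the number of 
partitions folded into a single task, which matches the behaviour described 
above.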
   
   I also ran with the Parquet v2 code:
   https://github.com/apache/iceberg/issues/13297#issuecomment-2968557949
   
   I believe a similar fix needs to be applied here:
   
https://github.com/apache/spark/blob/59e6b5b7d350a1603502bc92e3c117311ab2cbb6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala#L312
   
   <img width="1303" alt="Image" src="https://github.com/user-attachments/assets/c8375235-63a5-473b-97d9-50ae4654eed0" />
   
   > is it for all wide iceberg tables, and coalesce just makes it more vulnerable?
   
   This particular table is about 500 columns wide and has nested fields. I can 
produce a synthetic dataset later, or as part of this issue, so that anyone can 
reproduce it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

