amogh-jahagirdar commented on PR #11273:
URL: https://github.com/apache/iceberg/pull/11273#issuecomment-2457760335

> I am assuming that since we went ahead with the broadcasting approach, it sends the data chunk by chunk using torrent broadcast as @aokolnychyi mentioned, so OOM is not a problem?
   
I collected some data points on memory consumption of the broadcast here: https://docs.google.com/document/d/1yUObq45kBIwyofJYhurcQrpsdXQJC6NFIPBJugiZNWI/edit?tab=t.0
   
Torrent broadcast is performed in a chunked way, but that doesn't mean OOMs aren't possible. The TL;DR is that in most environments we would have to be at a pretty large scale (multiple millions of data files) with a very large multiple of delete files per data file before OOMs are hit, and running position delete maintenance to shrink that ratio, plus increasing memory, should be a practical enough solution. As position delete maintenance runs, the data file to delete file ratio trends toward 1:1; in V3, this will actually be a requirement. I took a look at more distributed approaches to computing this (changing the Spark APIs to pass historical deletes only to the particular executor that needs them), but there are limitations there. One thing I'm looking into further is how the Spark + Delta DV integration handles this, and we can perhaps take some inspiration from that, but I don't think there's really any need to wait for all of that.
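
For context, here's a minimal sketch of the broadcast step, assuming a plain data file path -> delete file paths map; the class and method names are illustrative, not Iceberg's actual internals:

```java
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

public class DeleteMapBroadcastSketch {

  // Broadcast a global data-file -> delete-files map from the driver.
  public static Broadcast<Map<String, List<String>>> broadcastDeleteMap(
      SparkSession spark, Map<String, List<String>> deleteFilesByDataFile) {
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    // SparkContext.broadcast uses TorrentBroadcast by default: the driver
    // serializes the value into fixed-size blocks (spark.broadcast.blockSize,
    // 4m by default) and executors fetch blocks from the driver and from each
    // other, BitTorrent-style. Chunking bounds the transfer granularity, not
    // the size of the fully deserialized map, which is why a sufficiently
    // large map can still OOM an executor when it is materialized on the heap.
    return jsc.broadcast(deleteFilesByDataFile);
  }
}
```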
   
There are relatively simple things we can do to limit the size of the global map. One is removing any unnecessary metadata that executors don't need, for example the referenced manifest locations per delete file (those are only needed on the driver for more efficient commits). Another is relativizing the paths in memory; that should shrink the total amount of memory used by paths across the in-memory structures, and it has more of an impact the longer the path prefix before the actual data file/delete file name.
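
To illustrate the relativization idea, a minimal sketch, assuming paths share the table location as a common prefix (class and method names here are hypothetical):

```java
public class RelativePathSketch {

  private final String tableLocation; // e.g. "s3://bucket/warehouse/db/table/"

  public RelativePathSketch(String tableLocation) {
    this.tableLocation = tableLocation;
  }

  // Strip the shared prefix before storing the path in the broadcast map; the
  // saving per path grows with the length of the prefix before the file name.
  public String relativize(String fullPath) {
    return fullPath.startsWith(tableLocation)
        ? fullPath.substring(tableLocation.length())
        : fullPath; // fall back to the full path if it lives outside the table
  }

  // Rebuild the full path when an executor actually needs it.
  public String resolve(String storedPath) {
    // Paths stored absolute (outside the table location) pass through as-is.
    return storedPath.contains("://") ? storedPath : tableLocation + storedPath;
  }
}
```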
   
My plan is to work towards having the "simple things we can do" in the 1.8 release, so that we further reduce the chance of OOMs in large-scale cases. The long-term plan is to look at how Spark + Delta DV handles this and, if it makes sense for us, incorporate that strategy here.
   
> Was mostly coming from the fact that when doing a BHJ, Spark enforces an 8GB limit, and if anything more than that is observed, Spark fails the query.
   
That 8GB limit is [specific to broadcast joins](https://github.com/apache/spark/blob/e17d8ecabcad6e84428752b977120ff355a4007a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L226); there's no such limit enforced by Spark itself for arbitrary broadcasts (of course, there are system limitations that would eventually be hit).
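
To make the distinction concrete: the linked check lives in BroadcastExchangeExec and only guards broadcast relations that the SQL planner builds for joins; a direct broadcast of arbitrary driver-side data never passes through that code path. A hedged sketch (names are illustrative):

```java
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastLimitSketch {

  // A direct broadcast of arbitrary driver-side data; this does not go
  // through BroadcastExchangeExec, so the 8GB broadcast-join check does not
  // apply. Driver and executor memory still bound what is practical.
  public static Broadcast<Map<String, Long>> broadcastArbitraryData(
      JavaSparkContext jsc, Map<String, Long> data) {
    return jsc.broadcast(data);
  }
}
```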
   

