amogh-jahagirdar commented on PR #11273: URL: https://github.com/apache/iceberg/pull/11273#issuecomment-2457760335
> I am assuming since we went ahead with broadcasting approach, it sends it chunk by chunk using torrent broadcast as @aokolnychyi mentioned, so OOM not a problem ?

I collected some data points on memory consumption of the broadcast here: https://docs.google.com/document/d/1yUObq45kBIwyofJYhurcQrpsdXQJC6NFIPBJugiZNWI/edit?tab=t.0

Torrent broadcast is performed in a chunked way, but that doesn't mean OOMs aren't possible. The TL;DR is that in most environments we would have to be at pretty large scale (multiple millions of data files) with a very large multiple of deletes per data file before OOMs are hit, and running position delete maintenance to shrink that ratio plus increasing memory should be a practical enough solution. As maintenance of position deletes runs, the ratio of data files to delete files becomes closer to 1:1; in V3 this will actually be a requirement.

I took a look at more distributed approaches to compute this (changing the Spark APIs to pass historical deletes only to the particular executor that needs them), but there are limitations there. One thing I'm looking into further is how the Spark + Delta DV integration handles this, and we can perhaps take some inspiration from that, but I don't think there's really any need to wait for all of that.

There are relatively simple things we can do to limit the size of the global map: one is removing any metadata that executors don't need, for example the referenced manifest locations per delete file (those are only needed in the driver for more efficient commits), and another is relativizing the paths held in memory. That should shrink the total amount of memory used by paths in the in-memory structures, and it has more of an impact the longer the path prefix before the actual data file/delete file name; a sketch of the idea is below. My plan is to work towards having these "simple things" in the 1.8 release, so that we further reduce the chance of OOMs in large-scale cases. The long-term plan is to look at how Spark + Delta DV handles this and, if it makes sense for us, incorporate that strategy here.

> Was mostly coming from when doing BHJ spark enforces 8GB limit and if anything more than that is observed spark fails the query.

That 8 GB limit is [specific to broadcast joins](https://github.com/apache/spark/blob/e17d8ecabcad6e84428752b977120ff355a4007a/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L226); there's no such limit enforced by Spark itself for arbitrary broadcasts (though of course there are system limitations that would eventually be hit).
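For illustration, a minimal, hypothetical sketch of what relativizing the paths in the broadcast map could look like. None of the names here (`relativize`, `absolutize`, the example table location) come from the Iceberg codebase, and the real implementation may differ; the point is only that stripping the shared table-location prefix before broadcasting avoids repeating it for every entry.

```java
import java.util.HashMap;
import java.util.Map;

public class PathRelativizationSketch {

  // Drop the table location prefix so only the relative part is kept in memory.
  static String relativize(String tableLocation, String absolutePath) {
    return absolutePath.startsWith(tableLocation)
        ? absolutePath.substring(tableLocation.length())
        : absolutePath; // path lives outside the table location, keep it as-is
  }

  // Re-attach the prefix on the executor when the full path is actually needed.
  static String absolutize(String tableLocation, String relativePath) {
    return relativePath.contains("://")
        ? relativePath // already an absolute URI, was never relativized
        : tableLocation + relativePath;
  }

  public static void main(String[] args) {
    String tableLocation = "s3://bucket/warehouse/db/table/";

    // Hypothetical data-file -> delete-file mapping with fully qualified paths.
    Map<String, String> dataToDelete = new HashMap<>();
    dataToDelete.put(
        tableLocation + "data/part-00000.parquet",
        tableLocation + "deletes/delete-00000.parquet");

    // Relativize both keys and values before building the broadcast variable.
    Map<String, String> compact = new HashMap<>();
    dataToDelete.forEach((data, delete) ->
        compact.put(relativize(tableLocation, data), relativize(tableLocation, delete)));

    compact.forEach((data, delete) ->
        System.out.println(data + " -> " + absolutize(tableLocation, delete)));
  }
}
```

The savings scale with the length of the shared prefix, so they would apply both to the serialized broadcast and to the deserialized map held on each executor.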