amogh-jahagirdar opened a new pull request, #11525:
URL: https://github.com/apache/iceberg/pull/11525

   This is a follow-up to https://github.com/apache/iceberg/pull/11273/files#
   
   Instead of broadcasting a map with absolute paths for data files and delete 
files to executors, we could shrink the memory footprint by relativizing the 
paths in the in-memory mapping and then, just prior to lookup on executors, 
reconstructing the absolute paths for the relevant delete files.
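   
   As a rough illustration of the idea (not the code in this PR), the sketch 
below strips a table location prefix before a path is stored and restores it 
just before lookup. The class `TablePathRelativizer`, its method names, and the 
`#` marker are hypothetical and are not Iceberg APIs; the marker assumes that no 
absolute path starts with `#`.

```java
// Sketch only: the class, method names, and marker are hypothetical, not Iceberg APIs.
final class TablePathRelativizer {
  // Hypothetical tag for relativized entries; assumes no absolute path starts with '#'.
  private static final String RELATIVE_MARKER = "#";

  private final String prefix;

  TablePathRelativizer(String tableLocation) {
    // End the prefix with a separator so that ".../tbl" does not match ".../tbl2/...".
    this.prefix = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
  }

  /** Strips the table location if the path is under it; otherwise returns the path unchanged. */
  String relativize(String absolutePath) {
    if (absolutePath.startsWith(prefix)) {
      return RELATIVE_MARKER + absolutePath.substring(prefix.length());
    }
    return absolutePath;
  }

  /** Restores the absolute path for a stored entry, just before lookup. */
  String toAbsolute(String storedPath) {
    if (storedPath.startsWith(RELATIVE_MARKER)) {
      return prefix + storedPath.substring(RELATIVE_MARKER.length());
    }
    return storedPath;
  }
}
```

   Paths that live outside the table location are stored as-is, so the round 
trip stays lossless even when only some of the files share the prefix.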
   
   
   There are a few ways to go about relativization. In the current 
implementation I just did the simplest thing, which was to relativize to the 
table location. There are more sophisticated things that could be done to save 
even more of the memory consumed by paths, such as relativizing according to the 
data file location (requires surfacing more details from LocationProvider) or 
finding the longest common prefix between all data/delete files in the 
rewritable deletes (requires a double pass over tasks: once to identify the 
longest common prefix via the smallest/largest lexicographical strings, and 
then another to actually reconstruct the delete files). Patricia tries are 
another possibility, though the serialized representation seems to take about 
the same amount of memory; I'm not sure why that's the case.
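   
   The first pass of the longest-common-prefix option could look roughly like 
the sketch below. It relies on the fact that the common prefix of the 
lexicographically smallest and largest strings is also the common prefix of 
every string in between. The class `CommonPathPrefix` and its method are 
hypothetical names used only for illustration.

```java
import java.util.List;

// Sketch only: a hypothetical helper, not part of Iceberg.
final class CommonPathPrefix {
  private CommonPathPrefix() {}

  /** Returns the longest prefix shared by all paths, or "" for an empty list. */
  static String longestCommonPrefix(List<String> paths) {
    if (paths.isEmpty()) {
      return "";
    }

    // Single pass to find the lexicographically smallest and largest paths.
    String min = paths.get(0);
    String max = paths.get(0);
    for (String path : paths) {
      if (path.compareTo(min) < 0) {
        min = path;
      }
      if (path.compareTo(max) > 0) {
        max = path;
      }
    }

    // Every other path sorts between min and max, so it shares their common prefix.
    int limit = Math.min(min.length(), max.length());
    int i = 0;
    while (i < limit && min.charAt(i) == max.charAt(i)) {
      i++;
    }
    return min.substring(0, i);
  }
}
```

   A second pass would then strip this prefix from each path, which is the 
extra cost compared to relativizing against the already known table location.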
   
   I'm also working on identifying whether using Spark's off-heap 
BytesToBytesMap will save us even more memory, but thought it made sense to at 
least get this improvement in the interim. This is all internal, so we can 
always remove it down the line if something better comes along.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org