On Wed, Apr 18, 2012 at 09:49:11AM +0200, Mike Dupont wrote:
> this is exciting, thanks for sharing.
>
> I wonder what amount of data is even the same between many libraries,

Of course there is a lot of DWARF duplication between different
libraries, binaries, or e.g. Linux kernel modules (which have the added
problem that they have relocations against the sections; as a first step
we could apply and remove the relocations against .debug_* sections there
(and do string merging of .debug_str at the same time), but there would
still be relocations against the module .text/.data etc.).

The problem is that we'd need DWARF extensions to do the duplicate
elimination between different libraries/binaries.
I can think of two possible approaches:

1) indicate somehow that .debug_* sections live elsewhere, in a single
   (per package?) *.debug object, where all the .debug_* sections would
   be concatenated together, and then just compress the debug info in that
   large object. The main problem with that is that suddenly all places
   in the debug info that refer to .text/.data (and other allocated
   sections) addresses need to be augmented somehow to say which of the
   possibly many shared libraries, kernel modules or binaries they refer
   to. That wouldn't be too hard. It could be done just by some attribute
   in each DW_TAG_*_unit saying what that CU refers to (if it uses any
   addresses anywhere), and other .debug_* sections that are solely
   referenced from .debug_info would be fine too. But e.g. .debug_aranges
   would need extensions...

2) or, alternatively, keep most of the debug info in the individual
   objects (shared libraries, binaries, kernel modules) and move only
   what dwz currently moves over into new DW_TAG_partial_unit CUs
   (assuming it doesn't contain any .text/.data references and only
   refers to DIEs inside of them or in other partial units that don't
   contain any .text/.data references). Those partial units would go
   to a .debug_info section in a separate file, and some new .debug_*
   section would hint the debug info consumers how to find the separate
   file (build-id, or filename, or a combination of both, whatever).

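The eligibility rule in 2) can be sketched roughly as follows. This is
only a toy Python model, not dwz code: the `Unit` class, its fields and
the `movable_units` function are hypothetical names made up for the
illustration. A unit is treated as movable if it has no address
references and refers only to units that are themselves movable.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Toy stand-in for a DW_TAG_partial_unit (hypothetical model)."""
    name: str
    has_addr_refs: bool = False              # refers to .text/.data etc.
    refs: set = field(default_factory=set)   # names of units it refers to


def movable_units(units):
    """Return the names of units that could move to the separate file.

    Start from every unit without address references, then repeatedly
    drop units that refer to something outside the candidate set, until
    the set stops shrinking.
    """
    by_name = {u.name: u for u in units}
    eligible = {u.name for u in units if not u.has_addr_refs}
    changed = True
    while changed:
        changed = False
        for name in sorted(eligible):        # sorted() copies, so discard is safe
            if not by_name[name].refs <= eligible:
                eligible.discard(name)
                changed = True
    return eligible


units = [
    Unit("types"),
    Unit("decls", refs={"types"}),
    Unit("funcs", has_addr_refs=True),
    Unit("inl", refs={"funcs"}),
    Unit("wrap", refs={"inl"}),
]
print(sorted(movable_units(units)))
```

In this example "inl" is rejected because it refers to a unit with
address references, and "wrap" is rejected in turn because it refers
to "inl".
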
If we support just one such separate file, we could just have
DW_FORM_alt_sec_offset and DW_FORM_ref_alt_addr new forms, which would
mean this is the corresponding .debug{_line,_loc,_info} section offset,
not inside of this file, but in the secondary file.
If we were to support more than one, we'd need to number them and add
forms that start with a uleb128 index of the separate file, followed by
the actual offset. Still, a shorthand form for the first separate file
might be handy, assuming that is what is done most of the time.

With many possibly large binaries/libraries together there are major
concerns about memory consumption though, so I think the tool would need
to do it in steps - compress each file individually first (what the tool
does right now) and for eligible partial units append them to a common
separate file (and keep them in the original file too). When the first
pass over all files is done, merge duplicates within the common separate
file, which holds just the partial units. The second pass would then take
the reduced common separate file and the compressed debug info from the
first pass, find duplicate partial units, switch references to them to
the alt forms and remove the no longer needed partial units.

Of course the separate common file would need to contain not just
.debug_info and .debug_abbrev sections, but also some minimal .debug_line
section (not containing actual line instructions, just the dir/file
tables).

My preference would be 2). What do you think?

	Jakub
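
To illustrate the multi-file variant mentioned above (a uleb128 index of
the separate file followed by the actual offset), here is a minimal
Python sketch. It is not an existing DWARF form or dwz code: the
`encode_alt_ref`/`decode_alt_ref` names are hypothetical, and the
4-byte little-endian offset (32-bit DWARF on a little-endian target) is
an assumption. Only the LEB128 encoding itself is standard.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("uleb128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # last byte, high bit clear
            return bytes(out)


def decode_uleb128(data, pos=0):
    """Decode unsigned LEB128 at data[pos:]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_alt_ref(file_index, offset, offset_size=4):
    """Hypothetical form value: uleb128 file index, then fixed-size offset.

    Assumes little-endian offsets of offset_size bytes.
    """
    return encode_uleb128(file_index) + offset.to_bytes(offset_size, "little")


def decode_alt_ref(data, pos=0, offset_size=4):
    """Inverse of encode_alt_ref; return (file_index, offset, next_pos)."""
    file_index, pos = decode_uleb128(data, pos)
    offset = int.from_bytes(data[pos:pos + offset_size], "little")
    return file_index, offset, pos + offset_size


encoded = encode_alt_ref(300, 0x1234)
print(encoded.hex(), decode_alt_ref(encoded))
```

For small file indexes the prefix costs a single byte per reference,
which is the overhead a shorthand form for the first separate file
would save.
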