Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module
Hi Milian, > I am regularly seeing broken backtraces for samples where I have > the gut feeling that missing reported ELFs are to blame. But we report > everything, except for scenarios where the mmap events seemingly overlap. Actually, at least for perfparser that's not quite true. When perfparser encounters an overlap error, it will throw out the entire set of mappings and restart reporting, with the addresses from the current sample (see PerfSymbolTable::reportElf() and PerfSymbolTable::clearCache()). If that still gives you overlapping ranges, it means perf has not sent all the mmap events and therefore we're reporting the wrong ELF for some address in your sample. That wrong ELF may be larger than the one we actually want and therefore it can overlap some other ELF an address in your sample points to. I've seen that happen. Make sure to keep your sample rate low enough to prevent perf from dropping anything. I realize we could optimize the reporting a bit, with the dwfl_report_end callback Mark mentioned, but if you have addresses into two overlapping ELFs in one sample, that's fundamentally impossible to unwind. Ulf
Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module
On Mittwoch, 21. März 2018 22:21:13 CET Mark Wielaard wrote: > Hi Milian, > > On Wed, Mar 21, 2018 at 02:01:41PM +0100, Milian Wolff wrote: > > Here's the code for the perf tools: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/ > > perf/util/unwind-libdw.c?h=perf/core#n52 > > > > Here's the code for the perfparser: > > > > http://code.qt.io/cgit/qt-creator/perfparser.git/tree/app/ > > perfsymboltable.cpp#n479 > > > > Let's concentrate on perf for now, but perfparser has similar logic: > > > > We parse the mmap events in the perf.data file and store that information. > > Note that the perf.data file does not contain events for munmap calls. > > Then > > while unwinding the callstack of a perf sample, we lookup the most recent > > mmap event for every given instruction pointer address, and ensure that > > the corresponding ELF was registered with libdw. > > So, modules are never deregistered? > In that case, that might explain the issue. No, they are deregistered - that is not the issue. Perf actually starts with a clean dwfl on every sample and registers whatever modules are relevant for the given sample. perfparser tries to be a bit smarter and caches more, but also has code to deregister if something goes amiss. > But I see there is a check if there is already something at the address. > The interface to "remove" a module might not be immediately clear. > The idea is that if modules need to be remove you'll call > dwfl_report_begin, possibly dwfl_report_elf for any new module and then > dwfl_report_end has a callback that gets all old modules and decides > whether to re-report them, or they'll get removed. You might want to > experiment with doing that and not re-report any module that overlaps > with the new module. (See the libdwfl.h documentation for a hopefully > clearer description.) > > > > Specifically are you using false for the add_p_vaddr argument? > > > > Yes, we are. > > > > > And could you provide some example where the reported address is > > > wrong/different from the start address of the Dwfl_Module? > > > > I don't think it's the start address that is wrong, rather it's the end > > address. But it's hard for me to come up with a small selfcontained > > example at this stage. I am regularly seeing broken backtraces for > > samples where I have the gut feeling that missing reported ELFs are to > > blame. But we report everything, except for scenarios where the mmap > > events seemingly overlap. This overlapping is, as far as I can see, > > actually a side effect of remapping taking place in the dynamic linker > > (i.e. a single dlopen/dynamic linked library can yield multiple mmap > > events). One way or another, we end up with a situation where we cannot > > report an ELF to dwfl due to two issues: > > > > a) either ELF tells us we are overlapping some module and just stops which > > is bad, since we would actually much prefer the newly reported ELF to > > take precedence > > > > b) we find an mmap event that with a non-zero pgoff, and have no clue how > > to call dwfl_report_elf and just give up. > > > > In both cases, I was hopeing for dwfl_report_module to help since it > > seemingly allows me to exactly recreate the mapping that was traced > > originally. > If you could add some logging and post that plus the eu-readelf -l > output of the ELF file, that might help track down what is really going > on. Yes, I will try to find the time to write a more elaborate reproducer for this issue, to better figure out what is going on here. Bye -- Milian Wolff m...@milianw.de http://milianw.de signature.asc Description: This is a digitally signed message part.
Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module
On Mittwoch, 21. März 2018 22:31:07 CET Mark Wielaard wrote: > On Wed, Mar 21, 2018 at 03:23:51PM +0100, Milian Wolff wrote: > > Here's one way to investigate where perf and dwfl disagree on the file > > mappings: > > > > diff --git a/tools/perf/util/unwind-libdw.c > > b/tools/perf/util/unwind-libdw.c index 7bdd239c795c..739c68b73772 100644 > > --- a/tools/perf/util/unwind-libdw.c > > +++ b/tools/perf/util/unwind-libdw.c > > @@ -48,11 +48,17 @@ static int __report_module(struct addr_location *al, > > u64 ip, > > > > mod = 0; > > > > } > > > > - if (!mod) > > + if (!mod) { > > + Dwarf_Addr s, e; > > > > mod = dwfl_report_elf(ui->dwfl, dso->short_name, > > > > (dso->symsrc_filename ? > > dso->symsrc_filename : dso- > > > > > >long_name), -1, al->map->start, > > > > > false); > > > > + dwfl_module_info(mod, NULL, &s, &e, NULL, NULL, NULL, NULL); > > + if (al->map->start != s || al->map->end != e) > > + pr_warning("MEH: %s | mmap: %zx %zx | dwfl: %zx %zx\n", > > dso- > > > > >short_name, al->map->start, al->map->end, s, e); > > > > + } > > + > > > > return mod && dwfl_addrmodule(ui->dwfl, ip) == mod ? 0 : -1; > > > > } > > > > On one of my perf.data files with many broken backtraces this gives me: > > > > MEH: heaptrack_print | mmap: 56166e9d4000 56166ea39000 | dwfl: > > 56166e9d4000 > > 56166ea38880 > > MEH: ld-2.26.so | mmap: 7fd0afc6c000 7fd0afe93000 | dwfl: 7fd0afc6c000 > > 7fd0afe920f8 > > MEH: libc-2.26.so | mmap: 7fd0ae16a000 7fd0ae521000 | dwfl: 7fd0ae16a000 > > 7fd0ae5208f0 > > MEH: libstdc++.so.6.0.24 | mmap: 7fd0aea84000 7fd0aee0b000 | dwfl: > > 7fd0aea84000 7fd0aee0a640 > > MEH: libzstd.so.1.3.3 | mmap: 7fd0aee0b000 7fd0af087000 | dwfl: > > 7fd0aee0b000 7fd0af086030 > > > > Interestingly, here the mmap events observed by perf are actually always > > *larger* than what dwfl sees... > > I think that is simply the end address being extended to the next page. > Where the page size seems to be 4K (0x1000). I assume if you look at > the actual LOAD segment sizes of the files (eu-readelf -l) they match > with what dwfl reports, and if it extends to the nearest 4k page it > matches whatp mmap reports. Yes, I agree - the above does not explain the issues I am seeing. It is most probably just noise and can be ignored. I'll try to dig deeper to figure out what is going on and then come back here. For now, it seems like using dwfl_report_elf is fine. Cheers -- Milian Wolff m...@milianw.de http://milianw.de signature.asc Description: This is a digitally signed message part.
Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module
On Donnerstag, 22. März 2018 10:11:29 CET Ulf Hermann wrote: > Hi Milian, > > > I am regularly seeing broken backtraces for samples where I have > > the gut feeling that missing reported ELFs are to blame. But we report > > everything, except for scenarios where the mmap events seemingly overlap. > > Actually, at least for perfparser that's not quite true. When perfparser > encounters an overlap error, it will throw out the entire set of mappings > and restart reporting, with the addresses from the current sample (see > PerfSymbolTable::reportElf() and PerfSymbolTable::clearCache()). Yes, I know :) I was more talking about the libdwfl integraiton in perf there. > If that > still gives you overlapping ranges, it means perf has not sent all the mmap > events and therefore we're reporting the wrong ELF for some address in your > sample. That wrong ELF may be larger than the one we actually want and > therefore it can overlap some other ELF an address in your sample points > to. > > I've seen that happen. Make sure to keep your sample rate low enough to > prevent perf from dropping anything. > > I realize we could optimize the reporting a bit, with the dwfl_report_end > callback Mark mentioned, but if you have addresses into two overlapping > ELFs in one sample, that's fundamentally impossible to unwind. In the concrete data file I have at hand, no chunks got lost, so I don't think that mmap events could have been lost? Cheers -- Milian Wolff m...@milianw.de http://milianw.de signature.asc Description: This is a digitally signed message part.