Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module

2018-03-22 Thread Ulf Hermann
Hi Milian,

> I am regularly seeing broken backtraces for samples where I have 
> the gut feeling that missing reported ELFs are to blame. But we report 
> everything, except for scenarios where the mmap events seemingly overlap.

Actually, at least for perfparser that's not quite true. When perfparser 
encounters an overlap error, it will throw out the entire set of mappings and 
restart reporting, with the addresses from the current sample (see 
PerfSymbolTable::reportElf() and PerfSymbolTable::clearCache()). If that still 
gives you overlapping ranges, it means perf has not sent all the mmap events 
and therefore we're reporting the wrong ELF for some address in your sample. 
That wrong ELF may be larger than the one we actually want and therefore it can 
overlap some other ELF an address in your sample points to.

I've seen that happen. Make sure to keep your sample rate low enough to prevent 
perf from dropping anything.

I realize we could optimize the reporting a bit, with the dwfl_report_end 
callback Mark mentioned, but if you have addresses into two overlapping ELFs in 
one sample, that's fundamentally impossible to unwind.

Ulf


Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module

2018-03-22 Thread Milian Wolff
On Mittwoch, 21. März 2018 22:21:13 CET Mark Wielaard wrote:
> Hi Milian,
> 
> On Wed, Mar 21, 2018 at 02:01:41PM +0100, Milian Wolff wrote:
> > Here's the code for the perf tools:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/
> > perf/util/unwind-libdw.c?h=perf/core#n52
> > 
> > Here's the code for the perfparser:
> > 
> > http://code.qt.io/cgit/qt-creator/perfparser.git/tree/app/
> > perfsymboltable.cpp#n479
> > 
> > Let's concentrate on perf for now, but perfparser has similar logic:
> > 
> > We parse the mmap events in the perf.data file and store that information.
> > Note that the perf.data file does not contain events for munmap calls.
> > Then
> > while unwinding the callstack of a perf sample, we lookup the most recent
> > mmap event for every given instruction pointer address, and ensure that
> > the corresponding ELF was registered with libdw.
> 
> So, modules are never deregistered?
> In that case, that might explain the issue.

No, they are deregistered - that is not the issue. Perf actually starts with a 
clean dwfl on every sample and registers whatever modules are relevant for the 
given sample. perfparser tries to be a bit smarter and caches more, but also 
has code to deregister if something goes amiss.

> But I see there is a check if there is already something at the address.
> The interface to "remove" a module might not be immediately clear.
> The idea is that if modules need to be remove you'll call
> dwfl_report_begin, possibly dwfl_report_elf for any new module and then
> dwfl_report_end has a callback that gets all old modules and decides
> whether to re-report them, or they'll get removed. You might want to
> experiment with doing that and not re-report any module that overlaps
> with the new module. (See the libdwfl.h documentation for a hopefully
> clearer description.)
> 
> > > Specifically are you using false for the add_p_vaddr argument?
> > 
> > Yes, we are.
> > 
> > > And could you provide some example where the reported address is
> > > wrong/different from the start address of the Dwfl_Module?
> > 
> > I don't think it's the start address that is wrong, rather it's the end
> > address. But it's hard for me to come up with a small selfcontained
> > example at this stage. I am regularly seeing broken backtraces for
> > samples where I have the gut feeling that missing reported ELFs are to
> > blame. But we report everything, except for scenarios where the mmap
> > events seemingly overlap. This overlapping is, as far as I can see,
> > actually a side effect of remapping taking place in the dynamic linker
> > (i.e. a single dlopen/dynamic linked library can yield multiple mmap
> > events). One way or another, we end up with a situation where we cannot
> > report an ELF to dwfl due to two issues:
> > 
> > a) either ELF tells us we are overlapping some module and just stops which
> > is bad, since we would actually much prefer the newly reported ELF to
> > take precedence
> > 
> > b) we find an mmap event that with a non-zero pgoff, and have no clue how
> > to call dwfl_report_elf and just give up.
> > 
> > In both cases, I was hopeing for dwfl_report_module to help since it
> > seemingly allows me to exactly recreate the mapping that was traced
> > originally.
> If you could add some logging and post that plus the eu-readelf -l
> output of the ELF file, that might help track down what is really going
> on.

Yes, I will try to find the time to write a more elaborate reproducer for this 
issue, to better figure out what is going on here.

Bye

-- 
Milian Wolff
m...@milianw.de
http://milianw.de

signature.asc
Description: This is a digitally signed message part.


Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module

2018-03-22 Thread Milian Wolff
On Mittwoch, 21. März 2018 22:31:07 CET Mark Wielaard wrote:
> On Wed, Mar 21, 2018 at 03:23:51PM +0100, Milian Wolff wrote:
> > Here's one way to investigate where perf and dwfl disagree on the file
> > mappings:
> > 
> > diff --git a/tools/perf/util/unwind-libdw.c
> > b/tools/perf/util/unwind-libdw.c index 7bdd239c795c..739c68b73772 100644
> > --- a/tools/perf/util/unwind-libdw.c
> > +++ b/tools/perf/util/unwind-libdw.c
> > @@ -48,11 +48,17 @@ static int __report_module(struct addr_location *al,
> > u64 ip,
> > 
> > mod = 0;
> > 
> > }
> > 
> > -   if (!mod)
> > +   if (!mod) {
> > +   Dwarf_Addr s, e;
> > 
> > mod = dwfl_report_elf(ui->dwfl, dso->short_name,
> > 
> >   (dso->symsrc_filename ? 
> > dso->symsrc_filename : dso-
> > >
> > >long_name), -1, al->map->start,
> > >
> >   false);
> > 
> > +   dwfl_module_info(mod, NULL, &s, &e, NULL, NULL, NULL, NULL);
> > +   if (al->map->start != s || al->map->end != e)
> > +   pr_warning("MEH: %s | mmap: %zx %zx | dwfl: %zx %zx\n", 
> > dso-
> > 
> > >short_name, al->map->start, al->map->end, s, e);
> > 
> > +   }
> > +
> > 
> > return mod && dwfl_addrmodule(ui->dwfl, ip) == mod ? 0 : -1;
> >  
> >  }
> > 
> > On one of my perf.data files with many broken backtraces this gives me:
> > 
> > MEH: heaptrack_print | mmap: 56166e9d4000 56166ea39000 | dwfl:
> > 56166e9d4000
> > 56166ea38880
> > MEH: ld-2.26.so | mmap: 7fd0afc6c000 7fd0afe93000 | dwfl: 7fd0afc6c000
> > 7fd0afe920f8
> > MEH: libc-2.26.so | mmap: 7fd0ae16a000 7fd0ae521000 | dwfl: 7fd0ae16a000
> > 7fd0ae5208f0
> > MEH: libstdc++.so.6.0.24 | mmap: 7fd0aea84000 7fd0aee0b000 | dwfl:
> > 7fd0aea84000 7fd0aee0a640
> > MEH: libzstd.so.1.3.3 | mmap: 7fd0aee0b000 7fd0af087000 | dwfl:
> > 7fd0aee0b000 7fd0af086030
> > 
> > Interestingly, here the mmap events observed by perf are actually always
> > *larger* than what dwfl sees...
> 
> I think that is simply the end address being extended to the next page.
> Where the page size seems to be 4K (0x1000). I assume if you look at
> the actual LOAD segment sizes of the files (eu-readelf -l) they match
> with what dwfl reports, and if it extends to the nearest 4k page it
> matches whatp mmap reports.

Yes, I agree - the above does not explain the issues I am seeing. It is most 
probably just noise and can be ignored.

I'll try to dig deeper to figure out what is going on and then come back here. 
For now, it seems like using dwfl_report_elf is fine.

Cheers

-- 
Milian Wolff
m...@milianw.de
http://milianw.de

signature.asc
Description: This is a digitally signed message part.


Re: How to associate Elf with Dwfl_Module returned by dwfl_report_module

2018-03-22 Thread Milian Wolff
On Donnerstag, 22. März 2018 10:11:29 CET Ulf Hermann wrote:
> Hi Milian,
> 
> > I am regularly seeing broken backtraces for samples where I have
> > the gut feeling that missing reported ELFs are to blame. But we report
> > everything, except for scenarios where the mmap events seemingly overlap.
> 
> Actually, at least for perfparser that's not quite true. When perfparser
> encounters an overlap error, it will throw out the entire set of mappings
> and restart reporting, with the addresses from the current sample (see
> PerfSymbolTable::reportElf() and PerfSymbolTable::clearCache()).

Yes, I know :) I was more talking about the libdwfl integraiton in perf there.

> If that
> still gives you overlapping ranges, it means perf has not sent all the mmap
> events and therefore we're reporting the wrong ELF for some address in your
> sample. That wrong ELF may be larger than the one we actually want and
> therefore it can overlap some other ELF an address in your sample points
> to.
> 
> I've seen that happen. Make sure to keep your sample rate low enough to
> prevent perf from dropping anything.
> 
> I realize we could optimize the reporting a bit, with the dwfl_report_end
> callback Mark mentioned, but if you have addresses into two overlapping
> ELFs in one sample, that's fundamentally impossible to unwind.

In the concrete data file I have at hand, no chunks got lost, so I don't think 
that mmap events could have been lost?

Cheers

-- 
Milian Wolff
m...@milianw.de
http://milianw.de

signature.asc
Description: This is a digitally signed message part.