Difference between dwarf_getscopes and dwarf_getscopes_die

2020-06-13 Thread Milian Wolff
Hey all,

can someone explain me the difference between dwarf_getscopes and 
dwarf_getscopes_die? Ideally, this should then be added to the documentation 
too.

Currently, the documentation states:

```
/* Return scope DIEs containing PC address.
   Sets *SCOPES to a malloc'd array of Dwarf_Die structures,
   and returns the number of elements in the array.
   (*SCOPES)[0] is the DIE for the innermost scope containing PC,
   (*SCOPES)[1] is the DIE for the scope containing that scope, and so on.
   Returns -1 for errors or 0 if no scopes match PC.  */
extern int dwarf_getscopes (Dwarf_Die *cudie, Dwarf_Addr pc,
Dwarf_Die **scopes);

/* Return scope DIEs containing the given DIE.
   Sets *SCOPES to a malloc'd array of Dwarf_Die structures,
   and returns the number of elements in the array.
   (*SCOPES)[0] is a copy of DIE.
   (*SCOPES)[1] is the DIE for the scope containing that scope, and so on.
   Returns -1 for errors or 0 if DIE is not found in any scope entry.  */
extern int dwarf_getscopes_die (Dwarf_Die *die, Dwarf_Die **scopes);
```

Most notably, both say:

```
   (*SCOPES)[1] is the DIE for the scope containing that scope, and so on.
```

But in practice, there seems to be a difference. Inspired by the code form eu-
addr2line we resolve inline frames like this:

```
Dwarf_Addr bias = 0;
Dwarf_Die *cudie = dwfl_module_addrdie(module, ip, &bias);

Dwarf_Die *subroutine = nullptr;
Dwarf_Die *scopes = nullptr;
int nscopes = dwarf_getscopes(cudie, ip - bias, &scopes);
for (int i = 0; i < nscopes; ++i) {
Dwarf_Die *scope = &scopes[i];
const int tag = dwarf_tag(scope);
if (tag == DW_TAG_subprogram || tag == DW_TAG_inlined_subroutine) {
subroutine = scope;
break;
}
}

Dwarf_Die *scopes_die = nullptr;
int nscopes_die = dwarf_getscopes_die(subroutine, &scopes_die);
for (int i = 0; i < nscopes_die; ++i) {
Dwarf_Die *scope = &scopes_die[i];
const int tag = dwarf_tag(scope);
if (tag == DW_TAG_subprogram || tag == DW_TAG_inlined_subroutine) {
// do stuff
}
}

free(scopes_die);
free(scopes);
```

In the above, subroutine points to somewhere in dwarf_getscopes. As such, 
passing it to dwarf_getscopes_die, one would assume from the documentation 
(see above) that we get the sames scopes we already found via 
`dwarf_getscopes` (with an offset), but that is clearly no the case. Here is 
an example:

```
dwarf_getscopes: n = 2
i = 1, tag = 0x11, name = "../../../manual/clients/vector.cpp"
i = 0, tag = 0x1d, name = "_ZN9__gnu_cxx13new_allocatorIdE10deallocateEPdm"
```

Then, we use the DIE at i = 0 from above to call dwarf_getscopes_die, which 
yields:

```
dwarf_getscopes_die, n = 6
i = 5, tag = 0x11, name = "../../../manual/clients/vector.cpp"
i = 4, tag = 0x2e, name = 
"_ZNSt6vectorIdSaIdEE17_M_realloc_insertIJdEEEvN9__gnu_cxx17__normal_iteratorIPdS1_EEDpOT_"
i = 3, tag = 0x1d, name = "_ZNSt12_Vector_baseIdSaIdEE13_M_deallocateEPdm"
i = 2, tag = 0x0b
i = 1, tag = 0x1d, name =  
"_ZNSt16allocator_traitsISaIdEE10deallocateERS0_Pdm"
i = 0, tag = 0x1d, name = "_ZN9__gnu_cxx13new_allocatorIdE10deallocateEPdm"
```

As we can see, dwarf_getscopes_die returns more scopes than dwarf_getscopes. 
For me, the documentation for `dwarf_getscopes` is confusing. I would have 
expected that it should be a superset of dwarf_getscopes_die, but apparently 
that is not the case.

Thanks

-- 
Milian Wolff
m...@milianw.de
http://milianw.de

signature.asc
Description: This is a digitally signed message part.


Can dwarf_getscopes{,_die} performance be improved?

2020-06-13 Thread Milian Wolff
Hey all!

In perfparser we are running into a performance issue with elfutils when we 
try to resolve inline frames. We are following the procedure outlined by eu-
addr2line, i.e. for every IP we basically do:

```
Dwarf_Addr bias = 0;
Dwarf_Die *cudie = dwfl_module_addrdie(module, ip, &bias);

Dwarf_Die *subroutine = nullptr;
Dwarf_Die *scopes = nullptr;
int nscopes = dwarf_getscopes(cudie, ip - bias, &scopes);
for (int i = 0; i < nscopes; ++i) {
Dwarf_Die *scope = &scopes[i];
const int tag = dwarf_tag(scope);
if (tag == DW_TAG_subprogram || tag == DW_TAG_inlined_subroutine) {
subroutine = scope;
break;
}
}

Dwarf_Die *scopes_die = nullptr;
int nscopes_die = dwarf_getscopes_die(subroutine, &scopes_die);
for (int i = 0; i < nscopes_die; ++i) {
Dwarf_Die *scope = &scopes_die[i];
const int tag = dwarf_tag(scope);
if (tag == DW_TAG_subprogram || tag == DW_TAG_inlined_subroutine) {
// do stuff
}
}

free(scopes_die);
free(scopes);
```

Profiling shows that both, the calls to dwarf_getscopes and 
dwarf_getscopes_die can take a really long time. I have seen cases (in 
libclang.so.11) where a single call can take up to ~50ms on a fast desktop 
machine (Ryzen 3900X CPU, fast SSD, 32GB of RAM).

Now while 50ms may not sound sound too problematic, we have to repeat these 
calls for every IP we encounter in a perf.data file. We already apply heavy 
caching, but even then we can easily encounter tens of thousands of individual 
addresses, which can add up to minutes or even hours of time required to 
process the data - only to get information on inlined frames.

I have learned that the DWARF format is mostly meant for efficient storage and 
that one cannot assume that it is efficient for such mass post processing. 
This is the reason why I'm writing this email:

Has anyone an idea on how to to post-process the DWARF data to optimize the 
lookup of inlined frames?

Or is there some room for optimizations / caching within elfutils to amortize 
the repeated DWARF hierarchy walking that happens when one calls 
dwarf_getscopes{,_die}? From what I'm understanding, both calls will start a 
top-down search within the CU DIE via __libdw_visit_scopes. Once a match is 
found, the DIE scope chain is reported. I guess many times (parts of the) DIE 
scope chain will be shared across different IPs, but I don't see any way to 
leverage this to speed up the processing task.

Thanks, any input would be welcome
-- 
Milian Wolff
m...@milianw.de
http://milianw.de

signature.asc
Description: This is a digitally signed message part.