Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-14 Thread Romain Geissler via Elfutils-devel
Hi,

I am reaching out to the elfutils mailing list to discuss an issue we have seen
in one of our OpenShift production workloads. Note that if you are a Red Hat
employee, this actually corresponds to Red Hat support case 03536980, which I
have yet to update with this information.

Description of the issue:
While moving from OpenShift 4.12 to 4.13, we also moved the system from a RHEL 8
base to a RHEL 9 one. Among the changes, this upgrades systemd to version 252.
Historically systemd-coredump used to care only about build-ids when analysing
core dumps, however now it also tries to parse the package metadata note section
(cf https://systemd.io/ELF_PACKAGE_METADATA/).
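
That spec defines an ELF note with owner "FDO" and type 0xcafe1a7e whose
payload is a JSON string describing the package. As a minimal sketch of what
parsing it involves (my illustration of the spec, not systemd's actual code),
locating such a note with libelf looks roughly like this:

#include <gelf.h>
#include <stdio.h>
#include <string.h>

#define PACKAGE_METADATA_TYPE 0xcafe1a7e  /* per the spec above */

static void print_package_metadata(Elf *elf)
{
    size_t phnum;
    if (elf_getphdrnum(elf, &phnum) != 0)
        return;

    for (size_t i = 0; i < phnum; i++) {
        GElf_Phdr mem, *phdr = gelf_getphdr(elf, i, &mem);
        if (!phdr || phdr->p_type != PT_NOTE)
            continue;

        /* Pull the raw note bytes out of the file. */
        Elf_Data *data = elf_getdata_rawchunk(elf, phdr->p_offset,
                                              phdr->p_filesz, ELF_T_NHDR);
        if (!data)
            continue;

        /* Walk the notes looking for the "FDO" package metadata one. */
        GElf_Nhdr nhdr;
        size_t off = 0, name_off, desc_off;
        while ((off = gelf_getnote(data, off, &nhdr, &name_off, &desc_off)) > 0)
            if (nhdr.n_type == PACKAGE_METADATA_TYPE
                && nhdr.n_namesz == sizeof("FDO")
                && memcmp((char *)data->d_buf + name_off, "FDO", 4) == 0)
                /* The payload is a NUL-terminated JSON string. */
                printf("%.*s\n", (int)nhdr.n_descsz,
                       (char *)data->d_buf + desc_off);
    }
}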

We use systemd-coredump as our default core handler, and since this is an
OpenShift node we mainly run containers on it. Inside one of our containers we
have a particular binary which links against close to 2000 shared libraries.
When this program crashes, it takes several minutes for systemd-coredump to
"analyze" the core, and that time is spent parsing the ELF parts of it.

I have been able to extract the relevant logic from the systemd code (cf
https://github.com/systemd/systemd/blob/22afd5cb2cf06418e5a84907b1ff782d29a33e47/src/shared/elf-util.c#L482
) into a simplified C program, linking against elfutils, that shows how slow
this path is. I will attach this program a bit further below, and also explain
my findings and the "fix" I suggested internally that we propose to the systemd
folks.

First, to reproduce, we can create a dummy binary that links against 2000
libraries. Credit for finding and writing this simple reproducer goes to
Francois Rigault, one of my colleagues:

cat > lib.c << EOF
int TOTO = 42;
EOF

cat > generate-core.c << EOF
#include <stdio.h>
int main() {
    printf("started\n");
    int* i = 0;
    *i = 42;  /* crash on purpose with SIGSEGV */
}
EOF

N_LIBS=2000
LIBS=""
# Note: bash does not brace-expand {1..${N_LIBS}} with a variable, hence seq.
for i in $(seq 1 ${N_LIBS}); do
    [[ -f liba$i.so ]] || gcc -shared -fPIC -o liba$i.so lib.c -D TOTO=var$i
    LIBS="${LIBS} -la$i"
done
echo libs built

gcc -o generate-core generate-core.c -L. -Wl,-rpath,$(pwd) ${LIBS}
echo running binary
time ./generate-core

# Here a core is generated; rename it to "the-core" in the current directory.

# Finally, remove all libaN.so files, to simulate the fact that systemd-coredump
# runs in a different mount namespace than the container process which crashed
# (otherwise the following reproducer might be "fast"):
rm liba*.so



Once a core dump has been generated, I tried to mimic the code of
systemd-coredump. It looks like this:

cat > mimic-systemd-coredump.c << EOF
#include "elfutils/libdwfl.h"
#include <stdio.h>
#include <assert.h>
#include <fcntl.h>   /* for the reconstructed main() below */

static const Dwfl_Callbacks callbacks = {
    .find_elf = dwfl_build_id_find_elf,
    .section_address = dwfl_offline_section_address,
    .find_debuginfo = dwfl_standard_find_debuginfo,
};

void print_elf_error()
{
    printf("Elf error: %s\n", elf_errmsg(elf_errno()));
}

void print_dwfl_error()
{
    printf("Dwfl error: %s\n", dwfl_errmsg(dwfl_errno()));
}

// In systemd, this tries to retrieve the JSON package metadata note.
// Returns a value > 0 when data is found,
// 0 when no data is found but no error happened,
// and a value < 0 when a hard error happens and we have to stop.
int parse_module_package_metadata_best_effort(Elf* elf, const char* name)
{
    size_t n_program_headers;
    int r = elf_getphdrnum(elf, &n_program_headers);

    if (r == 0)
    {
        printf("Found %zu program headers for module %s\n",
               n_program_headers, name);
        // Don't implement the full systemd logic here.

        return 1;
    }
    else
    {
        // Best-effort parsing: silently ignore the error.
        return 0;
    }
}

static int module_callback(Dwfl_Module *mod __attribute__ ((unused)),
                           void **userdata __attribute__ ((unused)),
                           const char *name __attribute__ ((unused)),
                           Dwarf_Addr start, void *arg)
{
    Elf* elf = NULL;

    GElf_Addr bias;
    // Comment out the next line to "improve" performance.
    elf = dwfl_module_getelf(mod, &bias);

    if (elf)
    {
        printf("dwfl_module_getelf has found the elf program %s\n", name);
        int ret = parse_module_package_metadata_best_effort(elf, name);

        if (ret > 0)
            return DWARF_CB_OK;
        else if (ret < 0)
            return DWARF_CB_ABORT;
        else
            elf = arg;  // Fall back to the core file itself.
    }
    else
    {
        printf("dwfl_module_getelf returned NULL for module %s\n", name);
        //print_dwfl_error();
        elf = arg;  // Fall back to the core file itself.
    }

    size_t n_program_headers;
    assert(elf_getphdrnum(elf, &n_program_headers) >= 0);
    printf("iterate: %s: %zu\n", name, n_program_headers);

    for (size_t i = 0; i < n_program_headers; ++i) {
        GElf_Phdr mem, *program_header;
        Elf_Data *data;

        /* The core file stores the ELF files in the PT_LOAD segments. */
        program_header = gelf_getphdr(elf, i, &mem);
        if (!program_header || program_header->p_type != PT_LOAD)
            continue;

        /* This PT_LOAD segment doesn't contain the start address, so it
           can't be the module we are looking for. */
        if (start < program_header->p_vaddr
            || start >= program_header->p_vaddr + program_header->p_memsz)
            continue;
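        /* [The message is truncated here in the archive. What follows is a
           plausible reconstruction, not the original attachment: based on
           the systemd elf-util.c linked above, the matching PT_LOAD bytes
           are pulled out of the core with elf_getdata_rawchunk -- these
           are the per-module calls discussed later in this thread.] */
        data = elf_getdata_rawchunk(elf, program_header->p_offset,
                                    program_header->p_filesz, ELF_T_NHDR);
        if (data)
            printf("got rawchunk of %zu bytes for module %s\n",
                   data->d_size, name);
    }

    return DWARF_CB_OK;
}

int main()
{
    /* Reconstructed as well: the usual libdwfl core-file pattern -- open
       "the-core" generated above, report it, then walk its modules. */
    elf_version(EV_CURRENT);

    int fd = open("the-core", O_RDONLY);
    assert(fd >= 0);

    Elf* elf = elf_begin(fd, ELF_C_READ, NULL);
    if (!elf) { print_elf_error(); return 1; }

    Dwfl* dwfl = dwfl_begin(&callbacks);
    if (!dwfl) { print_dwfl_error(); return 1; }

    if (dwfl_core_file_report(dwfl, elf, NULL) < 0) { print_dwfl_error(); return 1; }
    if (dwfl_report_end(dwfl, NULL, NULL) != 0) { print_dwfl_error(); return 1; }

    if (dwfl_getmodules(dwfl, module_callback, elf, 0) < 0)
        print_dwfl_error();

    dwfl_end(dwfl);
    elf_end(elf);
    return 0;
}
EOF

# Presumably built and run along these lines (not part of the original message):
gcc -o mimic-systemd-coredump mimic-systemd-coredump.c -ldw -lelf
time ./mimic-systemd-coredump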

Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-19 Thread Romain GEISSLER via Elfutils-devel
> On 19 June 2023 at 17:08, Mark Wielaard wrote:
> 
> Hi Romain,
> 
> Just to let you know I am looking at this. But haven't made much
> progress in understanding it yet. Thanks so much for the reproducer. I
> have been able to see the (very slow) parsing of the core file with it.

Hi,

Thanks! And sorry that Laurent pinged you directly on Slack, I wanted to reach
you via this mailing list rather than through the Red Hat customer network ;)

I don't know if you have read the Red Hat case too. There you can find things
clarified a bit more, and split into what I think are potentially 3 distinct
"problems" with 3 distinct possible fixes. Since there is nothing private, I
can write about this here on this public mailing list as well.

So in the end I see 3 points (in addition to not understanding why finding the
ELF header returns NULL while it should not, which I guess you are currently
looking at):
 - The idea that systemd developers should invert their logic: first try to
parse the ELF/program headers from the (maybe partial) core dump PT_LOAD
segments.
 - This special "if" condition that I have added to the original systemd code:

+/* This PT_LOAD segment doesn't contain the start address, so it can't be
+   the module we are looking for. */
+if (start < program_header->p_vaddr ||
+    start >= program_header->p_vaddr + program_header->p_memsz)
+        continue;

to be added near this line:
https://github.com/systemd/systemd/blob/72e7bfe02d7814fff15602726c7218b389324159/src/shared/elf-util.c#L540

on which I would like to ask whether, with your knowledge of how core dumps and
ELF files are shaped, it indeed looks like a "right" fix.
 - The idea that the commit
https://sourceware.org/git/?p=elfutils.git;a=commitdiff;h=8db849976f07046d27b4217e9ebd08d5623acc4f
which assumed that the order of magnitude of program headers in a "normal" ELF
file is 10, so that a linked list would be enough, might be wrong in the
special case of core dumps, which may have far more program headers. And if it
indeed makes sense to call elf_getdata_rawchunk for each and every program
header of a core, should this linked list be changed into some set/hashmap
indexed by start address/size? (See the sketch right after this list.)
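
To make that last point concrete, here is a rough sketch of such an index
keyed by (offset, size, type), using POSIX tsearch() -- my illustration, not
an actual elfutils patch:

#include <search.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only: the real cache entry would also carry the cached Elf_Data. */
struct rawchunk_key {
    int64_t offset;
    size_t size;
    int type;            /* Elf_Type in the real code */
};

static int rawchunk_cmp(const void *a, const void *b)
{
    const struct rawchunk_key *ka = a, *kb = b;
    if (ka->offset != kb->offset)
        return ka->offset < kb->offset ? -1 : 1;
    if (ka->size != kb->size)
        return ka->size < kb->size ? -1 : 1;
    return (ka->type > kb->type) - (ka->type < kb->type);
}

/* Lookup-or-insert in O(log n) per elf_getdata_rawchunk call,
   instead of O(n) with the current linear list. */
static struct rawchunk_key *rawchunk_find(void **root, struct rawchunk_key *key)
{
    struct rawchunk_key **slot = tsearch(key, root, rawchunk_cmp);
    return slot ? *slot : NULL;
}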

> 
> $ time ./mimic-systemd-coredump
> [...]
> real    3m35.965s
> user    0m0.722s
> sys     0m0.345s
> 
> Note however that a lot of time is "missing".
> And in fact running it again is fast!?!
> 
> $ time ./mimic-systemd-coredump
> real    0m0.327s
> user    0m0.272s
> sys     0m0.050s
> 
> This is because of the kernel inode/dentry cache.
> If I do $ echo 2 | sudo tee /proc/sys/vm/drop_caches
> before running ./mimic-systemd-coredump it is always slow.

Interesting! I didn't see that (actually I never let the program run till the
end!).

> Which does bring up the question why systemd-coredump isn't running in
> the same mount space as the crashing program. Then it would simply find
> the files that the crashing program is using.

On this point that systemd-coredump might not run in the same mount namespace,
don't blindly believe me. I think I saw this while reviewing the systemd code,
but that was the first time I looked at it, to investigate this issue, so I may
be wrong. But I am sure you have access to some systemd colleagues at Red Hat
to double-check the details ;)
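
(For what it's worth, a handler running outside the crashing process's mount
namespace can still resolve that process's files through /proc/<pid>/root. A
minimal sketch of that kernel facility -- my illustration, not what systemd
actually does:)

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Open an absolute path, e.g. "/usr/lib64/libfoo.so", as the crashing
   process sees it, by going through its /proc/<pid>/root link, which
   crosses mount namespaces. */
static int open_in_crashed_namespace(pid_t pid, const char *path)
{
    char buf[PATH_MAX];
    snprintf(buf, sizeof buf, "/proc/%d/root%s", (int)pid, path);
    return open(buf, O_RDONLY | O_CLOEXEC);
}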

Cheers,
Romain

Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-20 Thread Romain GEISSLER via Elfutils-devel
> On 20 June 2023 at 23:37, Mark Wielaard wrote:
> 
> Hi,
> 
> On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
> 
> So I made a mistake here. Since I was testing on fedora 38 which has
> DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
> slowdown.
> 
> Do you have the DEBUGINFOD_URLS environment variable set?
> 
> The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).
> 
> Thanks,
> 
> Mark

Hi,

Our real use case happens on an OpenShift 4.13 node, so the OS is Red Hat
CoreOS 9 (which I assume shares a lot of foundations with RHEL 9).

On our side, Francois also told me this afternoon that my reproducer posted
here doesn't really reproduce the same thing as the real systemd-coredump
issue he witnessed live, and he also noticed that with DEBUGINFOD_URLS unset or
set to the empty string my reproducer has no problem anymore. What he witnessed
in the real case (using perf/gdb) was that apparently lots of time was spent in
elf_getdata_rawchunk, often in this kind of stack:

Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 1646850

Overhead  Command         Shared Object      Symbol
  98.24%  (sd-parse-elf)  libelf-0.188.so    [.] elf_getdata_rawchunk
   0.48%  (sd-parse-elf)  libelf-0.188.so    [.] 0x48a3
   0.27%  (sd-parse-elf)  libelf-0.188.so    [.] gelf_getphdr
   0.11%  (sd-parse-elf)  libc.so.6          [.] _int_malloc
   0.10%  (sd-parse-elf)  libelf-0.188.so    [.] gelf_getnote
   0.06%  (sd-parse-elf)  libc.so.6          [.] __libc_calloc
   0.05%  (sd-parse-elf)  [kernel.kallsyms]  [k] __softirqentry_text_start
   0.05%  (sd-parse-elf)  libc.so.6          [.] _int_free


(gdb) bt
#0  0x7f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1
#1  0x7f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#2  0x7f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1
#3  0x7f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#4  0x562c474f2d5e in submit_coredump ()
#5  0x562c474f57d1 in process_socket.constprop ()
#6  0x562c474efbf8 in main ()

My reproducer actually doesn't fully re-implement what systemd implements (the
parsing of the package metadata is clearly omitted), so I thought I had
reproduced the same problem while apparently I hadn't; sorry for that. We will
also have to double-check whether just using 2000 dummy libraries is enough, or
whether this also needs a more complex binary like the one we have in our real
case.

Tomorrow on our side we will have to play a bit with a local build of 
systemd-coredump and try to run it manually to better understand what’s going 
wrong.


Note: when I wrote and tested my reproducer, I used a fedora:38 container,
which doesn't have DEBUGINFOD_URLS set (this may differ from a real,
non-containerized Fedora 38):

[root@7563ccfb7a39 /]# printenv|grep DEBUGINFOD_URLS
[root@7563ccfb7a39 /]# find /etc/profile.d/|grep debug
[root@7563ccfb7a39 /]# cat /etc/os-release
NAME="Fedora Linux"
VERSION="38 (Container Image)"

Cheers,
Romain



Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-21 Thread Romain GEISSLER via Elfutils-devel
> On 21 June 2023 at 18:24, Mark Wielaard wrote:
> 
> 
> 
> Hi Romain,
> 
> Seeing those performance results I understand why you were suspecting
> the linked list data structure used in elf_getdata_rawchunk.
> 
> Would you be able to test a rebuild libelf with the attached patch,
> which replaces that datastructure with a binary search tree?
> 
> It didn't really show much speedup locally (in the noise, maybe 0.01
> sec faster on ~0.25 sec run). But if there are more than 2000 calls to
> elf_getdata_rawchunk it should make things faster.

Hi Mark,

Actually I have spent some time today reproducing the issue for real, using
systemd directly. The details can be found in
https://github.com/systemd/systemd/pull/28093 which was just merged. This
change in systemd is enough to fix the biggest part of the "slow" parsing of
cores. Yet even with my systemd patch, we will still call elf_getdata_rawchunk
2000 times if the crashing process had 2000 shared libraries, so indeed your
patch in elfutils might still be welcome. I will test it later when I get home,
or tomorrow.

Thanks,
Romain

Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-22 Thread Romain GEISSLER via Elfutils-devel
> On 21 June 2023 at 21:39, Mark Wielaard wrote:
> 
> 
> Hi Romain,
> 
> That patch looks good. It should reduce the number of calls to
> elf_getdata_rawchunk a lot. Making it less urgent that function is
> fast. But if you could test it that would be appreciated. I'll also
> test it a bit more and will probably merge it if no issues show up
> since it does seem better than keep using the linear list.
> 
> Thanks,
> 
> Mark


Hi,

So I have done some testing, running the command:

[root@fa28b3d254fd systemd]# rm -rf /var/lib/systemd/coredump/* && journalctl --flush --rotate && journalctl --vacuum-time=1s && time cat the-core | build/systemd-coredump $$ 0 0 11 1687360485 1000 localhost && journalctl

(the arguments correspond to the kernel core_pattern fields %P %u %g %s %t %c
%h, i.e. PID, UID, GID, signal, timestamp, core size limit, hostname)

where "the-core" is our real core dump we had in production, with around 1700+
shared libraries loaded and an uncompressed size of 2GB+.
Without the systemd patch, without the elfutils patch:
real    3m42.308s
user    3m39.579s
sys     0m2.294s


Without the systemd patch, with the elfutils patch (3 runs, a first one
excluded to make sure the kernel caches what it caches):
real    0m15.543s
user    0m13.662s
sys     0m2.405s

real    0m15.976s
user    0m13.832s
sys     0m2.481s

real    0m15.612s
user    0m13.687s
sys     0m2.470s


With the systemd patch, without the elfutils patch (3 runs, a first one
excluded to make sure the kernel caches what it caches):
real    0m2.994s
user    0m1.104s
sys     0m2.477s

real    0m3.011s
user    0m1.154s
sys     0m2.447s

real    0m3.009s
user    0m1.141s
sys     0m2.457s


With the systemd patch, with the elfutils patch (3 runs, a first one excluded
to make sure the kernel caches what it caches):
real    0m2.921s
user    0m1.093s
sys     0m2.399s

real    0m2.950s
user    0m1.129s
sys     0m2.401s

real    0m2.933s
user    0m1.136s
sys     0m2.371s
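
To summarize (median "real" time of the runs above):

systemd patch | elfutils patch | real time
no            | no             | 3m42.308s
no            | yes            | 0m15.612s
yes           | no             | 0m3.009s
yes           | yes            | 0m2.933s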


Overall we can see that each fix independently addresses the performance issue
(yet the systemd patch seems a bit more effective), so I guess both fixes are
welcome.

Mark, do you think it's worth backporting this on the elfutils side to CentOS
Stream 9/RHEL 9? If you need a ticket, we have Red Hat case 03536980 which led
to RHBZ 2215412.

Thanks !
Romain


Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-30 Thread Romain GEISSLER via Elfutils-devel
> On 27 June 2023 at 16:09, Mark Wielaard wrote:
>
> Hi,
>
> On Wed, 2023-06-21 at 18:24 +0200, Mark Wielaard wrote:
>
> I pushed this now. It is also in Fedora Rawhide:
> https://bodhi.fedoraproject.org/updates/FEDORA-2023-1026cbdd9a
> But there are some gating issues (nothing to do with the new code
> though, it seems to be dnf5/annobin/rpminspect issues).
>
> I'll try to also push it to centos stream.
>
> Cheers,
>
> Mark

Thanks ;)