Performance issue with systemd-coredump and container process linking 2000 shared libraries.
Hi,

I am reaching out to the elfutils mailing list to discuss an issue we have seen in one of our OpenShift production workloads. Note that if you are a Red Hat employee, this actually corresponds to Red Hat support case 03536980, which I have yet to update with this information.

Description of the issue: while moving from OpenShift 4.12 to 4.13, we also moved the system from a RHEL 8 base to a RHEL 9 one. Among the changes, this upgrades systemd to version 252. Historically systemd-coredump used to care only about build-ids when analysing core dumps, but now it also tries to parse the package metadata note section (cf. https://systemd.io/ELF_PACKAGE_METADATA/). We use systemd-coredump as our default core handler, and since this is an OpenShift node we mainly run containers there. One of our containers has a particular binary which links against close to 2000 shared libraries, and when this program crashes, it takes several minutes for systemd-coredump to "analyze" the core, with that time spent parsing the ELF part of it.

I have been able to extract the logic of systemd (cf. https://github.com/systemd/systemd/blob/22afd5cb2cf06418e5a84907b1ff782d29a33e47/src/shared/elf-util.c#L482) into a simplified C program that links against systemd, to show how "slow" systemd is. I attach this program a bit further down, and also explain my findings and what "fix" I suggested internally that we propose to the systemd folks.

First, to reproduce, we can create a dummy binary that links against 2000 libs. Credit for finding and writing this simple reproducer goes to Francois Rigault, one of my colleagues:

cat > lib.c << EOF
int TOTO = 42;
EOF

cat > generate-core.c << EOF
#include <stdio.h>

int main() {
  printf("started\n");
  int* i = 0;
  *i = 42;
}
EOF

N_LIBS=2000
LIBS=""
for i in $(seq 1 ${N_LIBS}); do
  [[ -f liba$i.so ]] || gcc -shared -fPIC -o liba$i.so lib.c -D TOTO=var$i
  LIBS="$LIBS -la$i"
done
echo libs built
gcc -o generate-core generate-core.c -L. -Wl,-rpath,$(pwd) ${LIBS}
echo running binary
time ./generate-core
# Here a core is generated; rename it to "the-core" in the current directory.
# Finally, remove all libaN.so files, to simulate the fact that systemd-coredump
# runs in a different mount namespace than the container process which crashed
# (otherwise the following reproducer might be "fast"):
rm liba*.so

Once a core dump has been generated, I tried to mimic the code of systemd-coredump. It looks like this:

cat > mimic-systemd-coredump.c << EOF
#include "elfutils/libdwfl.h"
#include <stdio.h>
#include <assert.h>

static const Dwfl_Callbacks callbacks = {
    .find_elf = dwfl_build_id_find_elf,
    .section_address = dwfl_offline_section_address,
    .find_debuginfo = dwfl_standard_find_debuginfo,
};

void print_elf_error() {
    printf("Elf error: %s\n", elf_errmsg(elf_errno()));
}

void print_dwfl_error() {
    printf("Dwfl error: %s\n", dwfl_errmsg(dwfl_errno()));
}

// In systemd, this tries to retrieve the JSON package metadata note.
// Returns a value > 0 when data is found.
// Returns 0 when no data is found, but no error happened.
// Returns a value < 0 when a hard error happens and we have to stop.
int parse_module_package_metadata_best_effort(Elf* elf, const char* name) {
    size_t n_program_headers;
    int r = elf_getphdrnum(elf, &n_program_headers);
    if (r == 0) {
        printf("Found %zu program headers for module %s\n", n_program_headers, name);
        // Don't implement the full systemd logic here.
        return 1;
    } else {
        // Best effort parsing: silently ignore the error.
        return 0;
    }
}

static int module_callback(Dwfl_Module *mod, void **userdata __attribute__ ((unused)),
                           const char *name, Dwarf_Addr start, void *arg) {
    Elf* elf = NULL;
    GElf_Addr bias;

    // Comment the next line to "improve" performance.
    elf = dwfl_module_getelf(mod, &bias);
    if (elf) {
        printf("dwfl_module_getelf has found the elf program %s\n", name);
        int ret = parse_module_package_metadata_best_effort(elf, name);
        if (ret > 0)
            return DWARF_CB_OK;
        else if (ret < 0)
            return DWARF_CB_ABORT;
        else
            elf = arg;
    } else {
        printf("dwfl_module_getelf returned NULL for module %s\n", name);
        //print_dwfl_error();
        elf = arg;
    }

    size_t n_program_headers;
    assert(elf_getphdrnum(elf, &n_program_headers) >= 0);
    printf("iterate: %s: %zu\n", name, n_program_headers);
    for (size_t i = 0; i < n_program_headers; ++i) {
        GElf_Phdr mem, *program_header;
        Elf_Data *data;

        /* The core file stores the ELF files in the PT_LOAD segments. */
        program_header = gelf_getphdr(elf, i, &mem);
        if (!program_header || program_header->p_type != PT_LOAD)
            continue;

        /* This PT_LOAD section doesn't contain the start address, so it
           can't be the module we are looking for. */
        if (start < program_header->p_vaddr ||
            start >= program_header->p_vaddr + program_header->p_memsz)
            continue;
[...]
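To give an idea of what the omitted part does: the real systemd code then reads the matching PT_LOAD segment out of the core and tries to open it as an in-memory ELF image, which is the part that ends up calling elf_getdata_rawchunk for the PT_LOAD headers. Roughly like this (a simplified sketch from my reading of elf-util.c, with error handling and size checks omitted, so not the literal systemd code):

        /* Sketch: pull the candidate ELF image out of the core's PT_LOAD
         * segment, then open those bytes as an in-memory ELF object. */
        data = elf_getdata_rawchunk(elf, program_header->p_offset,
                                    program_header->p_filesz, ELF_T_BYTE);
        if (data && data->d_buf) {
            Elf *memelf = elf_memory((char *) data->d_buf, data->d_size);
            if (memelf) {
                parse_module_package_metadata_best_effort(memelf, name);
                elf_end(memelf);
            }
        }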
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
> On 19 Jun 2023, at 17:08, Mark Wielaard wrote:
>
> Hi Romain,
>
> Just to let you know I am looking at this. But haven't made much
> progress in understanding it yet. Thanks so much for the reproducer. I
> have been able to see the (very slow) parsing of the core file with it.

Hi,

Thanks! And sorry that Laurent pinged you directly on Slack, I wanted to reach you via this mailing list instead of through the Red Hat customer network ;)

I don't know if you have read the Red Hat case too. There things are clarified a bit more, and split into what I think are potentially 3 distinct "problems" with 3 distinct possible fixes. Since there is nothing private in it, I can write about this here on this public mailing list as well.

So in the end I see 3 points (in addition to not understanding why finding the elf header returns NULL while it should not, which I guess you are currently looking at):

- the idea that the systemd developers should invert their logic: first try to parse the elf/program headers from the (maybe partial) core dump PT_LOAD program headers;

- this special "if" condition that I have added to the original systemd code:

+/* This PT_LOAD section doesn't contain the start address, so it can't be the module we are looking for. */
+if (start < program_header->p_vaddr || start >= program_header->p_vaddr + program_header->p_memsz)
+continue;

  to be added near this line: https://github.com/systemd/systemd/blob/72e7bfe02d7814fff15602726c7218b389324159/src/shared/elf-util.c#L540
  I would like to ask you whether this indeed seems like a "right" fix, given your knowledge of how core dumps and elf files are shaped;

- the idea that this commit https://sourceware.org/git/?p=elfutils.git;a=commitdiff;h=8db849976f07046d27b4217e9ebd08d5623acc4f, which assumed that the order of magnitude of program headers in a "normal" elf file is 10, so that a linked list would be enough, might be wrong in the special case of core dumps, which may have many more program headers. And if it indeed makes sense to call elf_getdata_rawchunk for each and every program header of a core, should this linked list be changed into some set/hashmap indexed by start address/size? (See the sketch at the end of this mail.)

> $ time ./mimic-systemd-coredump
> [...]
> real	3m35.965s
> user	0m0.722s
> sys	0m0.345s
>
> Note however that a lot of time is "missing".
> And in fact running it again is fast!?!
>
> $ time ./mimic-systemd-coredump
> real	0m0.327s
> user	0m0.272s
> sys	0m0.050s
>
> This is because of the kernel inode/dentry cache.
> If I do $ echo 2 | sudo tee /proc/sys/vm/drop_caches
> before running ./mimic-systemd-coredump it is always slow.

Interesting! I didn't see that (actually I never let the program run till the end!).

> Which does bring up the question why systemd-coredump isn't running in
> the same mount space as the crashing program. Then it would simply find
> the files that the crashing program is using.

On this point that systemd-coredump might not run in the same mount namespace, don't blindly believe me. I think I saw this while reviewing the systemd code, but it was the first time I looked at it while investigating this issue, so I may be wrong. But I am sure you have access to some systemd colleagues at Red Hat to double-check the details ;)

Cheers,
Romain
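PS: to make the set/hashmap idea from my third point a bit more concrete, here is a minimal sketch of the direction I have in mind, using the libc tsearch tree. The (offset, size, type) key and all the names here are my own invention for illustration; this is not the actual elfutils internal layout:

#include <search.h>
#include <stdint.h>
#include <libelf.h>

/* Hypothetical cache entry; I assume chunks can be deduplicated by
   (offset, size, type), mirroring the arguments of elf_getdata_rawchunk. */
struct rawchunk {
    int64_t offset;
    size_t size;
    Elf_Type type;
    Elf_Data *data;
};

static int rawchunk_cmp(const void *a, const void *b) {
    const struct rawchunk *ka = a, *kb = b;
    if (ka->offset != kb->offset)
        return ka->offset < kb->offset ? -1 : 1;
    if (ka->size != kb->size)
        return ka->size < kb->size ? -1 : 1;
    return (int) ka->type - (int) kb->type;
}

/* O(log n) lookup-or-insert instead of walking a linear list. */
static Elf_Data *cache_lookup_or_insert(void **root, struct rawchunk *chunk) {
    struct rawchunk **slot = tsearch(chunk, root, rawchunk_cmp);
    if (slot == NULL)
        return NULL;          /* out of memory */
    if (*slot != chunk)
        return (*slot)->data; /* an equal chunk was already cached */
    return chunk->data;       /* chunk was newly inserted in the tree */
}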
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
> On 20 Jun 2023, at 23:37, Mark Wielaard wrote:
>
> Hi,
>
> On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
>
> So I made a mistake here. Since I was testing on fedora 38 which has
> DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
> slowdown.
>
> Do you have the DEBUGINFOD_URLS environment variable set?
>
> The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).
>
> Thanks,
>
> Mark

Hi,

Our real use case happens on an OpenShift 4.13 node, so the OS is Red Hat CoreOS 9 (which I assume shares a lot of foundations with RHEL 9). On our side, Francois also told me this afternoon that my reproducer posted here doesn't really reproduce the real systemd-coredump issue he witnessed live, and he also noticed that with DEBUGINFOD_URLS unset or set to the empty string, my reproducer has no problem anymore.

What he witnessed in the real case (using perf/gdb) was that apparently lots of time was spent in elf_getdata_rawchunk, often in this kind of stack:

Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 1646850
Overhead  Command         Shared Object        Symbol
  98.24%  (sd-parse-elf)  libelf-0.188.so      [.] elf_getdata_rawchunk
   0.48%  (sd-parse-elf)  libelf-0.188.so      [.] 0x48a3
   0.27%  (sd-parse-elf)  libelf-0.188.so      [.] gelf_getphdr
   0.11%  (sd-parse-elf)  libc.so.6            [.] _int_malloc
   0.10%  (sd-parse-elf)  libelf-0.188.so      [.] gelf_getnote
   0.06%  (sd-parse-elf)  libc.so.6            [.] __libc_calloc
   0.05%  (sd-parse-elf)  [kernel.kallsyms]    [k] __softirqentry_text_start
   0.05%  (sd-parse-elf)  libc.so.6            [.] _int_free

(gdb) bt
#0  0x7f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1
#1  0x7f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#2  0x7f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1
#3  0x7f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#4  0x562c474f2d5e in submit_coredump ()
#5  0x562c474f57d1 in process_socket.constprop ()
#6  0x562c474efbf8 in main ()

My reproducer doesn't fully re-implement what systemd implements (the parsing of the package metadata is clearly omitted), so I thought I had reproduced the same problem while apparently I hadn't, sorry for that. We will also have to double-check whether just using 2000 dummy libraries is enough, or whether this also needs a more complex binary like the one in our real case. Tomorrow on our side we will play a bit with a local build of systemd-coredump and try to run it manually to better understand what's going wrong.

Note: when I wrote and tested my reproducer, I used a fedora:38 container, which doesn't have DEBUGINFOD_URLS set (this may differ from a real, non-containerized fedora 38):

[root@7563ccfb7a39 /]# printenv | grep DEBUGINFOD_URLS
[root@7563ccfb7a39 /]# find /etc/profile.d/ | grep debug
[root@7563ccfb7a39 /]# cat /etc/os-release
NAME="Fedora Linux"
VERSION="38 (Container Image)"

Cheers,
Romain
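PS: if DEBUGINFOD_URLS is indeed what makes my standalone reproducer slow, an easy way to rule it out directly inside the reproducer, instead of via the shell environment, would be something like the snippet below. My assumption here is that this environment variable is the only trigger for the debuginfod fallback in dwfl_standard_find_debuginfo:

#include <stdlib.h>

/* Sketch: call this before dwfl_begin() so that the
   dwfl_standard_find_debuginfo callback cannot reach out to remote
   debuginfod servers during the measurements (assumption: the env
   var is the only trigger for that fallback). */
static void disable_debuginfod(void) {
    unsetenv("DEBUGINFOD_URLS");
}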
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
> On 21 Jun 2023, at 18:24, Mark Wielaard wrote:
>
> Hi Romain,
>
> Seeing those performance results I understand why you were suspecting
> the linked list data structure used in elf_getdata_rawchunk.
>
> Would you be able to test a rebuilt libelf with the attached patch,
> which replaces that data structure with a binary search tree?
>
> It didn't really show much speedup locally (in the noise, maybe 0.01
> sec faster on a ~0.25 sec run). But if there are more than 2000 calls to
> elf_getdata_rawchunk it should make things faster.

Hi Mark,

Actually I spent some time today reproducing the issue for real, using systemd directly. The details can be found in https://github.com/systemd/systemd/pull/28093, which was just merged. This change in systemd is enough to fix the biggest part of the "slow" parsing of cores.

Yet even with my systemd patch we will still call elf_getdata_rawchunk 2000 times if the crashing process had 2000 shared libraries, so your patch in elfutils might indeed still be welcome. I will test it later when I go home, or tomorrow.

Thanks,
Romain
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
> On 21 Jun 2023, at 21:39, Mark Wielaard wrote:
>
> Hi Romain,
>
> That patch looks good. It should reduce the number of calls to
> elf_getdata_rawchunk a lot. Making it less urgent that function is
> fast. But if you could test it that would be appreciated. I'll also
> test it a bit more and will probably merge it if no issues show up,
> since it does seem better than keeping the linear list.
>
> Thanks,
>
> Mark

Hi,

So I have done some testing, running the command:

[root@fa28b3d254fd systemd]# rm -rf /var/lib/systemd/coredump/* && journalctl --flush --rotate && journalctl --vacuum-time=1s && time cat the-core | build/systemd-coredump $$ 0 0 11 1687360485 1000 localhost && journalctl

where "the-core" is the real core dump we had in production, with around 1700+ shared libraries loaded and an uncompressed size of 2GB+.

Without the systemd patch, without the elfutils patch:

real	3m42.308s
user	3m39.579s
sys	0m2.294s

Without the systemd patch, with the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):

real	0m15.543s
user	0m13.662s
sys	0m2.405s

real	0m15.976s
user	0m13.832s
sys	0m2.481s

real	0m15.612s
user	0m13.687s
sys	0m2.470s

With the systemd patch, without the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):

real	0m2.994s
user	0m1.104s
sys	0m2.477s

real	0m3.011s
user	0m1.154s
sys	0m2.447s

real	0m3.009s
user	0m1.141s
sys	0m2.457s

With the systemd patch, with the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):

real	0m2.921s
user	0m1.093s
sys	0m2.399s

real	0m2.950s
user	0m1.129s
sys	0m2.401s

real	0m2.933s
user	0m1.136s
sys	0m2.371s

Overall we can see that both fixes independently solve the performance issue (though the systemd patch seems a bit more effective), so I guess both are welcome.

Mark, do you think it's worth backporting this to CentOS Stream 9/RHEL 9 on the elfutils side? If you need a ticket, we have Red Hat case 03536980, which led to RHBZ 2215412.

Thanks!
Romain
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
> On 27 Jun 2023, at 16:09, Mark Wielaard wrote:
>
> Hi,
>
> On Wed, 2023-06-21 at 18:24 +0200, Mark Wielaard wrote:
>
> I pushed this now. It is also in Fedora Rawhide:
> https://bodhi.fedoraproject.org/updates/FEDORA-2023-1026cbdd9a
> But there are some gating issues (nothing to do with the new code
> though, it seems to be dnf5/annobin/rpminspect issues).
>
> I'll try to also push it to centos stream.
>
> Cheers,
>
> Mark

Thanks ;)