Re: Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
Hi Luca,

On Mon, 2023-06-19 at 21:54 +0100, Luca Boccassi wrote:
> > > Which does bring up the question why systemd-coredump isn't
> > > running in the same mount space as the crashing program. Then it
> > > would simply find the files that the crashing program is using.
> >
> > On this point that systemd-coredump might not run in the same mount
> > namespace, don’t blindly believe me. I think I saw this while
> > reviewing the systemd code, but it was the first time I looked at it
> > to investigate this issue, so I may be wrong.
>
> This is correct, in case of containers sd-coredump will run on the
> host and collect from all the guests, so they are going to be in
> different namespaces. And even when they are not, the original binary
> might be long gone by the time it has a chance to run.

Since sd-coredump is run on the host it could transition into the mount
namespace of the process it is capturing the coredump for. That makes
sense if it is then going to do some path-based lookups. Alternatively
we could look into making the path-based lookups go through
/proc/PID/root/. Since sd-coredump is run as a special kernel helper
through /proc/sys/kernel/core_pattern the original executable is still
there (the process cannot be reaped till sd-coredump has done its
thing). The core(5) manpage states:

  *  The process runs in the initial namespaces (PID, mount, user, and
     so on) and not in the namespaces of the crashing process. One can
     utilize specifiers such as %P to find the right /proc/[pid]
     directory and probe/enter the crashing process's namespaces if
     needed.

https://man7.org/linux/man-pages/man5/core.5.html

Cheers,

Mark
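As an aside, here is a minimal sketch of the two options mentioned
above: entering the crashing process's mount namespace versus prefixing
lookups with /proc/PID/root. This is not systemd code; the helper names
are hypothetical and the pid is assumed to come from the %P specifier
mentioned in the core(5) excerpt.

/* Hypothetical helpers for a core_pattern handler running in the
   initial namespaces, given the crashing pid (e.g. from %P). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Option 1: switch into the crashing process's mount namespace.
   Needs CAP_SYS_ADMIN (and CAP_SYS_CHROOT) in the relevant user
   namespaces. */
static int
enter_mount_ns (pid_t pid)
{
  char path[64];
  snprintf (path, sizeof path, "/proc/%d/ns/mnt", (int) pid);
  int fd = open (path, O_RDONLY | O_CLOEXEC);
  if (fd < 0)
    return -1;
  int r = setns (fd, CLONE_NEWNS);
  close (fd);
  return r;
}

/* Option 2: stay in our own namespace but resolve paths through the
   crashing process's root directory. */
static int
open_in_proc_root (pid_t pid, const char *file)
{
  char path[4096];
  snprintf (path, sizeof path, "/proc/%d/root/%s", (int) pid, file);
  return open (path, O_RDONLY | O_CLOEXEC);
}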
Re: Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
On Tue, 20 Jun 2023 at 13:07, Mark Wielaard wrote:
>
> Hi Luca,
>
> On Mon, 2023-06-19 at 21:54 +0100, Luca Boccassi wrote:
> > > > Which does bring up the question why systemd-coredump isn't
> > > > running in the same mount space as the crashing program. Then it
> > > > would simply find the files that the crashing program is using.
> > >
> > > On this point that systemd-coredump might not run in the same
> > > mount namespace, don’t blindly believe me. I think I saw this
> > > while reviewing the systemd code, but it was the first time I
> > > looked at it to investigate this issue, so I may be wrong.
> >
> > This is correct, in case of containers sd-coredump will run on the
> > host and collect from all the guests, so they are going to be in
> > different namespaces. And even when they are not, the original
> > binary might be long gone by the time it has a chance to run.
>
> Since sd-coredump is run on the host it could transition into the
> mount namespace of the process it is capturing the coredump for. That
> makes sense if it is then going to do some path-based lookups.
> Alternatively we could look into making the path-based lookups go
> through /proc/PID/root/. Since sd-coredump is run as a special kernel
> helper through /proc/sys/kernel/core_pattern the original executable
> is still there (the process cannot be reaped till sd-coredump has done
> its thing).

This requires additional privileges that we really don't want to give
to sd-coredump, as it fundamentally processes untrusted inputs. Also,
as mentioned, there is no guarantee that the original binary will still
be around when the processing happens, it's all asynchronous, so there
would still be common corner cases where that doesn't help. There were
other suggestions in this thread to fix this issue; to me it seems
better to pursue those instead.

Kind regards,
Luca Boccassi
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
Hi Romain,

On Mon, 2023-06-19 at 19:56 +0000, Romain GEISSLER via Elfutils-devel wrote:
> Thanks! And sorry that Laurent had pinged you directly on Slack, I
> wanted to reach you via this mailing list instead of through the Red
> Hat customer network ;)

Slack isn't a very effective way to reach me. Most elfutils hackers do
hang out on the Libera.Chat irc channel #elfutils.

> I don’t know if you read the Red Hat case too. There you can find
> things a bit more clarified, and split into what I think are
> potentially 3 distinct "problems" with 3 distinct possible fixes.
> Since there is nothing private, I can write about this here as well
> on this public mailing list.

I haven't looked whether I have access to the customer case, since you
provided such a great reproducer.

> So in the end I see 3 points (in addition to not understanding why
> finding the elf header returns NULL while it should not, which I
> guess you are currently looking at):
>
> - the idea that systemd developers should invert their logic: first
>   try to parse elf/program headers from the (maybe partial) core dump
>   PT_LOAD program headers

Yes, that could in theory also be done through a custom
callbacks->find_elf (a rough sketch of such a callback follows below).

> - This special "if" condition that I have added in the original
>   systemd code:
>
>   +        /* This PT_LOAD section doesn't contain the start address,
>   +           so it can't be the module we are looking for. */
>   +        if (start < program_header->p_vaddr ||
>   +            start >= program_header->p_vaddr + program_header->p_memsz)
>   +                continue;
>
>   to be added near this line:
>   https://github.com/systemd/systemd/blob/72e7bfe02d7814fff15602726c7218b389324159/src/shared/elf-util.c#L540
>
>   on which I would like to ask you if indeed it seems like a "right"
>   fix with your knowledge of how core dump and elf files are shaped.

Yes, that does make sense.

> - The idea that maybe this commit
>   https://sourceware.org/git/?p=elfutils.git;a=commitdiff;h=8db849976f07046d27b4217e9ebd08d5623acc4f
>   which assumed that normally the order of magnitude of program
>   headers is 10 for a "normal" elf file, so a linked list would be
>   enough, might be wrong in the special case of core dumps, which may
>   have many more program headers. And if it indeed makes sense to
>   call elf_getdata_rawchunk for each and every program header of a
>   core, should this linked list be changed into some set/hashmap
>   indexed by start address/size?

Interesting. Yeah, a linked list is not the ideal datastructure here.
But I don't think it is causing the really long delay. That clearly
comes from the (negative) inode/dentry file search cache. We could
look at this after we solve the other issues, if we want to speed
things up a bit more.

Cheers,

Mark
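As a follow-up to the custom callbacks->find_elf idea above, this is a
rough sketch of what such a callback could look like if it resolved
module paths through the crashing process's /proc/PID/root, tying in
with the namespace discussion earlier in the thread. It is an
illustration only, not actual systemd or elfutils code; crashing_pid is
hypothetical, and it assumes module names reported from the core are
absolute paths.

#define _GNU_SOURCE
#include <elfutils/libdwfl.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t crashing_pid;   /* e.g. parsed from the %P specifier */

static int
find_elf_in_proc_root (Dwfl_Module *mod, void **userdata,
                       const char *modname, Dwarf_Addr base,
                       char **file_name, Elf **elfp)
{
  char *path = NULL;
  if (asprintf (&path, "/proc/%d/root%s", (int) crashing_pid, modname) < 0)
    return -1;

  int fd = open (path, O_RDONLY | O_CLOEXEC);
  if (fd < 0)
    {
      free (path);
      /* Fall back to the standard build-id based lookup. */
      return dwfl_build_id_find_elf (mod, userdata, modname, base,
                                     file_name, elfp);
    }

  *file_name = path;   /* malloc'd name, libdwfl takes ownership */
  return fd;
}

static const Dwfl_Callbacks callbacks =
  {
    .find_elf = find_elf_in_proc_root,
    .find_debuginfo = dwfl_standard_find_debuginfo,
    .section_address = dwfl_offline_section_address,
  };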
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
Hi,

On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
> Just to let you know I am looking at this. But I haven't made much
> progress in understanding it yet. Thanks so much for the reproducer. I
> have been able to see the (very slow) parsing of the core file with it.
>
> $ time ./mimic-systemd-coredump
> [...]
> real    3m35.965s
> user    0m0.722s
> sys     0m0.345s
>
> Note however that a lot of time is "missing".
> And in fact running it again is fast!?!
>
> $ time ./mimic-systemd-coredump
> real    0m0.327s
> user    0m0.272s
> sys     0m0.050s
>
> This is because of the kernel inode/dentry cache. If I do
>
> $ echo 2 | sudo tee /proc/sys/vm/drop_caches
>
> before running ./mimic-systemd-coredump it is always slow.
>
> I'll try to figure out what we do to make it so hard for the kernel to
> do these lookups.

So I made a mistake here: I was testing on Fedora 38, which has
DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
slowdown.

Do you have the DEBUGINFOD_URLS environment variable set?

The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).

Thanks,

Mark
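To make timing runs independent of the environment, the reproducer can
clear DEBUGINFOD_URLS itself before touching libdwfl, so that no
debuginfod client lookups are attempted. A minimal sketch, assuming the
mimic-systemd-coredump reproducer is a C program using libdwfl
directly:

#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  /* With DEBUGINFOD_URLS unset (or empty) the libdebuginfod client
     used by libdwfl performs no lookups, so the run measures only the
     core file parsing itself. */
  unsetenv ("DEBUGINFOD_URLS");

  printf ("DEBUGINFOD_URLS is %s\n",
          getenv ("DEBUGINFOD_URLS") != NULL ? "still set" : "unset");

  /* ... the rest of the reproducer (dwfl_begin, dwfl_core_file_report,
     dwfl_getmodules, ...) would follow here ... */
  return 0;
}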
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
> On 20 June 2023, at 23:37, Mark Wielaard wrote:
>
> Hi,
>
> On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
>
> So I made a mistake here: I was testing on Fedora 38, which has
> DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
> slowdown.
>
> Do you have the DEBUGINFOD_URLS environment variable set?
>
> The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).
>
> Thanks,
>
> Mark

Hi,

Our real use case happens on an OpenShift 4.13 node, so the OS is Red
Hat CoreOS 9 (which I assume shares a lot of foundations with RHEL 9).

On our side Francois also told me this afternoon that he didn’t really
reproduce the same thing with my reproducer posted here as with the
real systemd-coredump issue he witnessed live, and he also noticed that
with DEBUGINFOD_URLS unset or set to the empty string my reproducer has
no problem anymore. What he witnessed in the real case (using perf/gdb)
was that apparently lots of time was spent in elf_getdata_rawchunk,
often in this kind of stack:

Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 1646850
Overhead  Command         Shared Object       Symbol
  98.24%  (sd-parse-elf)  libelf-0.188.so     [.] elf_getdata_rawchunk
   0.48%  (sd-parse-elf)  libelf-0.188.so     [.] 0x48a3
   0.27%  (sd-parse-elf)  libelf-0.188.so     [.] gelf_getphdr
   0.11%  (sd-parse-elf)  libc.so.6           [.] _int_malloc
   0.10%  (sd-parse-elf)  libelf-0.188.so     [.] gelf_getnote
   0.06%  (sd-parse-elf)  libc.so.6           [.] __libc_calloc
   0.05%  (sd-parse-elf)  [kernel.kallsyms]   [k] __softirqentry_text_start
   0.05%  (sd-parse-elf)  libc.so.6           [.] _int_free

(gdb) bt
#0  0x7f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1
#1  0x7f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#2  0x7f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1
#3  0x7f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#4  0x562c474f2d5e in submit_coredump ()
#5  0x562c474f57d1 in process_socket.constprop ()
#6  0x562c474efbf8 in main ()

My reproducer doesn’t fully re-implement what systemd implements (the
parsing of the package metadata is clearly omitted), so I thought I had
reproduced the same problem while apparently I hadn’t, sorry for that.
We will also have to double check whether just using 2000 dummy
libraries is enough, or whether this also needs a more complex binary
like we have in our real case. Tomorrow on our side we will play a bit
with a local build of systemd-coredump and try to run it manually to
better understand what’s going wrong.

Note: when I wrote and tested my reproducer, I used a fedora:38
container, which doesn’t have DEBUGINFOD_URLS set (which may be
different from a real, non-containerized Fedora 38):

[root@7563ccfb7a39 /]# printenv | grep DEBUGINFOD_URLS
[root@7563ccfb7a39 /]# find /etc/profile.d/ | grep debug
[root@7563ccfb7a39 /]# cat /etc/os-release
NAME="Fedora Linux"
VERSION="38 (Container Image)"

Cheers,
Romain
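The profile above suggests elf_getdata_rawchunk ends up being called
from module_callback for many program headers of the core, once per
reported module. The following is a simplified sketch of that access
pattern, not the actual systemd code (the real callback does more, e.g.
looking for the package metadata note); it only illustrates why the
linked list inside libelf gets scanned over and over when a core file
has thousands of program headers:

#include <gelf.h>
#include <libelf.h>

/* Every elf_getdata_rawchunk call first searches libelf's internal
   list of previously requested raw chunks before possibly reading and
   appending a new one, so with N modules and M program headers the
   total work is roughly N * M list scans over a growing list. */
static void
touch_load_segments (Elf *core)
{
  size_t phnum;
  if (elf_getphdrnum (core, &phnum) != 0)
    return;

  for (size_t i = 0; i < phnum; i++)
    {
      GElf_Phdr phdr_mem;
      GElf_Phdr *phdr = gelf_getphdr (core, i, &phdr_mem);
      if (phdr == NULL || phdr->p_type != PT_LOAD || phdr->p_filesz == 0)
        continue;

      /* The linear search of the rawchunk list happens inside here. */
      Elf_Data *data = elf_getdata_rawchunk (core, phdr->p_offset,
                                             phdr->p_filesz, ELF_T_BYTE);
      (void) data;
    }
}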