Re: Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-20 Thread Mark Wielaard
Hi Luca,

On Mon, 2023-06-19 at 21:54 +0100, Luca Boccassi wrote:
> > > Which does bring up the question why systemd-coredump isn't running
> > > in the same mount space as the crashing program. Then it would
> > > simply find the files that the crashing program is using.
> > 
> > On this point that systemd-coredump might not run in the same mount
> > namespace, don’t blindly believe me. I think I saw this while
> > reviewing the systemd code, but it was the first time I looked at it
> > to investigate this issue, so I may be wrong.
> 
> This is correct, in case of containers sd-coredump will run on the host
> and collect from all the guests, so they are going to be in different
> namespaces. And even when they are not, the original binary might be
> long gone by the time it has a chance to run.

Since sd-coredump is run on the host it could transition into the mount
namespace of the process it is capturing the coredump for. This makes
sense if it is then going to do some path-based lookups. Alternatively
we could look into making the path-based lookups look under
/proc/PID/root/. Since sd-coredump is run as a special kernel helper
through /proc/sys/kernel/core_pattern the original executable is still
there (it cannot be reaped till sd-coredump has done its thing).
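
To make the /proc/PID/root/ idea a bit more concrete, here is a minimal
sketch (an illustration only, not existing sd-coredump code; the helper
name is made up) of prefixing a path-based lookup with the crashing
process's root directory as seen through /proc:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Open PATH as the crashing process PID sees it, by going through
   /proc/PID/root/.  No namespace switching is needed, but the helper
   must have sufficient privileges over that process to traverse
   /proc/PID/root.  */
static int
open_in_crashing_mount_ns (pid_t pid, const char *path)
{
  char buf[PATH_MAX];
  if (snprintf (buf, sizeof buf, "/proc/%ld/root%s", (long) pid, path)
      >= (int) sizeof buf)
    return -1;
  return open (buf, O_RDONLY | O_CLOEXEC);
}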

The core (5) manpage states:

   *  The process runs in the initial namespaces (PID, mount, user,
      and so on) and not in the namespaces of the crashing process.
      One can utilize specifiers such as %P to find the right
      /proc/[pid] directory and probe/enter the crashing process's
      namespaces if needed.

https://man7.org/linux/man-pages/man5/core.5.html
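
And for the transition-into-the-namespace alternative, a hedged sketch
(again not sd-coredump's actual code, and assuming the crashing PID is
handed to the helper via a %P specifier in
/proc/sys/kernel/core_pattern) could look like this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Join the mount namespace of the crashing process before doing any
   path-based lookups.  Needs CAP_SYS_ADMIN (and CAP_SYS_CHROOT), which
   is exactly the kind of extra privilege that is debatable for a tool
   parsing untrusted core files.  */
static int
enter_mount_namespace (pid_t pid)
{
  char path[64];
  snprintf (path, sizeof path, "/proc/%ld/ns/mnt", (long) pid);

  int fd = open (path, O_RDONLY | O_CLOEXEC);
  if (fd < 0)
    return -1;

  int r = setns (fd, CLONE_NEWNS);
  close (fd);
  return r;
}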

Cheers,

Mark


Re: Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-20 Thread Luca Boccassi
On Tue, 20 Jun 2023 at 13:07, Mark Wielaard  wrote:
>
> Hi Luca,
>
> On Mon, 2023-06-19 at 21:54 +0100, Luca Boccassi wrote:
> > > > Which does bring up the question why systemd-coredump isn't running
> > > > in the same mount space as the crashing program. Then it would
> > > > simply find the files that the crashing program is using.
> > >
> > > On this point that systemd-coredump might not run in the same mount
> > > namespace, don’t blindly believe me. I think I saw this while
> > > reviewing the systemd code, but it was the first time I looked at it
> > > to investigate this issue, so I may be wrong.
> >
> > This is correct, in case of containers sd-coredump will run on the host
> > and collect from all the guests, so they are going to be in different
> > namespaces. And even when they are not, the original binary might be
> > long gone by the time it has a chance to run.
>
> Since sd-coredump is run on the host it could transition into the mount
> namespace of the process it is capturing the coredump for. This makes
> sense if it is then going to do some path-based lookups. Alternatively
> we could look into making the path-based lookups look under
> /proc/PID/root/. Since sd-coredump is run as a special kernel helper
> through /proc/sys/kernel/core_pattern the original executable is still
> there (it cannot be reaped till sd-coredump has done its thing).

This requires additional privileges that we really don't want to give
to sd-coredump, as it fundamentally processes untrusted inputs. Also,
as mentioned, there is no guarantee that the original binary will still
be around when the processing happens; it is all asynchronous, so there
would still be common corner cases where that doesn't help. There were
other suggestions in this thread to fix this issue; to me it seems
better to pursue those instead.

Kind regards,
Luca Boccassi


Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-20 Thread Mark Wielaard
Hi Romain,

On Mon, 2023-06-19 at 19:56 +, Romain GEISSLER via Elfutils-devel
wrote:
> 
> Thanks! And sorry that Laurent had pinged you directly on Slack; I
> wanted to reach you via this mailing list instead of through the Red
> Hat customer network ;)

Slack isn't a very effective way to reach me. Most elfutils hackers do
hang out on the Libera.Chat IRC channel #elfutils.

> I don’t know if you read the Red Hat case too. There you can find
> things clarified a bit more, and split into what I think are
> potentially 3 distinct "problems" with 3 distinct possible fixes. Since
> there is nothing private, I can write about this here on this public
> mailing list as well.

I haven't checked whether I have access to the customer case, since you
provided such a great reproducer.

> So in the end I see 3 points (in addition to not understanding why
> finding the ELF header returns NULL while it should not, which I
> guess you are currently looking at):
>  - the idea that systemd developers should invert their logic: first
> try to parse ELF/program headers from the (maybe partial) core dump
> PT_LOAD segments

Yes, that could in theory also be done through a custom
callbacks->find_elf.
>  - this special "if" condition that I have added to the original
> systemd code:
> 
> +        /* This PT_LOAD segment doesn't contain the start address,
> +           so it can't be the module we are looking for. */
> +        if (start < program_header->p_vaddr ||
> +            start >= program_header->p_vaddr + program_header->p_memsz)
> +                continue;
> 
> to be added near this line:
> https://github.com/systemd/systemd/blob/72e7bfe02d7814fff15602726c7218b389324159/src/shared/elf-util.c#L540
> 
> on which I would like to ask you whether it indeed looks like the
> "right" fix, with your knowledge of how core dump and ELF files are
> shaped.

Yes, that does make sense.
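
For context, that check sits in a loop over the core's program headers;
a simplified sketch of such a loop (not the actual elf-util.c code)
would be:

#include <gelf.h>

/* Return the PT_LOAD program header whose [p_vaddr, p_vaddr + p_memsz)
   range covers START, or NULL if none does.  MEM is caller-provided
   storage for the result.  */
static GElf_Phdr *
find_load_segment (Elf *core, GElf_Addr start, GElf_Phdr *mem)
{
  size_t phnum;
  if (elf_getphdrnum (core, &phnum) < 0)
    return NULL;

  for (size_t i = 0; i < phnum; i++)
    {
      GElf_Phdr *phdr = gelf_getphdr (core, i, mem);
      if (phdr == NULL || phdr->p_type != PT_LOAD)
        continue;
      /* The fix quoted above: this PT_LOAD segment cannot be the
         module we are looking for if it does not contain START.  */
      if (start < phdr->p_vaddr || start >= phdr->p_vaddr + phdr->p_memsz)
        continue;
      return phdr;
    }
  return NULL;
}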

>  - the idea that maybe this commit
> https://sourceware.org/git/?p=elfutils.git;a=commitdiff;h=8db849976f07046d27b4217e9ebd08d5623acc4f
> which assumed that the number of program headers of a "normal" ELF file
> is on the order of 10, so a linked list would be enough, might be wrong
> in the special case of core dumps, which may have many more program
> headers. And if it indeed makes sense to call elf_getdata_rawchunk for
> each and every program header of a core, should this linked list then
> be changed into some set/hashmap indexed by start address/size?

Interesting. Yeah, a linked list is not the ideal data structure here.
But I don't think it is causing the really long delay. That clearly
comes from the (negative) inode/dentry file search cache. But we could
look at this after we solve the other issues, if we want to speed
things up a bit more.
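
If we do get to that point, one possibility (just a sketch of the idea,
not actual elfutils code; the struct and variable names are made up)
would be to key the cached chunks in a search tree instead of walking a
list:

#include <search.h>
#include <stddef.h>
#include <stdint.h>
#include <libelf.h>

/* One cached raw chunk, keyed on (offset, size, type).  */
struct rawchunk_key
{
  int64_t offset;
  size_t size;
  Elf_Type type;
  Elf_Data *data;               /* The cached chunk itself.  */
};

static int
compare_rawchunk (const void *a, const void *b)
{
  const struct rawchunk_key *ka = a, *kb = b;
  if (ka->offset != kb->offset)
    return ka->offset < kb->offset ? -1 : 1;
  if (ka->size != kb->size)
    return ka->size < kb->size ? -1 : 1;
  return (int) ka->type - (int) kb->type;
}

/* With a tree root like "void *cache_root", each elf_getdata_rawchunk
   call could then do an O(log n) lookup instead of an O(n) list walk:

     struct rawchunk_key key = { offset, size, type, NULL };
     void *found = tfind (&key, &cache_root, compare_rawchunk);  */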

Cheers,

Mark


Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-20 Thread Mark Wielaard
Hi,

On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
> Just to let you know I am looking at this. But haven't made much
> progress in understanding it yet. Thanks so much for the reproducer. I
> have been able to see the (very slow) parsing of the core file with it.
> 
> $ time ./mimic-systemd-coredump
> [...]
> real  3m35.965s
> user  0m0.722s
> sys   0m0.345s
> 
> Note however that a lot of time is "missing".
> And in fact running it again is fast!?!
> 
> $ time ./mimic-systemd-coredump
> real  0m0.327s
> user  0m0.272s
> sys   0m0.050s
> 
> This is because of the kernel inode/dentry cache.
> If I do $ echo 2 | sudo tee /proc/sys/vm/drop_caches
> before running ./mimic-systemd-coredump it is always slow.
> 
> I'll try to figure out what we do to make it so hard for the kernel to
> do these lookups.

So I made a mistake here, since I was testing on Fedora 38, which has
DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
slowdown.

Do you have the DEBUGINFOD_URLS environment variable set?

The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).

Thanks,

Mark


Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-20 Thread Romain GEISSLER via Elfutils-devel
> On 20 Jun 2023, at 23:37, Mark Wielaard wrote:
> 
> Hi,
> 
> On Mon, Jun 19, 2023 at 05:08:50PM +0200, Mark Wielaard wrote:
> 
> So I made a mistake here, since I was testing on Fedora 38, which has
> DEBUGINFOD_URLS set. Without DEBUGINFOD_URLS set there is no big
> slowdown.
> 
> Do you have the DEBUGINFOD_URLS environment variable set?
> 
> The real sd-coredump will not have DEBUGINFOD_URLS set (I hope).
> 
> Thanks,
> 
> Mark

Hi,

Our real use case happens on an OpenShift 4.13 node, so the OS is Red
Hat CoreOS 9 (which I assume shares a lot of foundations with RHEL 9).

On our side, Francois also told me this afternoon that my reproducer
posted here doesn't really reproduce the same thing as the real
systemd-coredump issue he witnessed live, and he also noticed that with
DEBUGINFOD_URLS unset or set to the empty string my reproducer has no
problem anymore. What he witnessed in the real case (using perf/gdb)
was that apparently lots of time was spent in elf_getdata_rawchunk,
often with this kind of stack:

Samples: 65K of event 'cpu-clock:pppH', Event count (approx.): 1646850
Overhead  Command         Shared Object      Symbol
  98.24%  (sd-parse-elf)  libelf-0.188.so    [.] elf_getdata_rawchunk
   0.48%  (sd-parse-elf)  libelf-0.188.so    [.] 0x48a3
   0.27%  (sd-parse-elf)  libelf-0.188.so    [.] gelf_getphdr
   0.11%  (sd-parse-elf)  libc.so.6          [.] _int_malloc
   0.10%  (sd-parse-elf)  libelf-0.188.so    [.] gelf_getnote
   0.06%  (sd-parse-elf)  libc.so.6          [.] __libc_calloc
   0.05%  (sd-parse-elf)  [kernel.kallsyms]  [k] __softirqentry_text_start
   0.05%  (sd-parse-elf)  libc.so.6          [.] _int_free


(gdb) bt
#0  0x7f0ba8a88194 in elf_getdata_rawchunk () from target:/lib64/libelf.so.1
#1  0x7f0ba98e5013 in module_callback.lto_priv () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#2  0x7f0ba8ae7291 in dwfl_getmodules () from target:/lib64/libdw.so.1
#3  0x7f0ba98e6dc0 in parse_elf_object () from target:/usr/lib64/systemd/libsystemd-shared-252.so
#4  0x562c474f2d5e in submit_coredump ()
#5  0x562c474f57d1 in process_socket.constprop ()
#6  0x562c474efbf8 in main ()

My reproducer actually doesn’t fully re-implement what systemd
implements (the parsing of the package metadata is clearly omitted), so
I thought I had reproduced the same problem while apparently I didn’t;
sorry for that. We will also have to double-check whether just using
2000 dummy libraries is really enough, or whether this also needs a
more complex binary like the one we have in our real case.

Tomorrow on our side we will have to play a bit with a local build of 
systemd-coredump and try to run it manually to better understand what’s going 
wrong.


Note: when I wrote and tested my reproducer, I used a fedora:38
container, which doesn’t have DEBUGINFOD_URLS set (which may be
different from a real, non-containerized Fedora 38).

[root@7563ccfb7a39 /]# printenv|grep DEBUGINFOD_URLS
[root@7563ccfb7a39 /]# find /etc/profile.d/|grep debug
[root@7563ccfb7a39 /]# cat /etc/os-release
NAME="Fedora Linux"
VERSION="38 (Container Image)"

Cheers,
Romain