Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
> On 21 June 2023 at 21:39, Mark Wielaard wrote:
>
> Hi Romain,
>
> That patch looks good. It should reduce the number of calls to
> elf_getdata_rawchunk a lot, making it less urgent that that function is
> fast. But if you could test it that would be appreciated. I'll also
> test it a bit more and will probably merge it if no issues show up,
> since it does seem better than keeping the linear list.
>
> Thanks,
>
> Mark

Hi,

So I have done some testing, running the command:

[root@fa28b3d254fd systemd]# rm -rf /var/lib/systemd/coredump/* && journalctl --flush --rotate && journalctl --vacuum-time=1s && time cat the-core | build/systemd-coredump $$ 0 0 11 1687360485 1000 localhost && journalctl

Where "the-core" is the real core dump we had in production, with around 1700+ shared libraries loaded, and an uncompressed core dump size of 2GB+.

Without the systemd patch, without the elfutils patch:

real    3m42.308s
user    3m39.579s
sys     0m2.294s

Without the systemd patch, with the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):

real    0m15.543s
user    0m13.662s
sys     0m2.405s

real    0m15.976s
user    0m13.832s
sys     0m2.481s

real    0m15.612s
user    0m13.687s
sys     0m2.470s

With the systemd patch, without the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):

real    0m2.994s
user    0m1.104s
sys     0m2.477s

real    0m3.011s
user    0m1.154s
sys     0m2.447s

real    0m3.009s
user    0m1.141s
sys     0m2.457s

With the systemd patch, with the elfutils patch (3 runs, first one excluded to make sure the kernel caches what it caches):

real    0m2.921s
user    0m1.093s
sys     0m2.399s

real    0m2.950s
user    0m1.129s
sys     0m2.401s

real    0m2.933s
user    0m1.136s
sys     0m2.371s

Overall we can see that both fixes independently fix the performance issue (though the systemd patch seems a bit more effective), so I guess both fixes are welcome.

Mark, do you think it's worth backporting this on the elfutils side in CentOS Stream 9/RHEL 9?
If you need a ticket, we have Red Hat case 03536980 which led to RHBZ 2215412.

Thanks!
Romain
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
Hi Frank,

On Wed, 2023-06-21 at 22:40 -0400, Frank Ch. Eigler wrote:
> For an application that processes these elf/dwarf files sequentially,
> queries for each synthetic solib are going to result in 2000 https-404
> transactions, sans debuginfod caching. If you're lucky (reusing a
> dwfl object), elfutils may be able to reuse a long-lived https
> connection to a server, otherwise a new https connection might have to
> be spun up for each. But even with reuse, we're talking about 2000
> pingponging messages. That will take a handful of minutes of elapsed
> time just by itself.
>
> If the calling code made these queries in parallel batches, it would
> be much faster overall.

I have been thinking about this too, but I don't know of a good solution that doesn't negate the (iterative) lazy model that Dwfl uses.

libdwfl tries to do the least amount of work possible, so that you don't "pay" for looking up extra information for Dwfl_Modules (libraries) you don't care about. So for the use case of getting a backtrace of a particular thread in a core file (Dwfl), you only fetch/load the register notes and stack memory from the core file. Then, only if you translate those stack addresses to symbols, and only for the modules associated with those stack addresses, will it try to fetch/load the symbol tables, which might involve resolving build-ids to Elf or separate Dwarf debug files. This is normally an iterative process, and for something like generating a backtrace it often involves just a handful of Dwfl_Modules (libraries), not all 2000.

In this case that falls down a bit, since the application creates a Dwfl from a core file and then requests information (the ELF file) from all Dwfl_Modules, so it can get at the package note description for each of them. As Romain noted, it would be really nice if elfutils/libdwfl had a way to get at the package note description (just like it has for getting the build-id, which is another loaded ELF note). Such a function would know it doesn't need to get the full ELF file.

Maybe another solution might be a "get me everything for this Dwfl: all symbol tables, all ELF files, all separate Dwarf debug files, etc." function, so an application can "pay upfront" for not having to fetch each item lazily. Such a function could then do a "parallel/batched fetch" through debuginfod.

Cheers,

Mark
Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.
Hi Romain,

On Thu, 2023-06-22 at 08:10, Romain GEISSLER via Elfutils-devel wrote:
> Overall we can see that both fixes independently fix the performance issue
> (yet the systemd patch seems a bit more effective), so I guess both fixes
> are welcome.

Yes, thanks so much for testing the individual fixes.

> Mark, do you think it's worth backporting this in CentOS Stream 9/RHEL 9 on
> the elfutils side? If you need a ticket, we have Red Hat case 03536980 which
> led to RHBZ 2215412.

Yes, I'll look into it. The systemd fix seems most important, since it also fixes a correctness issue and addresses the root cause of the performance problem. But I'll see if we can backport the elfutils fix too. I can add it to Fedora first.

Thanks,

Mark
[COMMITTED] debuginfod: Fix formatting in debuginfod_config_cache
The formatting of debuginfod_config_cache in debuginfod-client.c was
slightly off, making it hard to see the program logic. Make sure lines
are < 76 chars, and that if { } else { } indentation follows GNU style.

Signed-off-by: Mark Wielaard
---
 debuginfod/debuginfod-client.c | 32 ++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c
index cb28f1d0..d92d8d62 100644
--- a/debuginfod/debuginfod-client.c
+++ b/debuginfod/debuginfod-client.c
@@ -277,23 +277,27 @@ debuginfod_config_cache(debuginfod_client *c, char *config_path,
     }
 
   long cache_config;
-  /* PR29696 - NB: When using fdopen, the file descriptor is NOT dup'ed and will
-     be closed when the stream is closed. Manually closing fd after fclose
-     is called will lead to a race condition where, if reused, the file descriptor will
-     compete for its regular use before being incorrectly closed here.
-  */
+  /* PR29696 - NB: When using fdopen, the file descriptor is NOT
+     dup'ed and will be closed when the stream is closed.  Manually
+     closing fd after fclose is called will lead to a race condition
+     where, if reused, the file descriptor will compete for its
+     regular use before being incorrectly closed here.  */
   FILE *config_file = fdopen(fd, "r");
   if (config_file)
-    {
-      if (fscanf(config_file, "%ld", &cache_config) != 1)
+    {
+      if (fscanf(config_file, "%ld", &cache_config) != 1)
+        cache_config = cache_config_default_s;
+      if (0 != fclose (config_file) && c->verbose_fd >= 0)
+        dprintf (c->verbose_fd, "fclose failed with %s (err=%d)\n",
+                 strerror (errno), errno);
+    }
+  else
+    {
       cache_config = cache_config_default_s;
-      if(0 != fclose(config_file) && c->verbose_fd >= 0)
-        dprintf (c->verbose_fd, "fclose failed with %s (err=%d)\n", strerror (errno), errno);
-    }else{
-      cache_config = cache_config_default_s;
-      if(0 != close(fd) && c->verbose_fd >= 0)
-        dprintf (c->verbose_fd, "close failed with %s (err=%d)\n", strerror (errno), errno);
-    }
+      if (0 != close (fd) && c->verbose_fd >= 0)
+        dprintf (c->verbose_fd, "close failed with %s (err=%d)\n",
+                 strerror (errno), errno);
+    }
 
   return cache_config;
 }
-- 
2.40.1