Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-22 Thread Romain GEISSLER via Elfutils-devel
> Le 21 juin 2023 à 21:39, Mark Wielaard  a écrit :
> 
> 
> Hi Romain,
> 
> That patch looks good. It should reduce the number of calls to
> elf_getdata_rawchunk a lot. Making it less urgent that function is
> fast. But if you could test it that would be appreciated. I'll also
> test it a bit more and will probably merge it if no issues show up
> since it does seem better than keep using the linear list.
> 
> Thanks,
> 
> Mark


Hi,

So I have done some testing, running the command:

[root@fa28b3d254fd systemd]# rm -rf /var/lib/systemd/coredump/* && journalctl 
--flush --rotate && journalctl --vacuum-time=1s && time cat the-core | 
build/systemd-coredump $$ 0 0 11 1687360485 1000 localhost && 
journalctl

Where "the-core" is our real core dump we had in production, with around 1700+ 
shared libraries loaded, and the uncompressed core dump size is 2GB+.

Without the systemd patch, without the elfutils patch.
real3m42.308s
user3m39.579s
sys 0m2.294s


Without the systemd patch, with the elfutils patch (3 runs, first one excluded 
to make sure the kernel caches what it caches):
real0m15.543s
user0m13.662s
sys 0m2.405s

real0m15.976s 
user0m13.832s 
sys 0m2.481s

real0m15.612s
user0m13.687s 
sys 0m2.470s


With the systemd patch, without the elf utils patch (3 runs, first one excluded 
to make sure the kernel caches what it caches):
real0m2.994s  
user0m1.104s
sys 0m2.477s

real0m3.011s  
user0m1.154s  
sys 0m2.447s

real0m3.009s
user0m1.141s  
sys 0m2.457s


With the systemd patch, with the elf utils patch (3 runs, first one excluded to 
make sure the kernel caches what it caches):
real0m2.921s  
user0m1.093s
sys 0m2.399s

real0m2.950s
user0m1.129s  
sys 0m2.401s

real0m2.933s  
user0m1.136s  
sys 0m2.371s


Overall we can see that both fix independently fix the performance issue (yet 
the systemd patch seems a bit more effective), so I guess both fixes are 
welcome.

Mark, do you think it's worth backporting this in CentOS Steam 9/RHEL 9 on 
elfutils side ? If you need a ticket, we have Red Hat case 03536980 which lead 
to RHBZ 2215412.

Thanks !
Romain


Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-22 Thread Mark Wielaard
Hi Frank,

On Wed, 2023-06-21 at 22:40 -0400, Frank Ch. Eigler wrote:
> For an application that processes these elf/dwarf files sequentially,
> queries for each synthetic solib are going to result in 2000 https-404
> transactions, sans debuginfod caching.  If you're lucky (reusing a
> dwfl object), elfutils may be able to reuse a long-lived https
> connection to a server, otherwise a new https connection might have to
> be spun up for each.  But even with reuse, we're talking about 2000
> pingponging messages.  That will take a handful of minutes of elapsed
> time just by itself.
> 
> If the calling code made these queries in parallel batches, it would
> be much faster overall.

I have been thinking about this too. But don't know of a good solution
that doesn't negate the (iterative) lazy model that Dwfl uses.

libdwfl tries to do the least amount of work possible so that you don't
"pay" for looking up extra information for Dwfl_Modules (libraries) you
don't care about. So for the use case of getting a backtrace of a
particular thread in that core file (Dwfl) you only fetch/load the
register notes and stack memory from the core file. Then, only if you
translate those stack addresses to symbols, only for those modules
associated with those stack addresses it will try to fetch/load the
symbol tables, which might involve resolving build-ids to Elf or
separate Dwarf debug files. This is normally an iterative process and
for something like generating a backtrace it often involves just a
handful of Dwfl_Modules (libraries), not all 2000.

In this case this falls down a bit since the application creates a Dwfl
from a core file and then requests information (the elf file) from all
Dwfl_Modules, so it can get at the package note description for each of
them. As Romain noted it would be really nice if elfutils/libdwfl had a
way to get at the package note description (just like it has for
getting the build-id, which is another loaded ELF note). So such a
function would know it doesn't need to get the full ELF file.

Maybe another solution might be an "get me everything for this Dwfl,
all symbol tables, all elf files, all separate Dwarf debug files, etc."
function so an application can "pay upfront" for not having to fetch
each item lazily? Such a function could then do a "parallel/batched
fetch" through debuginfod.

Cheers,

Mark


Re: Performance issue with systemd-coredump and container process linking 2000 shared libraries.

2023-06-22 Thread Mark Wielaard
Hi Romain,

On Thu, 2023-06-22 at 08:10 +, Romain GEISSLER via Elfutils-devel
wrote:

> 
> Overall we can see that both fix independently fix the performance issue (yet 
> the systemd patch seems a bit more effective), so I guess both fixes are 
> welcome.

Yes, thanks so much for the testing of the individual fixes.

> Mark, do you think it's worth backporting this in CentOS Steam 9/RHEL 9 on 
> elfutils side ? If you need a ticket, we have Red Hat case 03536980 which 
> lead to RHBZ 2215412.

Yes, I'll look into it. The systemd fix seems most important since it
also fixes a correctness issue and fixes the root cause of the
performance issue. But I'll see if we can backport the elfutils fix
too. I can add it to Fedora first.

Thanks,

Mark


[COMMITTED] debuginfod: Fix formatting in debuginfod_config_cache

2023-06-22 Thread Mark Wielaard
The formatting of debuginfod_config_cache in debuginfod-client.c was
slightly off making it hard to see the program logic. Make sure lines
are < 76 chars, and if { } else { } indentation follows GNU style.

Signed-off-by: Mark Wielaard 
---
 debuginfod/debuginfod-client.c | 32 ++--
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/debuginfod/debuginfod-client.c b/debuginfod/debuginfod-client.c
index cb28f1d0..d92d8d62 100644
--- a/debuginfod/debuginfod-client.c
+++ b/debuginfod/debuginfod-client.c
@@ -277,23 +277,27 @@ debuginfod_config_cache(debuginfod_client *c, char 
*config_path,
 }
 
   long cache_config;
-  /* PR29696 - NB: When using fdopen, the file descriptor is NOT dup'ed and 
will
-  be closed when the stream is closed. Manually closing fd after fclose
-  is called will lead to a race condition where, if reused, the file 
descriptor will
-  compete for its regular use before being incorrectly closed here.
-  */
+  /* PR29696 - NB: When using fdopen, the file descriptor is NOT
+ dup'ed and will be closed when the stream is closed. Manually
+ closing fd after fclose is called will lead to a race condition
+ where, if reused, the file descriptor will compete for its
+ regular use before being incorrectly closed here.  */
   FILE *config_file = fdopen(fd, "r");
   if (config_file)
-  {
-if (fscanf(config_file, "%ld", &cache_config) != 1)
+{
+  if (fscanf(config_file, "%ld", &cache_config) != 1)
+   cache_config = cache_config_default_s;
+  if (0 != fclose (config_file) && c->verbose_fd >= 0)
+   dprintf (c->verbose_fd, "fclose failed with %s (err=%d)\n",
+strerror (errno), errno);
+}
+  else
+{
   cache_config = cache_config_default_s;
-if(0 != fclose(config_file) && c->verbose_fd >= 0)
-  dprintf (c->verbose_fd, "fclose failed with %s (err=%d)\n", strerror 
(errno), errno);
-  }else{
-cache_config = cache_config_default_s;
-if(0 != close(fd) && c->verbose_fd >= 0)
-  dprintf (c->verbose_fd, "close failed with %s (err=%d)\n", strerror 
(errno), errno);
-  }
+  if (0 != close (fd) && c->verbose_fd >= 0)
+   dprintf (c->verbose_fd, "close failed with %s (err=%d)\n",
+strerror (errno), errno);
+}
   return cache_config;
 }
 
-- 
2.40.1