[PATCH 3/3 v4] src/readelf.c: Support concurrency for -w, --debug-dump

2025-06-29 Thread Aaron Merey
Implement concurrent execution of print_debug_* functions during handling
of -w, --debug-dump using libthread.a.

A new `-C, --concurrency=NUM` command line option controls the maximum
number of threads that may be used. This value defaults to the number of CPUs.

Job output is buffered and printed in the order that jobs were added to
the queue. This helps preserve the existing order of stdout output.

* src/readelf.c (default_concurrency): Function estimating the
maximum number of threads.
(parse_opt): Handle -C, --concurrency=NUM.
(do_job): Entry point function for threads.
(schedule_job): If thread safety is enabled, add job to the
job queue.  Otherwise just run the job from the main thread.
(print_debug): Pass print_debug_* function pointers and
args to schedule_job. Also call run_jobs if thread safety
is enabled.

Signed-off-by: Aaron Merey 

---
v4: allow num_thread jobs instead of num_thread - 1

heap alloc job_data

Use max_threads instead of max_threads - 1 for determining the maximum
number of jobs running concurrently.

Improve error output when the argument to -C is invalid.

Allocate the job_data on the heap instead of the stack to avoid stack overflow
if there are a very large number of debug sections.

> On Tue, May 27, 2025 at 10:04:20AM -0400, Aaron Merey wrote:
> > Implement concurrent execution of print_debug_* functions during handling
> > of -w, --debug-dump using libthread.a.
>
> Do you intend to also enable this for other sections?

Yes I will extend parallelization to non-debug sections in future
patches.

> > +/* Estimate the maximum number of threads. This is normally
> > +   #CPU.  Return value is guaranteed to be at least 1.  */
> > +static int
> > +default_concurrency (void)
> > +{
> > +  unsigned aff = 0;
> > +#ifdef HAVE_SCHED_GETAFFINITY
> > +  {
> > +    int ret;
> > +    cpu_set_t mask;
> > +    CPU_ZERO (&mask);
> > +    ret = sched_getaffinity (0, sizeof(mask), &mask);
> > +    if (ret == 0)
> > +      aff = CPU_COUNT (&mask);
> > +  }
> > +#endif
>
> Is this better/worse than using get_nprocs or sysconf
> (_SC_NPROCESSORS_ONLN)?
> https://www.gnu.org/software/libc/manual/html_node/Processor-Resources.html
>
> Also from the sched_getaffinity man page it looks like cpu_set_t might
> be too small and the call might fail on systems with > 1024
> cores. Maybe this is a non-issue though?
>
> > +  unsigned fn = 0;
> > +#ifdef HAVE_GETRLIMIT
> > +  {
> > +    struct rlimit rlim;
> > +    int rc = getrlimit (RLIMIT_NOFILE, &rlim);
> > +    if (rc == 0)
> > +      fn = MAX ((rlim_t) 1, (rlim.rlim_cur - 100) / 2);
> > +    /* Conservatively estimate that at least 2 fds are used
> > +       by each thread.  */
> > +  }
> > +#endif
>
> Interesting. I wouldn't have thought about that. But makes sense. If
> you check available file descriptors, would it also make sense to
> check for available memory? And limit the number of jobs to mem/1024M?

This function is copied from debuginfod.cxx where we haven't run into
issues with >1024 cores or the number of jobs exceeding mem/1024M.

> >  /* To store the name used in compare_listptr */
> > -static const char *sort_listptr_name;
> > +_Thread_local const char *sort_listptr_name;
>
> O, that is interesting and not in your ChangeLog.
> Is this the simplest way to make this thread-safe?
> Could there be a lock around sort_listptr ?
> What about the static listptr_table known_[ptr|bases]?
> Should building up/accessing those be made thread-safe?

In the current implementation print_debug_info_section, which
populates the listptr_tables, runs on its own before add_jobs
and run_jobs are called.  After that each listptr_table is used
only within one print_debug_* function, so there isn't a risk
of race conditions at the moment.

When we extend eu-readelf paralellization in future patches
print_debug_info_section will run in paralell with other
print_debug_* functions and the listptr_tables will be protected
by locks.

 src/readelf.c | 164 --
 1 file changed, 160 insertions(+), 4 deletions(-)

diff --git a/src/readelf.c b/src/readelf.c
index efb445ed..11082830 100644
--- a/src/readelf.c
+++ b/src/readelf.c
@@ -57,6 +57,18 @@
 
 #include "../libdw/known-dwarf.h"
 
+#ifdef USE_LOCKS
+#include "threadlib.h"
+#endif
+
+#ifdef HAVE_SCHED_H
+#include 
+#endif
+
+#ifdef HAVE_SYS_RESOURCE_H
+#include 
+#endif
+
 #ifdef __linux__
 #define CORE_SIGILL  SIGILL
 #define CORE_SIGBUS  SIGBUS
@@ -150,6 +162,10 @@ static const struct argp_option options[] =
 N_("Ignored for compatibility (lines always wide)"), 0 },
   { "decompress", 'z', NULL, 0,
 N_("Show compression information for compressed sections (when used with 
-S); decompress section before dumping data (when used with -p or -x)"), 0 },
+#ifdef USE_LOCKS
+  { "concurrency", 'C', "NUM", 0,
+N_("Set maximum number of threads. Defaults to the number of CPUs."), 0 },
+#endi

[PATCH] readelf.c: Avoid repeating calls to gettext _() in hotpath

2025-06-29 Thread Aaron Merey
Calls to the gettext _() macro in hotpaths cause many runtime lookups
of the same translation strings.  This patch introduces macro __()
which caches the result of _() at each call site where __() is used.

When a cached translation string is used as the format string for
fprintf, the compiler might raise a -Wformat-nonliteral warning.

To avoid this, the -Wformat-nonliteral warning is suppressed using _Pragma
around affected fprintf calls, using IGNORE_FMT_NONLITERAL_BEGIN and
_END wrappers.

To avoid suppressing -Wformat-nonliteral across every printf-style
function using _(), __() and IGNORE_FMT_NONLITERAL_* are mostly used
within hotpath loops where caching makes a signficant difference.
Runtime improvements of up to 13% faster have been observed with this patch
applied.

Signed-off-by: Aaron Merey 
---
 src/readelf.c | 222 ++
 1 file changed, 169 insertions(+), 53 deletions(-)

diff --git a/src/readelf.c b/src/readelf.c
index 11082830..fdea1941 100644
--- a/src/readelf.c
+++ b/src/readelf.c
@@ -438,6 +438,24 @@ default_concurrency (void)
 }
 #endif
 
+#include 
+
+/* Cache the result of a call to the _() macro at the callsite.  */
+#define __(str) \
+  ({\
+static _Atomic char *__p;   \
+(char *) atomic_load (&__p) \
+  ?: (atomic_store (&__p, (_Atomic char *) _(str)), \
+  (char *) atomic_load (&__p)); \
+  })
+
+#define IGNORE_FMT_NONLITERAL_BEGIN \
+  _Pragma("GCC diagnostic push")\
+  _Pragma("GCC diagnostic ignored \"-Wformat-nonliteral\"")
+
+#define IGNORE_FMT_NONLITERAL_END   \
+  _Pragma ("GCC diagnostic pop")
+
 int
 main (int argc, char *argv[])
 {
@@ -4942,7 +4960,9 @@ print_block (size_t n, const void *block, FILE *out)
 fputs (_("empty block\n"), out);
   else
 {
-  fprintf (out, _("%zu byte block:"), n);
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __("%zu byte block:"), n);
+  IGNORE_FMT_NONLITERAL_END
   const unsigned char *data = block;
   do
fprintf (out, " %02x", *data++);
@@ -5866,11 +5886,13 @@ print_debug_abbrev_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
  unsigned int tag = dwarf_getabbrevtag (&abbrev);
  int has_children = dwarf_abbrevhaschildren (&abbrev);
 
+ IGNORE_FMT_NONLITERAL_BEGIN
  fprintf (out, _(" [%5u] offset: %" PRId64
   ", children: %s, tag: %s\n"),
   code, (int64_t) offset,
   has_children ? yes_str : no_str,
   dwarf_tag_name (tag));
+ IGNORE_FMT_NONLITERAL_END
 
  size_t cnt = 0;
  unsigned int name;
@@ -6210,7 +6232,10 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
   const unsigned char *hdrstart = readp;
   size_t start_offset = hdrstart - (const unsigned char *) data->d_buf;
 
-  fprintf (out, _("\nTable at offset %zu:\n"), start_offset);
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __("\nTable at offset %zu:\n"), start_offset);
+  IGNORE_FMT_NONLITERAL_END
+
   if (readp + 4 > readendp)
{
invalid_data:
@@ -6230,8 +6255,11 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
}
 
   const unsigned char *nexthdr = readp + length;
-  fprintf (out, _("\n Length:%6" PRIu64 "\n"),
+
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __("\n Length:%6" PRIu64 "\n"),
   (uint64_t) length);
+  IGNORE_FMT_NONLITERAL_END
 
   if (unlikely (length > (size_t) (readendp - readp)))
goto invalid_data;
@@ -6242,8 +6270,12 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
   if (readp + 2 > readendp)
goto invalid_data;
   uint_fast16_t version = read_2ubyte_unaligned_inc (dbg, readp);
-  fprintf (out, _(" DWARF version: %6" PRIuFAST16 "\n"),
+
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __(" DWARF version: %6" PRIuFAST16 "\n"),
   version);
+  IGNORE_FMT_NONLITERAL_END
+
   if (version != 2)
{
  error (0, 0, _("unsupported aranges version"));
@@ -6257,14 +6289,21 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
offset = read_8ubyte_unaligned_inc (dbg, readp);
   else
offset = read_4ubyte_unaligned_inc (dbg, readp);
-  fprintf (out, _(" CU offset: %6" PRIx64 "\n"),
+
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __(" CU offset: %6" PRIx64 "\n"),
   (uint64_t) offset);
+  IGNORE_FMT_NONLITERAL_END
 
   if (readp + 1 > readendp)
goto invalid_data;
   unsigned int address_size 

[PATCH 1/3 v3] src: Add threadlib library for parallel job execution

2025-06-29 Thread Aaron Merey
Add new internal static library libthread.a that provides infrastructure
for eu-* tools to run functions concurrently using pthreads.

threadlib.c manages per-job threads as well as per-job buffers for stdout
output.  Output for each job is printed to stdout in the order that the
jobs were added to the job queue.  This helps preserve the order of
output when parallelization is added to an eu-* tool.

threadlib.h declares functions add_job and run_jobs. Jobs are added to
a threadlib.c internal job queue using add_job. run_jobs concurrently
executes jobs in parallel.

eu-readelf now links against libthread.a when elfutils is configured
with --enable-thread-safety.

* src/Makefile.am: libthread.a is compiled and and linked with
readelf when USE_LOCKS is defined.
* src/threadlib.c: New file. Manages job creation, concurrent
execution and output handling.
* src/threadlib.h: New file. Declares functions add_job and
run_jobs.

Signed-off-by: Aaron Merey 

---
v3:
Document that add_job cannot be called during run_jobs.

Add noinst_HEADERS and EXTRA_libtreahd_a_DEPENDENCIES to src/Makefile.am

Replace a printf call with fwrite.

Add a comment clarifying the use of NULL in the job queue.

The main thread now waits on a condition variable after creating worker
threads.  A worker thread will signal this condition variable when a job
completes.

> On Thu, May 22, 2025 at 06:28:52PM -0400, Aaron Merey wrote:
> > +typedef enum {
> > +  /* pthread_create has not been called.  */
> > +  NOT_STARTED,
> > +
> > +  /* pthread_create has been called.  */
> > +  STARTED,
> > +
> > +  /* The thread has finished running the job but has not been joined.  */
> > +  DONE,
> > +
> > +  /* pthread_join has been called.  */
> > +  JOINED
> > +} thread_state_t;
>
> Do we have to think about the state of other possibly failed jobs?
> Or does a failed job just call exit and that terminates everything?

Yes exit will terminate the whole program in this case.

 src/Makefile.am |  16 +++
 src/threadlib.c | 275 
 src/threadlib.h |  36 +++
 3 files changed, 327 insertions(+)
 create mode 100644 src/threadlib.c
 create mode 100644 src/threadlib.h

diff --git a/src/Makefile.am b/src/Makefile.am
index 4a3fb957..f041d458 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -52,6 +52,19 @@ libar.manifest: $(libar_a_OBJECTS)
 MOSTLYCLEANFILES = *.gconv
 CLEANFILES = $(bin_SCRIPTS) $(EXTRA_libar_a_DEPENDENCIES)
 
+if USE_LOCKS
+noinst_LIBRARIES += libthread.a
+noinst_HEADERS = threadlib.h
+libthread_a_SOURCES = threadlib.c
+EXTRA_DIST += threadlib.h
+EXTRA_libthread_a_DEPENDENCIES = libthread.manifest
+
+libthread.manifest: $(libthread_a_OBJECTS)
+   $(AM_V_GEN)echo $^ > $@
+
+CLEANFILES += $(EXTRA_libthread_a_DEPENDENCIES)
+endif
+
 if BUILD_STATIC
 libasm = ../libasm/libasm.a
 libdw = ../libdw/libdw.a -lz $(zip_LIBS) $(libelf) -ldl -lpthread
@@ -91,6 +104,9 @@ ar_no_Wstack_usage = yes
 unstrip_no_Wstack_usage = yes
 
 readelf_LDADD = $(libdw) $(libebl) $(libelf) $(libeu) $(argp_LDADD)
+if USE_LOCKS
+readelf_LDADD += libthread.a
+endif
 nm_LDADD = $(libdw) $(libebl) $(libelf) $(libeu) $(argp_LDADD) $(obstack_LIBS) 
\
   $(demanglelib)
 size_LDADD = $(libelf) $(libeu) $(argp_LDADD)
diff --git a/src/threadlib.c b/src/threadlib.c
new file mode 100644
index ..cf89c2e3
--- /dev/null
+++ b/src/threadlib.c
@@ -0,0 +1,275 @@
+/* Functions for running jobs concurrently.
+   Copyright (C) 2025 Red Hat, Inc.
+   This file is part of elfutils.
+
+   This file is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   elfutils is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see .  */
+
+#ifdef HAVE_CONFIG_H
+# include 
+#endif
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "threadlib.h"
+
+/* Dynamic buffer for thread output.  */
+typedef struct {
+  size_t sizeloc;
+  char *buf;
+  FILE *file;
+} output_stream_t;
+
+/* Allocate resources for STREAM.  */
+static void
+init_thread_output_stream (output_stream_t *stream)
+{
+  stream->buf = NULL;
+  stream->sizeloc = 0;
+  stream->file = open_memstream (&(stream->buf), &(stream->sizeloc));
+
+  if (stream->file == NULL)
+error (1, 0, _("cannot open thread output stream"));
+}
+
+/* Print and deallocate resources for STREAM.  */
+static void
+print_thread_output_stream (output_stream_t *stream)
+{
+  /* fclose may update stream->buf.  */
+  if (fclose (stream->file) !=