Re: [PATCH 2/3 v3] src/readelf.c: Add support for print_debug_* output buffering

2025-07-02 Thread Mark Wielaard
On Sun, 2025-06-29 at 23:51 -0400, Aaron Merey wrote:
> Safely handle stdout output during concurrent calls to print_debug_*
> functions.
> 
> For any print_debug_* function and any function that could be called
> from print_debug_* which also prints to stdout: add a FILE * argument
> and replace all printf, puts, putchar with fprintf.  All printing
> to stdout will now be written to this FILE instead.
> 
> The FILE * is an interface to a per-thread dynamically-sized buffer.
> libthread.a manages the allocation, printing and deallocation of
> these buffers.
> 
> Signed-off-by: Aaron Merey 
> 
> ---
> v3: Reorder this patch to be 2/3 in the series instead of 3/3.
> 
> Remove unnecessary handling of null FILE *

Reviewed this first since it seems to apply independently of the rest
of the patches and it seems your other work on gettext adjustments also
depends on this one.

I spot checked the code and ran some quick tests. Looks good.

It should be easy to extend this to non-DWARF-section printing code.

We might want to think about how to better chop up print_debug_units so
that is can more easily to print just one unit at a time (so those can
be parallelized too). The attrcb_args callback support already seems
ready for that.

But that is all for some future.

Cheers,

Mark


Re: [PATCH] readelf.c: Avoid repeating calls to gettext _() in hotpath

2025-07-02 Thread Mark Wielaard
Hi Aaron,

On Sun, 2025-06-29 at 23:51 -0400, Aaron Merey wrote:
> Calls to the gettext _() macro in hotpaths cause many runtime lookups
> of the same translation strings.  This patch introduces macro __()
> which caches the result of _() at each call site where __() is used.
> 
> When a cached translation string is used as the format string for
> fprintf, the compiler might raise a -Wformat-nonliteral warning.
> 
> To avoid this, the -Wformat-nonliteral warning is suppressed using _Pragma
> around affected fprintf calls, using IGNORE_FMT_NONLITERAL_BEGIN and
> _END wrappers.
> 
> To avoid suppressing -Wformat-nonliteral across every printf-style
> function using _(), __() and IGNORE_FMT_NONLITERAL_* are mostly used
> within hotpath loops where caching makes a signficant difference.
> Runtime improvements of up to 13% faster have been observed with this patch
> applied.

I can see how these changes can increase runtime (for threaded runs I
assume). But I really don't like them very much. It adds a lot of
macros/code around a few invocations of _() that probably should just
be removed.

> @@ -4942,7 +4960,9 @@ print_block (size_t n, const void *block, FILE *out)
>  fputs (_("empty block\n"), out);

Side question, is that extra \n deliberate? fputs already adds an \n.

>else
>  {
> -  fprintf (out, _("%zu byte block:"), n);
> +  IGNORE_FMT_NONLITERAL_BEGIN
> +  fprintf (out, __("%zu byte block:"), n);
> +  IGNORE_FMT_NONLITERAL_END
>const unsigned char *data = block;
>do
>   fprintf (out, " %02x", *data++);

So for these why not simply have translated strings in a static var at
the top of print_block:

static char *empty_block_str = _("empty block");
static char *byte_block_str = _("byte block");

And then in the loop do:

  fputs (empty_block_str, out);
[... else ...]
  fprintf (out, "%zu %s:", n, byte_block_str);

> @@ -5866,11 +5886,13 @@ print_debug_abbrev_section (Dwfl_Module *dwflmod 
> __attribute__ ((unused)),
> unsigned int tag = dwarf_getabbrevtag (&abbrev);
> int has_children = dwarf_abbrevhaschildren (&abbrev);
>  
> +   IGNORE_FMT_NONLITERAL_BEGIN
> fprintf (out, _(" [%5u] offset: %" PRId64
>  ", children: %s, tag: %s\n"),
>  code, (int64_t) offset,
>  has_children ? yes_str : no_str,
>  dwarf_tag_name (tag));
> +   IGNORE_FMT_NONLITERAL_END

Why is this necessary here? You aren't using __().
But if you do want this for the "offset", "children" and "tag" strings
I would handle them just like yes_str and no_str are handled here.

> @@ -6210,7 +6232,10 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
> __attribute__ ((unused)),
>const unsigned char *hdrstart = readp;
>size_t start_offset = hdrstart - (const unsigned char *) data->d_buf;
>  
> -  fprintf (out, _("\nTable at offset %zu:\n"), start_offset);
> +  IGNORE_FMT_NONLITERAL_BEGIN
> +  fprintf (out, __("\nTable at offset %zu:\n"), start_offset);
> +  IGNORE_FMT_NONLITERAL_END
> +
>if (readp + 4 > readendp)
>   {
>   invalid_data:

Again, if possibly just add a static char *table_at_offset_str =
_("table_at_offset");

> @@ -6230,8 +6255,11 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
> __attribute__ ((unused)),
>   }
>  
>const unsigned char *nexthdr = readp + length;
> -  fprintf (out, _("\n Length:%6" PRIu64 "\n"),
> +
> +  IGNORE_FMT_NONLITERAL_BEGIN
> +  fprintf (out, __("\n Length:%6" PRIu64 "\n"),
>  (uint64_t) length);
> +  IGNORE_FMT_NONLITERAL_END
>  
>if (unlikely (length > (size_t) (readendp - readp)))
>   goto invalid_data;

"Length" can probably also be translated once and then reused
everywhere.

> @@ -6242,8 +6270,12 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
> __attribute__ ((unused)),
>if (readp + 2 > readendp)
>   goto invalid_data;
>uint_fast16_t version = read_2ubyte_unaligned_inc (dbg, readp);
> -  fprintf (out, _(" DWARF version: %6" PRIuFAST16 "\n"),
> +
> +  IGNORE_FMT_NONLITERAL_BEGIN
> +  fprintf (out, __(" DWARF version: %6" PRIuFAST16 "\n"),
>  version);
> +  IGNORE_FMT_NONLITERAL_END
> +
>if (version != 2)
>   {
> error (0, 0, _("unsupported aranges version"));

Same for "version". I wouldn't try to translate "DWARF".

> @@ -6257,14 +6289,21 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
> __attribute__ ((unused)),
>   offset = read_8ubyte_unaligned_inc (dbg, readp);
>else
>   offset = read_4ubyte_unaligned_inc (dbg, readp);
> -  fprintf (out, _(" CU offset: %6" PRIx64 "\n"),
> +
> +  IGNORE_FMT_NONLITERAL_BEGIN
> +  fprintf (out, __(" CU offset: %6" PRIx64 "\n"),
>  (uint64_t) offset);
> +  IGNORE_FMT_NONLITERAL_END
>  
>if (readp + 1 > readendp)

Just translate "offset" once, keep CU (don't t

Re: [PATCH 1/3 v3] src: Add threadlib library for parallel job execution

2025-07-02 Thread Mark Wielaard
Hi Aaron,

On Sun, 2025-06-29 at 23:51 -0400, Aaron Merey wrote:
> Add new internal static library libthread.a that provides infrastructure
> for eu-* tools to run functions concurrently using pthreads.
> 
> threadlib.c manages per-job threads as well as per-job buffers for stdout
> output.  Output for each job is printed to stdout in the order that the
> jobs were added to the job queue.  This helps preserve the order of
> output when parallelization is added to an eu-* tool.
> 
> threadlib.h declares functions add_job and run_jobs. Jobs are added to
> a threadlib.c internal job queue using add_job. run_jobs concurrently
> executes jobs in parallel.
> 
> eu-readelf now links against libthread.a when elfutils is configured
> with --enable-thread-safety.
> 
>   * src/Makefile.am: libthread.a is compiled and and linked with
>   readelf when USE_LOCKS is defined.
>   * src/threadlib.c: New file. Manages job creation, concurrent
>   execution and output handling.
>   * src/threadlib.h: New file. Declares functions add_job and
>   run_jobs.
> 
> Signed-off-by: Aaron Merey 
> 
> ---
> v3:
> Document that add_job cannot be called during run_jobs.

Now that we have that clear and I read the other patches I do think
this is a limitation that we might want to address somehow. For example
it seems nice to be able to parallize the processing of debug_info CUs
or the debug_line tables. So it might be nice for a job to be able to
create subjobs just for its own stream. But this doesn't need to be
resolved right now. The threadlib is an internal library, so we can
just change it however we want in the future.

> Add noinst_HEADERS and EXTRA_libtreahd_a_DEPENDENCIES to src/Makefile.am
> 
> Replace a printf call with fwrite.
> 
> Add a comment clarifying the use of NULL in the job queue.
> 
> The main thread now waits on a condition variable after creating worker
> threads.  A worker thread will signal this condition variable when a job
> completes.

This looks good. I think with the mutex and condition you no longer
need the atomic stores (they all seem to happen under the mutex now?).
Although it cannot hurt.

> > On Thu, May 22, 2025 at 06:28:52PM -0400, Aaron Merey wrote:
> > > +typedef enum {
> > > +  /* pthread_create has not been called.  */
> > > +  NOT_STARTED,
> > > +
> > > +  /* pthread_create has been called.  */
> > > +  STARTED,
> > > +
> > > +  /* The thread has finished running the job but has not been joined.  */
> > > +  DONE,
> > > +
> > > +  /* pthread_join has been called.  */
> > > +  JOINED
> > > +} thread_state_t;
> > 
> > Do we have to think about the state of other possibly failed jobs?
> > Or does a failed job just call exit and that terminates everything?
> 
> Yes exit will terminate the whole program in this case.

OK. Then we should make sure that they output their warning/error
message to stderr and not their current stream (because that will be
lost). I think that happens automatically when using error() to exit.

Cheers,

Mark


Re: [PATCH 3/3 v4] src/readelf.c: Support concurrency for -w, --debug-dump

2025-07-02 Thread Mark Wielaard
Hi Aaron,

On Sun, Jun 29, 2025 at 11:51:58PM -0400, Aaron Merey wrote:
> Implement concurrent execution of print_debug_* functions during handling
> of -w, --debug-dump using libthread.a.
> 
> A new `-C, --concurrency=NUM` command line option controls the maximum
> number of threads that may be used. This value defaults to the number of CPUs.
> 
> Job output is buffered and printed in the order that jobs were added to
> the queue. This helps preserve the existing order of stdout output.
> 
>   * src/readelf.c (default_concurrency): Function estimating the
>   maximum number of threads.
>   (parse_opt): Handle -C, --concurrency=NUM.
>   (do_job): Entry point function for threads.
>   (schedule_job): If thread safety is enabled, add job to the
>   job queue.  Otherwise just run the job from the main thread.
>   (print_debug): Pass print_debug_* function pointers and
>   args to schedule_job. Also call run_jobs if thread safety
>   is enabled.
> 
> Signed-off-by: Aaron Merey 
> 
> ---
> v4: allow num_thread jobs instead of num_thread - 1
> 
> heap alloc job_data
> 
> Use max_threads instead of max_threads - 1 for determining the maximum
> number of jobs running concurrently.
> 
> Improve error output when the argument to -C is invalid.
> 
> Allocate the job_data on the heap instead of the stack to avoid stack overflow
> if there are a very large number of debug sections.
> 
> > On Tue, May 27, 2025 at 10:04:20AM -0400, Aaron Merey wrote:
> > > Implement concurrent execution of print_debug_* functions during handling
> > > of -w, --debug-dump using libthread.a.
> >
> > Do you intend to also enable this for other sections?
> 
> Yes I will extend parallelization to non-debug sections in future
> patches.

Nice.

> > > +/* Estimate the maximum number of threads. This is normally
> > > +   #CPU.  Return value is guaranteed to be at least 1.  */
> > > +static int
> > > +default_concurrency (void)
> > > +{
> > > +  unsigned aff = 0;
> > > +#ifdef HAVE_SCHED_GETAFFINITY
> > > +  {
> > > +    int ret;
> > > +    cpu_set_t mask;
> > > +    CPU_ZERO (&mask);
> > > +    ret = sched_getaffinity (0, sizeof(mask), &mask);
> > > +    if (ret == 0)
> > > +      aff = CPU_COUNT (&mask);
> > > +  }
> > > +#endif
> >
> > Is this better/worse than using get_nprocs or sysconf
> > (_SC_NPROCESSORS_ONLN)?
> > https://www.gnu.org/software/libc/manual/html_node/Processor-Resources.html
> >
> > Also from the sched_getaffinity man page it looks like cpu_set_t might
> > be too small and the call might fail on systems with > 1024
> > cores. Maybe this is a non-issue though?
> >
> > > +  unsigned fn = 0;
> > > +#ifdef HAVE_GETRLIMIT
> > > +  {
> > > +    struct rlimit rlim;
> > > +    int rc = getrlimit (RLIMIT_NOFILE, &rlim);
> > > +    if (rc == 0)
> > > +      fn = MAX ((rlim_t) 1, (rlim.rlim_cur - 100) / 2);
> > > +    /* Conservatively estimate that at least 2 fds are used
> > > +       by each thread.  */
> > > +  }
> > > +#endif
> >
> > Interesting. I wouldn't have thought about that. But makes sense. If
> > you check available file descriptors, would it also make sense to
> > check for available memory? And limit the number of jobs to mem/1024M?
> 
> This function is copied from debuginfod.cxx where we haven't run into
> issues with >1024 cores or the number of jobs exceeding mem/1024M.

But is that because it was never run on systems with >1024 core or
with low memory?

If it is common code, maybe it should be part of threadlib that
debuginfod then also uses?

> > >  /* To store the name used in compare_listptr */
> > > -static const char *sort_listptr_name;
> > > +_Thread_local const char *sort_listptr_name;
> >
> > O, that is interesting and not in your ChangeLog.
> > Is this the simplest way to make this thread-safe?
> > Could there be a lock around sort_listptr ?
> > What about the static listptr_table known_[ptr|bases]?
> > Should building up/accessing those be made thread-safe?
> 
> In the current implementation print_debug_info_section, which
> populates the listptr_tables, runs on its own before add_jobs
> and run_jobs are called.  After that each listptr_table is used
> only within one print_debug_* function, so there isn't a risk
> of race conditions at the moment.
> 
> When we extend eu-readelf paralellization in future patches
> print_debug_info_section will run in paralell with other
> print_debug_* functions and the listptr_tables will be protected
> by locks.

Aha. I see I missed the following in the source:

  /* If the .debug_info section is listed as implicitly required then
 we must make sure to handle it before handling any other debug
 section.  Various other sections depend on the CU DIEs being
 scanned (silently) first.  */

And then print_debug_unit has:

  const bool silent = !(print_debug_sections & section_info) && !debug_types;

And even when silent it then does various notice_listptr calls.

Confusing :)

I wonder if it still makes sense to have this

Re: [PATCH] readelf.c: Avoid repeating calls to gettext _() in hotpath

2025-07-02 Thread Henning Meyer
Btw, I would really like to add colored output to readelf, but the way 
gettext is used currently, translating raw printf format strings instead 
of actual words makes that impossible without breaking all translations.


From a security perspective, _("%zu byte block") allows a translator to 
reorder word order for other languages, but it also allows to translate 
%zu to %s or %n for crashes.


On 02.07.25 15:27, Mark Wielaard wrote:

Hi Aaron,

On Sun, 2025-06-29 at 23:51 -0400, Aaron Merey wrote:

Calls to the gettext _() macro in hotpaths cause many runtime lookups
of the same translation strings.  This patch introduces macro __()
which caches the result of _() at each call site where __() is used.

When a cached translation string is used as the format string for
fprintf, the compiler might raise a -Wformat-nonliteral warning.

To avoid this, the -Wformat-nonliteral warning is suppressed using _Pragma
around affected fprintf calls, using IGNORE_FMT_NONLITERAL_BEGIN and
_END wrappers.

To avoid suppressing -Wformat-nonliteral across every printf-style
function using _(), __() and IGNORE_FMT_NONLITERAL_* are mostly used
within hotpath loops where caching makes a signficant difference.
Runtime improvements of up to 13% faster have been observed with this patch
applied.

I can see how these changes can increase runtime (for threaded runs I
assume). But I really don't like them very much. It adds a lot of
macros/code around a few invocations of _() that probably should just
be removed.


@@ -4942,7 +4960,9 @@ print_block (size_t n, const void *block, FILE *out)
  fputs (_("empty block\n"), out);

Side question, is that extra \n deliberate? fputs already adds an \n.


else
  {
-  fprintf (out, _("%zu byte block:"), n);
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __("%zu byte block:"), n);
+  IGNORE_FMT_NONLITERAL_END
const unsigned char *data = block;
do
fprintf (out, " %02x", *data++);

So for these why not simply have translated strings in a static var at
the top of print_block:

static char *empty_block_str = _("empty block");
static char *byte_block_str = _("byte block");

And then in the loop do:

   fputs (empty_block_str, out);
[... else ...]
   fprintf (out, "%zu %s:", n, byte_block_str);


@@ -5866,11 +5886,13 @@ print_debug_abbrev_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
  unsigned int tag = dwarf_getabbrevtag (&abbrev);
  int has_children = dwarf_abbrevhaschildren (&abbrev);
  
+	  IGNORE_FMT_NONLITERAL_BEGIN

  fprintf (out, _(" [%5u] offset: %" PRId64
   ", children: %s, tag: %s\n"),
   code, (int64_t) offset,
   has_children ? yes_str : no_str,
   dwarf_tag_name (tag));
+ IGNORE_FMT_NONLITERAL_END

Why is this necessary here? You aren't using __().
But if you do want this for the "offset", "children" and "tag" strings
I would handle them just like yes_str and no_str are handled here.


@@ -6210,7 +6232,10 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
const unsigned char *hdrstart = readp;
size_t start_offset = hdrstart - (const unsigned char *) data->d_buf;
  
-  fprintf (out, _("\nTable at offset %zu:\n"), start_offset);

+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __("\nTable at offset %zu:\n"), start_offset);
+  IGNORE_FMT_NONLITERAL_END
+
if (readp + 4 > readendp)
{
invalid_data:

Again, if possibly just add a static char *table_at_offset_str =
_("table_at_offset");


@@ -6230,8 +6255,11 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
}
  
const unsigned char *nexthdr = readp + length;

-  fprintf (out, _("\n Length:%6" PRIu64 "\n"),
+
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __("\n Length:%6" PRIu64 "\n"),
   (uint64_t) length);
+  IGNORE_FMT_NONLITERAL_END
  
if (unlikely (length > (size_t) (readendp - readp)))

goto invalid_data;

"Length" can probably also be translated once and then reused
everywhere.


@@ -6242,8 +6270,12 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
if (readp + 2 > readendp)
goto invalid_data;
uint_fast16_t version = read_2ubyte_unaligned_inc (dbg, readp);
-  fprintf (out, _(" DWARF version: %6" PRIuFAST16 "\n"),
+
+  IGNORE_FMT_NONLITERAL_BEGIN
+  fprintf (out, __(" DWARF version: %6" PRIuFAST16 "\n"),
   version);
+  IGNORE_FMT_NONLITERAL_END
+
if (version != 2)
{
  error (0, 0, _("unsupported aranges version"));

Same for "version". I wouldn't try to translate "DWARF".


@@ -6257,14 +6289,21 @@ print_debug_aranges_section (Dwfl_Module *dwflmod 
__attribute__ ((unused)),
offset = read_8ubyte_unaligned_inc (dbg, readp);
else
offset = read_4ubyte_unaligned_i