performance of exception handling

2020-05-11 Thread Thomas Neumann via Gcc
Hi,

I want to improve the performance of C++ exception handling, and I would
like to get some feedback on how to tackle that.

Currently, exception handling scales poorly due to global mutexes when
throwing. This can be seen with a small demo script here:
https://repl.it/repls/DeliriousPrivateProfiler
Using a thread count > 1 is much slower than running single threaded.
This global locking is particularly painful on machines with more than a
hundred cores, where mutexes are expensive and contention becomes much
more likely due to the high degree of parallelism.
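For reference, a minimal reproducer along the lines of that demo could look
like the following (the function names and iteration counts here are mine,
not taken from the script itself); every worker thread just throws and
catches in a loop:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Half of all calls throw; noinline keeps the throw out of the caller so
// that a real unwind through the runtime happens every time.
__attribute__((noinline)) static void maybeThrow(unsigned v) {
   if (v & 1) throw v;
}

static void worker(unsigned iterations) {
   for (unsigned i = 0; i < iterations; ++i) {
      try {
         maybeThrow(i);
      } catch (unsigned) {
         // discard, we only measure the unwinding cost
      }
   }
}

int main(int argc, char** argv) {
   unsigned threads = (argc > 1) ? std::atoi(argv[1]) : 1;
   constexpr unsigned perThread = 1'000'000;

   auto start = std::chrono::steady_clock::now();
   std::vector<std::thread> pool;
   for (unsigned t = 0; t < threads; ++t)
      pool.emplace_back(worker, perThread);
   for (auto& t : pool) t.join();
   auto stop = std::chrono::steady_clock::now();

   std::printf("%u thread(s): %lld ms\n", threads,
               (long long)std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count());
   return 0;
}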

Of course, conventional wisdom says not to use exceptions when they can
occur somewhat frequently. But I think that is a silly argument; see the
WG21 paper P0709 for a detailed discussion. In particular, there is no
technical reason why exceptions have to be slow, it is just the current
implementation that is slow.

In the current gcc implementation on Linux the bottleneck is
_Unwind_Find_FDE, or more precisely the function dl_iterate_phdr, which
is called for every frame and iterates over all shared libraries while
holding a global lock.
That is inherently slow, both due to the global locking and due to the
data structures involved.
And it is not easy to speed that up with, e.g., a thread-local cache, as
glibc has no mechanism to notify us when a shared library is added or removed.
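To sketch roughly what happens today (a simplification of the logic in
libgcc's unwind-dw2-fde-dip.c, not a copy of it): for every single frame we
walk all loaded objects under glibc's loader lock until we find the one whose
PT_LOAD segments cover the current PC.

#include <link.h>
#include <cstdint>

struct SearchState {
   uintptr_t pc;                  // program counter we are unwinding through
   const ElfW(Phdr)* ehFrameHdr;  // PT_GNU_EH_FRAME of the covering object
   uintptr_t loadBase;
};

static int findObject(struct dl_phdr_info* info, size_t, void* data) {
   auto* state = static_cast<SearchState*>(data);
   bool covers = false;
   const ElfW(Phdr)* eh = nullptr;
   for (unsigned i = 0; i < info->dlpi_phnum; ++i) {
      const ElfW(Phdr)& ph = info->dlpi_phdr[i];
      if (ph.p_type == PT_LOAD) {
         uintptr_t begin = info->dlpi_addr + ph.p_vaddr;
         if (state->pc >= begin && state->pc < begin + ph.p_memsz)
            covers = true;
      } else if (ph.p_type == PT_GNU_EH_FRAME) {
         eh = &ph;
      }
   }
   if (covers && eh) {
      state->ehFrameHdr = eh;
      state->loadBase = info->dlpi_addr;
      return 1;   // stop; the caller then binary-searches .eh_frame_hdr
   }
   return 0;      // keep iterating over the remaining objects
}

// Called once per unwound frame today: O(#loaded objects), under a global lock.
const ElfW(Phdr)* findEhFrameHdr(uintptr_t pc, uintptr_t* loadBase) {
   SearchState state{pc, nullptr, 0};
   if (dl_iterate_phdr(&findObject, &state)) {
      *loadBase = state.loadBase;
      return state.ehFrameHdr;
   }
   return nullptr;
}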

We therefore need a way to locate the exception frames that is
independent of glibc. One way to achieve that would be to explicitly
register exception frames with __register_frame_info_bases in a
constructor function (and deregister them in a destructor function).
Of course, probing explicitly registered frames currently uses a global
lock, too, but that implementation is provided by libgcc, and we can
change it to something better that allows for lock-free reads.
In libgcc, explicitly registered frames take precedence over the
dl_iterate_phdr mechanism, which means that we could mix future code
that calls __register_frame_info_bases explicitly with code that does
not. Code that registers will unwind faster than code that does not,
but both can coexist in one process.
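A rough sketch of what that per-object registration could look like; the
__EH_FRAME_BEGIN__ symbol, the size of the bookkeeping buffer, and the null
text/data bases are placeholders of mine, but crtstuff already does something
very similar on targets that use the frame registry instead of
dl_iterate_phdr.

extern "C" {
   // Provided by libgcc (unwind-dw2-fde.h); 'struct object' is a bookkeeping
   // record whose layout we do not need to know here.
   struct object;
   void __register_frame_info_bases(const void* eh_frame, struct object* ob,
                                    void* tbase, void* dbase);
   void* __deregister_frame_info_bases(const void* eh_frame);
}

// Placeholder for a symbol marking the start of this object's .eh_frame
// section; in libgcc this comes from crtstuff.c.
extern "C" const char __EH_FRAME_BEGIN__[];

// Opaque storage for libgcc's bookkeeping record; the real code would use
// 'static struct object ob' with the full definition visible.
static void* frameObject[16];

__attribute__((constructor)) static void registerEHFrames() {
   // Null text/data bases, as in the plain __register_frame_info case.
   __register_frame_info_bases(__EH_FRAME_BEGIN__,
                               reinterpret_cast<struct object*>(frameObject),
                               nullptr, nullptr);
}

__attribute__((destructor)) static void deregisterEHFrames() {
   __deregister_frame_info_bases(__EH_FRAME_BEGIN__);
}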

Does that sound like a viable strategy to speed up exception handling? I
would be willing to contribute code for that, but I first wanted to know
if you are interested and if the strategy makes sense. Also, my
implementation makes use of atomics, which I hope are available on all
platforms that use unwind-dw2-fde.c, but I am not sure.
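One possibility for the atomics question would be to key the new code path on
the macros gcc itself predefines to describe the target's atomic support, and
keep the current mutex-based path otherwise. A sketch of the guard, not
actual libgcc configury:

#if defined(__GCC_ATOMIC_POINTER_LOCK_FREE) && __GCC_ATOMIC_POINTER_LOCK_FREE == 2
#define USE_LOCK_FREE_FDE_LOOKUP 1   /* pointer-sized atomics are always lock-free */
#else
#define USE_LOCK_FREE_FDE_LOOKUP 0   /* keep the existing mutex-protected path */
#endif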

Thomas


Re: performance of exception handling

2020-05-11 Thread Thomas Neumann via Gcc
> Link: 
> 
> I'm not sure if your summary is correct.

I was referring to Section 3.2, where Herb says:

"We must remove all technical reasons for a C++ project to disable
exception handling (e.g., by compiler switch) or ban use of exceptions,
in all or part of their project."

In a way I am disagreeing with the paper, of course, in that I propose
to make the existing exception mechanism faster instead of inventing a
new one. But where I agree with P0709 is that it is unfortunate that
many projects disable exceptions due to performance concerns. And I
think the performance problems can be solved.

> My current preferred solution is something that moves the entire code
> that locates the relevant FDE table into glibc.

That is indeed an option, but I have two concerns there. First, it will
lead to code duplication, as libgcc will have to continue to provide its
own implementation on systems with "old" glibcs lacking
__dl_ehframe_find. And second, libgcc has a second lookup mechanism for
__register_frame_info_bases etc., which is needed for JITed code anyway.
It seems attractive to handle that in the same data structure that also
covers the code from executables and shared libraries. Of course one
could move that part to glibc, too. But the code duplication problem
will persist for a long time, as gcc cannot rely upon the system glibc
being new enough to provide __dl_ehframe_find.
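If glibc ever provides such an entry point, libgcc could probe for it via a
weak reference and keep its own lookup as the fallback on older glibcs. The
name __dl_ehframe_find is taken from this discussion, but its signature below
is purely an assumption of mine for illustration, not an existing interface;
the fallback function is likewise just a stand-in.

extern "C" {
   // Hypothetical glibc entry point: returns the FDE data covering 'pc',
   // or a null pointer. Signature invented for this sketch.
   const void* __dl_ehframe_find(void* pc) __attribute__((weak));
}

// Stand-in for libgcc's existing lookup path (dl_iterate_phdr etc.);
// this is the duplicated code we would have to keep for a long time.
static const void* fallbackFindFDE(void* pc) {
   (void)pc;
   return nullptr;   // the real code would do the current slow lookup here
}

const void* findFDE(void* pc) {
   if (&__dl_ehframe_find)           // resolves to null on older glibcs
      return __dl_ehframe_find(pc);
   return fallbackFindFDE(pc);
}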

Thomas


Re: performance of exception handling

2020-05-11 Thread Thomas Neumann via Gcc
> Not all GCC/G++ targets are GNU/Linux and use GLIBC.  A duplicate
> implementation in GLIBC creates its own set of advantages and
> disadvantages.

So what should I do now? Should I try to move the lookup into GLIBC? Or
handle it within libgcc, as I had originally proposed? Or give up due
to the inertia of a large, grown system?

Another concern is memory consumption. I wanted to store the FDE entries
in a b-tree, which allows for fast lookup and low-overhead
synchronization. Memory-wise that is not really worse than what we have
today (the "linear" and "erratic" arrays). But the current code has a
fallback for when it is unable to allocate these arrays: it resorts to
linear search. Is something like that required? It would make the code
much more complicated (but I gathered from Moritz' mail that some people
really care about memory-constrained situations).
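The b-tree itself is too large to show here, but the synchronization idea
looks roughly like this: registration (rare) runs under a mutex and publishes
its changes with a single atomic store, while lookup (once per unwound frame)
is an atomic load plus a binary search and never blocks. The copy-on-write
vector below is only a stand-in for the b-tree, and all names are mine.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

struct FrameEntry {
   uintptr_t begin, end;   // code range covered by one registered object
   const void* fdeData;    // pointer to its FDE information
};

static std::mutex writerMutex;   // serializes registration/deregistration only
static std::atomic<const std::vector<FrameEntry>*> snapshot{nullptr};

void registerFrame(const FrameEntry& e) {
   std::lock_guard<std::mutex> guard(writerMutex);
   const std::vector<FrameEntry>* old = snapshot.load(std::memory_order_relaxed);
   auto* next = old ? new std::vector<FrameEntry>(*old) : new std::vector<FrameEntry>();
   next->insert(std::upper_bound(next->begin(), next->end(), e,
                                 [](const FrameEntry& a, const FrameEntry& b) {
                                    return a.begin < b.begin;
                                 }),
                e);
   snapshot.store(next, std::memory_order_release);
   // Reclamation of 'old' is deliberately omitted in this sketch; the real
   // b-tree updates in place and never hands stale memory to readers.
}

// Called during unwinding: no locks, no allocation.
const void* findFDE(uintptr_t pc) {
   const std::vector<FrameEntry>* current = snapshot.load(std::memory_order_acquire);
   if (!current) return nullptr;
   auto it = std::upper_bound(current->begin(), current->end(), pc,
                              [](uintptr_t p, const FrameEntry& entry) {
                                 return p < entry.begin;
                              });
   if (it == current->begin()) return nullptr;
   --it;
   return pc < it->end ? it->fdeData : nullptr;
}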

Thomas


Re: performance of exception handling

2020-05-12 Thread Thomas Neumann via Gcc
> Some people use exceptions to propagate "low memory" up which
> made me increase the size of the EH emergency pool (which is
> used when malloc cannot even allocate the EH data itself) ...
> 
> So yes, people care.  There absolutely has to be a path in
> unwinding that allocates no (as little as possible) memory.

Note that I would not allocate at all in the unwinding path. I would
allocate memory when new frames are registered, but unwinding itself
would happen without any allocations.

Of course there is a trade-off here. We could delay allocating the
lookup structures until the first exception occurs, in order to speed up
programs that never throw any exceptions. But that would effectively
force us to implement a "no memory" fallback, for exactly the reason you
gave, as something like bad_alloc might be the first exception that we
encounter.

Thomas


Re: performance of exception handling

2020-05-12 Thread Thomas Neumann via Gcc
> Just echoing what David said really, but: if the libgcc changes
> are expected to be portable beyond glibc, then the existence of
> an alternative option for glibc shouldn't block the libgcc changes.
> The two approaches aren't mutually exclusive and each approach
> would achieve something that the other one wouldn't.

To make this discussion a bit less abstract I have implemented a
prototype: https://pastebin.com/KtrPhci2
It is not perfect yet (for example, frame deregistration is suboptimal),
but it allows us to talk about an actual implementation with real
performance numbers.

To give some numbers, I take my silly example from
https://repl.it/repls/DeliriousPrivateProfiler
with 6 * 1,000,000 function calls, where half of the functions throw,
and I execute it either single threaded or multi-threaded (with 6
threads) on an i7-6800K. Note that the effects are even more dramatic on
larger machines.
The "old" implementation is gcc 9.3; the "new" implementation is gcc git
with the patch linked above. (Note that, when repeating the experiment,
you have to both use the patched gcc and use LD_LIBRARY_PATH or similar
to force the new libgcc.)

The execution times are:

old approach, single threaded: 4.3s
old approach, multi threaded: 6.5s
new approach, single threaded: 3.9s
new approach, multi threaded: 0.7s

This is faster even when single threaded, and it is dramatically faster
when using multiple threads. On machines where atomics are supported,
raising an exception no longer takes a global mutex (except for the
first exception after new exception frames have been added), and thus
exception processing scales nicely with the thread count. The code also
handles the out-of-memory condition, falling back to linear search in
that case (just like the old code).

Of course this needs more polishing and testing, but would something
like this be acceptable for gcc? It makes exceptions much more useful in
multi-threaded applications.

Thomas