Re: Excessive calls to iterate_phdr during exception handling
Hi! On Wed, May 29, 2013 at 12:02:27AM -0400, Ryan Johnson wrote: Note, swapping the order of dl_iterate_phdr and _Unwind_Find_registered_FDE IMHO is fine. > I think what you're saying is that the p_eh_frame_hdr field could > end up with a dangling pointer due to a dlclose call? > > If so, my argument is that, as long as the cache is up to date as of > the start of unwind, any attempt to access a dangling p_eh_frame_hdr > means that in-use code was dlclosed, in which case unwind is > guaranteed to fail anyway. The failure would just have different > symptoms with such a cache in place. But if you don't take a lock and look at the 8 entry cache without holding a lock, then some other thread might be in a middle of updating that structure and so it is definitely unsafe to access it. So the cache to be usable across all _Unwind_Find_FDE calls from the same _Unwind_ForcedUnwind, _Unwind_Resume, _Unwind_Resume_or_Rethrow, _Unwind_Backtrace call would need to be an automatic (set of) variable(s) from those functions or __thread with some way to flush that cache upon uw_install_context resp. return from _Unwind_Backtrace. Note that some of those functions may run some user code for each frame, in theory that code could e.g. dlclose libraries already left, but can't validly free the upper frames, thus perhaps it would be possible to just use such a cache. The big problem is that the unwinder is an ABI minefield, e.g. on some architectures _Unwind_Find_FDE lives in glibc as well as libgcc, and which one is called is unknown. So, to be able to handle this we'd perhaps need to introduce a wrapper around _Unwind_Find_FDE that would be passed an extra argument, ideally hidden in libgcc unwind-dw2-fde.c, verify that _Unwind_Find_FDE binds to the current TU _Unwind_Find_FDE symbol, look at the cache provided by an extra argument (without lock) and just proceed with the second half of _Unwind_IterarePhdrCallback (perhaps moved into a helper function?) in that case. If not found in the cache, do what _Unwind_Find_FDE does, but additionally record the details in the local cache in addition to the global one. If _Unwind_Find_FDE symbol binds to different _Unwind_Find_FDE, it would need to run that _Unwind_Find_FDE as a fallback (which often would mean that dl_iterate_phdr is called again by glibc), but generally that should happen only if .eh_frame is not present. The reason for having more than one _Unwind_Find_FDE is due to the pre .eh_frame_hdr world, there has to be a single registry, otherwise each FDE registry might be told about some subset of libraries, but you might not know all of them. And at some point glibc was also exporting such registry (it linked libgcc.a into it and that at some point also contained the unwinder). Jakub
Re: Excessive calls to iterate_phdr during exception handling
On Wed, May 29, 2013 at 2:47 AM, Ian Lance Taylor wrote: > On Mon, May 27, 2013 at 3:20 PM, Ryan Johnson > wrote: >> >> I have a large C++ app that throws exceptions to unwind anywhere from 5-20 >> stack frames when an error prevents the request from being served (which >> happens rather frequently). Works fine single-threaded, but performance is >> terrible for 24 threads on a 48-thread Ubuntu 10 machine. Profiling points >> to a global mutex acquire in __GI___dl_iterate_phdr as the culprit, with >> _Unwind_Find_FDE as its caller. >> >> Tracing the attached test case with the attached gdb script shows that the >> -DDTOR case executes ~20k instructions during unwind and calls iterate_phdr >> 12 times. The -DTRY case executes ~33k instructions and calls iterate_phdr >> 18 times. The exception in this test case only affects three stack frames, >> with minimal cleanup required, and the trace is taken on the second call to >> the function that swallows the error, to warm up libgcc's internal caches >> [1]. >> >> The instruction counts aren't terribly surprising---I know unwinding is >> complex---but might it be possible to throw and catch a previously-seen >> exception through a previously-seen stack trace, with something fewer than >> 4-6 global mutex acquires for each frame unwound? As it stands, the deeper >> the stack trace (= the more desirable to throw rather than return an error), >> the more of a scalability bottleneck unwinding becomes. My actual app would >> apparently suffer anywhere from 25 to 80 global mutex acquires for each >> exception thrown, which probably explains why the bottleneck arises... >> >> I'm bringing the issue up here, rather than filing a bug, because I'm not >> sure whether this is an oversight, a known problem that's hard to fix, or a >> feature (e.g. somehow required for reliable unwinding). I suspect the >> former, because _Unwind_Find_FDE tries a call to _Unwind_Find_registered_FDE >> before falling back to dl_iterate_phdr, but the former never succeeds in my >> trace (iterate_phdr is always called). > > The issue is dlclose followed by dlopen. If we had a cache ahead of > dl_iterate_phdr, we would need some way to clear out any information > cached from a dlclose'd library. Otherwise we might pick up the old > information when looking up an address from a new dlopen. So 1) > locking will always be required; 2) any caching system to reduce the > number of locks will require support for dlclose, somehow. It's worth > working on but there isn't going to be a simple solution. Maybe a simple solution like the one for threads ... a flag on whether we've seen a dlclose call yet (or on whether libdl is linked in, that is, make libdl provide an alternate implementation that interposes the one from glibc which wouldn't care for that case?) Richard. > Ian
Re: Excessive calls to iterate_phdr during exception handling
On 29/05/2013 3:36 AM, Jakub Jelinek wrote: Hi! On Wed, May 29, 2013 at 12:02:27AM -0400, Ryan Johnson wrote: Note, swapping the order of dl_iterate_phdr and _Unwind_Find_registered_FDE IMHO is fine. I think what you're saying is that the p_eh_frame_hdr field could end up with a dangling pointer due to a dlclose call? If so, my argument is that, as long as the cache is up to date as of the start of unwind, any attempt to access a dangling p_eh_frame_hdr means that in-use code was dlclosed, in which case unwind is guaranteed to fail anyway. The failure would just have different symptoms with such a cache in place. But if you don't take a lock and look at the 8 entry cache without holding a lock, then some other thread might be in a middle of updating that structure and so it is definitely unsafe to access it. The cache would be built from scratch as part of firing up the unwind machinery (and discarded at the end of unwind), or stored in TLS and thus thread-private. A global cache would require something like RCU, which I don't see happening any time soon. So the cache to be usable across all _Unwind_Find_FDE calls from the same _Unwind_ForcedUnwind, _Unwind_Resume, _Unwind_Resume_or_Rethrow, _Unwind_Backtrace call would need to be an automatic (set of) variable(s) from those functions or __thread with some way to flush that cache upon uw_install_context resp. return from _Unwind_Backtrace. Note that some of those functions may run some user code for each frame, in theory that code could e.g. dlclose libraries already left, but can't validly free the upper frames, thus perhaps it would be possible to just use such a cache. The big problem is that the unwinder is an ABI minefield, e.g. on some architectures _Unwind_Find_FDE lives in glibc as well as libgcc, and which one is called is unknown. So, to be able to handle this we'd perhaps need to introduce a wrapper around _Unwind_Find_FDE that would be passed an extra argument, ideally hidden in libgcc unwind-dw2-fde.c, verify that _Unwind_Find_FDE binds to the current TU _Unwind_Find_FDE symbol, look at the cache provided by an extra argument (without lock) and just proceed with the second half of _Unwind_IterarePhdrCallback (perhaps moved into a helper function?) in that case. If not found in the cache, do what _Unwind_Find_FDE does, but additionally record the details in the local cache in addition to the global one. If _Unwind_Find_FDE symbol binds to different _Unwind_Find_FDE, it would need to run that _Unwind_Find_FDE as a fallback (which often would mean that dl_iterate_phdr is called again by glibc), but generally that should happen only if .eh_frame is not present. The reason for having more than one _Unwind_Find_FDE is due to the pre .eh_frame_hdr world, there has to be a single registry, otherwise each FDE registry might be told about some subset of libraries, but you might not know all of them. And at some point glibc was also exporting such registry (it linked libgcc.a into it and that at some point also contained the unwinder). Gotcha. Thanks for the historical context, that helps clear up why things are the way they are right now. And yes, the question of how to make the cache available is a tricky question that I'm not equipped to give an answer to, though the "extra struct member" trick has been used profitably before... Ryan
Re: Excessive calls to iterate_phdr during exception handling
On Tue, May 28, 2013 at 9:02 PM, Ryan Johnson wrote: > > Maybe I misunderstood... there's currently a (very small) cache > (unwind-dw2-fde-dip.c) that lives behind the loader mutex. It contains 8 > entries and each entry holds the start and end addresses for one loaded > object, along with a pointer to the eh_frame_header. The cache resides in > static storage, and so accessing it is always safe. > > I think what you're saying is that the p_eh_frame_hdr field could end up > with a dangling pointer due to a dlclose call? Yes, that can happen. > If so, my argument is that, as long as the cache is up to date as of the > start of unwind, any attempt to access a dangling p_eh_frame_hdr means that > in-use code was dlclosed, in which case unwind is guaranteed to fail anyway. > The failure would just have different symptoms with such a cache in place. > > Am I missing something? I think you're right about that. But what happens if the entry is not in the cache? Or, do you mean you want to look in the cache before calling dl_iterate_phdr? That should be safe but of course you still need a lock as multiple threads can be manipulating the cache at the same time. Ian
Re: Excessive calls to iterate_phdr during exception handling
On Tue, May 28, 2013 at 9:16 PM, Ryan Johnson wrote: > On 29/05/2013 12:01 AM, Ian Lance Taylor wrote: >> >> On Tue, May 28, 2013 at 8:50 PM, Ryan Johnson >> wrote: >>> >>> For example, it should be reasonably safe to let __cxa_allocate_exception >>> call dl_iterate_phdr in order to build a list of object headers valid at >>> the >>> time unwind begins. It already calls malloc, so allocating space for a >>> cache >>> (holding pointers to at most a few dozen object headers) wouldn't be so >>> terrible; __cxa_free_exception could release the space after the dust >>> settles. In order to optimize non-throw uses of unwinding, it might make >>> sense to build the cache somewhere besides __cxa_allocate_exception, but >>> the >>> basic concept doesn't change. >> >> Actually, when everything is working correctly with modern tools on a >> GNU/Linux system, the unwind code never calls malloc. That is a good >> thing, as otherwise it would be unsafe to throw from a signal handler. > > Hmm... How recent is "modern" ? Because the only obvious > __cxa_allocate_exception in the gcc-4.8 sources > (libstdc++-v3/libsupc++/eh_alloc.cc) sure seems to call malloc, and the > traces I got from throwing in code compiled by both gcc-4.6 and gcc-4.8 > (gold linker) show calls to malloc as well. Sorry, you're quite right. I've gotten out of the habit of thinking of C++. The unwind code does not call malloc, and that is important. But actually throwing an exception from C++ does call malloc, and that is problematic when trying to throw from a signal handler. I shouldn't have used that as an example. It's important that the unwind code proper does not call malloc so that we can reliably get stack traces from signal handlers, specifically from the SIGPROF signal. Ian
Re: Excessive calls to iterate_phdr during exception handling
On 29/05/2013 9:41 AM, Ian Lance Taylor wrote: On Tue, May 28, 2013 at 9:02 PM, Ryan Johnson wrote: Maybe I misunderstood... there's currently a (very small) cache (unwind-dw2-fde-dip.c) that lives behind the loader mutex. It contains 8 entries and each entry holds the start and end addresses for one loaded object, along with a pointer to the eh_frame_header. The cache resides in static storage, and so accessing it is always safe. I think what you're saying is that the p_eh_frame_hdr field could end up with a dangling pointer due to a dlclose call? Yes, that can happen. If so, my argument is that, as long as the cache is up to date as of the start of unwind, any attempt to access a dangling p_eh_frame_hdr means that in-use code was dlclosed, in which case unwind is guaranteed to fail anyway. The failure would just have different symptoms with such a cache in place. Am I missing something? I think you're right about that. But what happens if the entry is not in the cache? Or, do you mean you want to look in the cache before calling dl_iterate_phdr? That should be safe but of course you still need a lock as multiple threads can be manipulating the cache at the same time. Per-thread cache, either allocated and populated at the start of every unwind, or maintained in TLS with version checks against the dl adds/subs counts. The former reduces the lock grabbing to one per unwind; the latter would virtually eliminate locking during unwind, but would require changes to libc to expose the adds/subs counts in some lock-free way. And yes, if we can't afford to malloc(), then a fixed-size cache with a few dozen entries should suffice for the vast majority of apps, falling back to dl_iterate_phdr if the chosen cache size is too small. Emacs is the worst kitchen sink I can think of; it links against ~70 shared libraries, which would fit comfortably in ~2kB of tmp space [1]. Or we could use mmap... it's not officially on the async-safe list from posix, but I'd be shocked if it's actually unsafe to call from signal context in practice (and presumably gcc/glibc would be in a position to know that sort of thing). [1] if your target or app can't spare 2kB of RAM for unwind, you probably have plenty of other reasons not to use exceptions or unwinding already... Thoughts? Ryan
Please document `contrib/download_prerequisites'
It would probably be a good idea to mention: contrib/download_prerequisites on this page: http://gcc.gnu.org/install/prerequisites.html and probably on this page: http://gcc.gnu.org/install/download.html Sincerely, Michael Witten
Question about vectorization limit
Hi, In tree-vect-loop.c, it limits the vectorization only to loops that have 2 BBs: /* Inner-most loop. We currently require that the number of BBs is exactly 2 (the header and latch). Vectorizable inner-most loops look like this: (pre-header) | header <+ | || | +--> latch --+ | (exit-bb) */ if (loop->num_nodes != 2) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "not vectorized: control flow in loop."); return NULL; } Any insights why the limit is set to 2? We found that removing this limit actually improve performance for many applications. Thanks, Dehao
Re: Please document `contrib/download_prerequisites'
2013/5/30 Michael Witten : > It would probably be a good idea to mention: > > contrib/download_prerequisites > > on this page: > > http://gcc.gnu.org/install/prerequisites.html > > and probably on this page: > > http://gcc.gnu.org/install/download.html > > Sincerely, > Michael Witten Is this the page you are looking for?? http://gcc.gnu.org/wiki/InstallingGCC See 'Support libraries' section. Best regards, jasonwucj