Re: Excessive calls to iterate_phdr during exception handling

2013-05-29 Thread Jakub Jelinek
Hi!

On Wed, May 29, 2013 at 12:02:27AM -0400, Ryan Johnson wrote:

Note: swapping the order of dl_iterate_phdr and _Unwind_Find_registered_FDE
is IMHO fine.

> I think what you're saying is that the p_eh_frame_hdr field could
> end up with a dangling pointer due to a dlclose call?
> 
> If so, my argument is that, as long as the cache is up to date as of
> the start of unwind, any attempt to access a dangling p_eh_frame_hdr
> means that in-use code was dlclosed, in which case unwind is
> guaranteed to fail anyway. The failure would just have different
> symptoms with such a cache in place.

But if you look at the 8-entry cache without holding a lock, then some
other thread might be in the middle of updating that structure, so it is
definitely unsafe to access it.

So, for the cache to be usable across all _Unwind_Find_FDE calls from the
same _Unwind_ForcedUnwind, _Unwind_Resume, _Unwind_Resume_or_Rethrow, or
_Unwind_Backtrace call, it would need to be an automatic (set of)
variable(s) in those functions, or __thread storage with some way to flush
the cache upon uw_install_context resp. return from _Unwind_Backtrace.

Note that some of those functions may run user code for each frame; in
theory that code could e.g. dlclose libraries already left behind, but it
can't validly free the upper frames, so perhaps it would be possible to use
such a cache.

The big problem is that the unwinder is an ABI minefield: on some
architectures _Unwind_Find_FDE lives in glibc as well as in libgcc, and
which one is called is unknown.  So, to handle this, we'd perhaps need to
introduce a wrapper around _Unwind_Find_FDE that would be passed an extra
argument, ideally hidden in libgcc's unwind-dw2-fde.c.  It would verify
that _Unwind_Find_FDE binds to the current TU's _Unwind_Find_FDE symbol,
look at the cache provided by the extra argument (without a lock), and in
that case just proceed with the second half of _Unwind_IteratePhdrCallback
(perhaps moved into a helper function?).  If the entry is not found in the
cache, it would do what _Unwind_Find_FDE does, but additionally record the
details in the local cache as well as the global one.  If the
_Unwind_Find_FDE symbol binds to a different _Unwind_Find_FDE, it would
need to run that _Unwind_Find_FDE as a fallback (which often would mean
that dl_iterate_phdr is called again by glibc), but generally that should
happen only if .eh_frame is not present.

The reason for having more than one _Unwind_Find_FDE dates to the
pre-.eh_frame_hdr world: there has to be a single registry, because
otherwise each FDE registry might be told about only some subset of the
libraries, and you might not know all of them.  And at some point glibc was
also exporting such a registry (it linked libgcc.a into itself, and that at
some point also contained the unwinder).

Jakub


Re: Excessive calls to iterate_phdr during exception handling

2013-05-29 Thread Richard Biener
On Wed, May 29, 2013 at 2:47 AM, Ian Lance Taylor  wrote:
> On Mon, May 27, 2013 at 3:20 PM, Ryan Johnson
>  wrote:
>>
>> I have a large C++ app that throws exceptions to unwind anywhere from 5-20
>> stack frames when an error prevents the request from being served (which
>> happens rather frequently). Works fine single-threaded, but performance is
>> terrible for 24 threads on a 48-thread Ubuntu 10 machine. Profiling points
>> to a global mutex acquire in __GI___dl_iterate_phdr as the culprit, with
>> _Unwind_Find_FDE as its caller.
>>
>> Tracing the attached test case with the attached gdb script shows that the
>> -DDTOR case executes ~20k instructions during unwind and calls iterate_phdr
>> 12 times. The -DTRY case executes ~33k instructions and calls iterate_phdr
>> 18 times. The exception in this test case only affects three stack frames,
>> with minimal cleanup required, and the trace is taken on the second call to
>> the function that swallows the error, to warm up libgcc's internal caches
>> [1].
>>
>> The instruction counts aren't terribly surprising---I know unwinding is
>> complex---but might it be possible to throw and catch a previously-seen
>> exception through a previously-seen stack trace, with something fewer than
>> 4-6 global mutex acquires for each frame unwound? As it stands, the deeper
>> the stack trace (= the more desirable to throw rather than return an error),
>> the more of a scalability bottleneck unwinding becomes. My actual app would
>> apparently suffer anywhere from 25 to 80 global mutex acquires for each
>> exception thrown, which probably explains why the bottleneck arises...
>>
>> I'm bringing the issue up here, rather than filing a bug, because I'm not
>> sure whether this is an oversight, a known problem that's hard to fix, or a
>> feature (e.g. somehow required for reliable unwinding). I suspect the
>> former, because _Unwind_Find_FDE tries a call to _Unwind_Find_registered_FDE
>> before falling back to dl_iterate_phdr, but the former never succeeds in my
>> trace (iterate_phdr is always called).
>
> The issue is dlclose followed by dlopen.  If we had a cache ahead of
> dl_iterate_phdr, we would need some way to clear out any information
> cached from a dlclose'd library.  Otherwise we might pick up the old
> information when looking up an address from a new dlopen.  So 1)
> locking will always be required; 2) any caching system to reduce the
> number of locks will require support for dlclose, somehow.  It's worth
> working on but there isn't going to be a simple solution.

Maybe a simple solution like the one for threads ... a flag recording
whether we've seen a dlclose call yet (or whether libdl is linked in at
all; that is, make libdl provide an alternate implementation interposing
the one from glibc, which wouldn't need to care about that case?)

Richard.

> Ian


Re: Excessive calls to iterate_phdr during exception handling

2013-05-29 Thread Ryan Johnson

On 29/05/2013 3:36 AM, Jakub Jelinek wrote:

> Hi!
>
> On Wed, May 29, 2013 at 12:02:27AM -0400, Ryan Johnson wrote:
>
> Note, swapping the order of dl_iterate_phdr and _Unwind_Find_registered_FDE
> IMHO is fine.
>
>> I think what you're saying is that the p_eh_frame_hdr field could
>> end up with a dangling pointer due to a dlclose call?
>>
>> If so, my argument is that, as long as the cache is up to date as of
>> the start of unwind, any attempt to access a dangling p_eh_frame_hdr
>> means that in-use code was dlclosed, in which case unwind is
>> guaranteed to fail anyway. The failure would just have different
>> symptoms with such a cache in place.
>
> But if you don't take a lock and look at the 8 entry cache without holding a
> lock, then some other thread might be in the middle of updating that
> structure and so it is definitely unsafe to access it.
The cache would be built from scratch as part of firing up the unwind 
machinery (and discarded at the end of unwind), or stored in TLS and 
thus thread-private. A global cache would require something like RCU, 
which I don't see happening any time soon.



> So, for the cache to be usable across all _Unwind_Find_FDE calls from the
> same _Unwind_ForcedUnwind, _Unwind_Resume, _Unwind_Resume_or_Rethrow, or
> _Unwind_Backtrace call, it would need to be an automatic (set of)
> variable(s) in those functions, or __thread storage with some way to flush
> that cache upon uw_install_context resp. return from _Unwind_Backtrace.
>
> Note that some of those functions may run some user code for each frame,
> in theory that code could e.g. dlclose libraries already left, but can't
> validly free the upper frames, thus perhaps it would be possible to just
> use such a cache.
>
> The big problem is that the unwinder is an ABI minefield, e.g. on some
> architectures _Unwind_Find_FDE lives in glibc as well as libgcc, and which
> one is called is unknown.  So, to be able to handle this we'd perhaps need
> to introduce a wrapper around _Unwind_Find_FDE that would be passed an
> extra argument, ideally hidden in libgcc unwind-dw2-fde.c, verify that
> _Unwind_Find_FDE binds to the current TU _Unwind_Find_FDE symbol,
> look at the cache provided by an extra argument (without lock) and just
> proceed with the second half of _Unwind_IteratePhdrCallback (perhaps moved
> into a helper function?) in that case.  If not found in the cache, do
> what _Unwind_Find_FDE does, but additionally record the details in the
> local cache in addition to the global one.  If _Unwind_Find_FDE symbol
> binds to a different _Unwind_Find_FDE, it would need to run that
> _Unwind_Find_FDE as a fallback (which often would mean that
> dl_iterate_phdr is called again by glibc), but generally that should
> happen only if .eh_frame is not present.
>
> The reason for having more than one _Unwind_Find_FDE is due to the pre
> .eh_frame_hdr world, there has to be a single registry, otherwise each FDE
> registry might be told about some subset of libraries, but you might not
> know all of them.  And at some point glibc was also exporting such
> registry (it linked libgcc.a into it and that at some point also
> contained the unwinder).
Gotcha. Thanks for the historical context, that helps clear up why 
things are the way they are right now.


And yes, how to make the cache available is a tricky question that I'm 
not equipped to answer, though the "extra struct member" trick has been 
used profitably before...


Ryan



Re: Excessive calls to iterate_phdr during exception handling

2013-05-29 Thread Ian Lance Taylor
On Tue, May 28, 2013 at 9:02 PM, Ryan Johnson
 wrote:
>
> Maybe I misunderstood... there's currently a (very small) cache
> (unwind-dw2-fde-dip.c) that lives behind the loader mutex. It contains 8
> entries and each entry holds the start and end addresses for one loaded
> object, along with a pointer to the eh_frame_header. The cache resides in
> static storage, and so accessing it is always safe.
>
> I think what you're saying is that the p_eh_frame_hdr field could end up
> with a dangling pointer due to a dlclose call?

Yes, that can happen.

> If so, my argument is that, as long as the cache is up to date as of the
> start of unwind, any attempt to access a dangling p_eh_frame_hdr means that
> in-use code was dlclosed, in which case unwind is guaranteed to fail anyway.
> The failure would just have different symptoms with such a cache in place.
>
> Am I missing something?

I think you're right about that.  But what happens if the entry is not
in the cache?  Or, do you mean you want to look in the cache before
calling dl_iterate_phdr?  That should be safe but of course you still
need a lock as multiple threads can be manipulating the cache at the
same time.

Ian


Re: Excessive calls to iterate_phdr during exception handling

2013-05-29 Thread Ian Lance Taylor
On Tue, May 28, 2013 at 9:16 PM, Ryan Johnson
 wrote:
> On 29/05/2013 12:01 AM, Ian Lance Taylor wrote:
>>
>> On Tue, May 28, 2013 at 8:50 PM, Ryan Johnson
>>  wrote:
>>>
>>> For example, it should be reasonably safe to let __cxa_allocate_exception
>>> call dl_iterate_phdr in order to build a list of object headers valid at
>>> the
>>> time unwind begins. It already calls malloc, so allocating space for a
>>> cache
>>> (holding pointers to at most a few dozen object headers) wouldn't be so
>>> terrible; __cxa_free_exception could release the space after the dust
>>> settles. In order to optimize non-throw uses of unwinding, it might make
>>> sense to build the cache somewhere besides __cxa_allocate_exception, but
>>> the
>>> basic concept doesn't change.
>>
>> Actually, when everything is working correctly with modern tools on a
>> GNU/Linux system, the unwind code never calls malloc.  That is a good
>> thing, as otherwise it would be unsafe to throw from a signal handler.
>
> Hmm... How recent is "modern"? Because the only obvious
> __cxa_allocate_exception in the gcc-4.8 sources
> (libstdc++-v3/libsupc++/eh_alloc.cc) sure seems to call malloc, and the
> traces I got from throwing in code compiled by both gcc-4.6 and gcc-4.8
> (gold linker) show calls to malloc as well.

Sorry, you're quite right.  I've gotten out of the habit of thinking
of C++.  The unwind code does not call malloc, and that is important.
But actually throwing an exception from C++ does call malloc, and that
is problematic when trying to throw from a signal handler.  I
shouldn't have used that as an example.

It's important that the unwind code proper does not call malloc so
that we can reliably get stack traces from signal handlers,
specifically from the SIGPROF signal.

Ian


Re: Excessive calls to iterate_phdr during exception handling

2013-05-29 Thread Ryan Johnson

On 29/05/2013 9:41 AM, Ian Lance Taylor wrote:

On Tue, May 28, 2013 at 9:02 PM, Ryan Johnson
 wrote:

>> Maybe I misunderstood... there's currently a (very small) cache
>> (unwind-dw2-fde-dip.c) that lives behind the loader mutex. It contains 8
>> entries and each entry holds the start and end addresses for one loaded
>> object, along with a pointer to the eh_frame_header. The cache resides in
>> static storage, and so accessing it is always safe.
>>
>> I think what you're saying is that the p_eh_frame_hdr field could end up
>> with a dangling pointer due to a dlclose call?
>
> Yes, that can happen.
>
>> If so, my argument is that, as long as the cache is up to date as of the
>> start of unwind, any attempt to access a dangling p_eh_frame_hdr means
>> that in-use code was dlclosed, in which case unwind is guaranteed to fail
>> anyway. The failure would just have different symptoms with such a cache
>> in place.
>>
>> Am I missing something?
>
> I think you're right about that.  But what happens if the entry is not
> in the cache?  Or, do you mean you want to look in the cache before
> calling dl_iterate_phdr?  That should be safe but of course you still
> need a lock as multiple threads can be manipulating the cache at the
> same time.
Per-thread cache, either allocated and populated at the start of every 
unwind, or maintained in TLS with version checks against the dl 
adds/subs counts. The former reduces the lock grabbing to one per 
unwind; the latter would virtually eliminate locking during unwind, but 
would require changes to libc to expose the adds/subs counts in some 
lock-free way.


And yes, if we can't afford to malloc(), then a fixed-size cache with a 
few dozen entries should suffice for the vast majority of apps, falling 
back to dl_iterate_phdr if the chosen cache size is too small. Emacs is 
the worst kitchen sink I can think of; it links against ~70 shared 
libraries, which would fit comfortably in ~2kB of tmp space [1]. Or we 
could use mmap... it's not officially on POSIX's async-signal-safe list, 
but I'd be shocked if it were actually unsafe to call from signal context 
in practice (and presumably gcc/glibc would be in a position to know 
that sort of thing).


[1] if your target or app can't spare 2kB of RAM for unwind, you 
probably have plenty of other reasons not to use exceptions or unwinding 
already...


Thoughts?
Ryan



Please document `contrib/download_prerequisites'

2013-05-29 Thread Michael Witten
It would probably be a good idea to mention:

  contrib/download_prerequisites

on this page:

  http://gcc.gnu.org/install/prerequisites.html

and probably on this page:

  http://gcc.gnu.org/install/download.html

Sincerely,
Michael Witten


Question about vectorization limit

2013-05-29 Thread Dehao Chen
Hi,

In tree-vect-loop.c, vectorization is limited to loops that have exactly 2 BBs:

  /* Inner-most loop.  We currently require that the number of BBs is
     exactly 2 (the header and latch).  Vectorizable inner-most loops
     look like this:

                        (pre-header)
                           |
                          header <--------+
                           | |            |
                           | +--> latch --+
                           |
                        (exit-bb)  */

  if (loop->num_nodes != 2)
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "not vectorized: control flow in loop.");
  return NULL;
}
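For concreteness, here is a hypothetical pair of loops (function names invented) illustrating what the check accepts and rejects. The first body is straight-line code, so the loop has just the header and latch blocks; the second has a branch in its body, giving the loop more than 2 BBs, so this check reports "not vectorized: control flow in loop" (though in practice if-conversion may flatten such loops before the vectorizer sees them):

```c
/* 2-BB inner loop (header + latch): passes the check.  */
void
saxpy (float *restrict a, const float *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] += 2.0f * b[i];
}

/* Branch inside the body: the loop has extra basic blocks, so the
   num_nodes != 2 check rejects it (unless if-conversion runs first).  */
void
clamp_pos (float *restrict a, const float *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    if (b[i] > 0.0f)
      a[i] = b[i];
}
```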

Any insights into why the limit is set to 2? We found that removing this
limit actually improves performance for many applications.

Thanks,
Dehao


Re: Please document `contrib/download_prerequisites'

2013-05-29 Thread Chung-Ju Wu
2013/5/30 Michael Witten :
> It would probably be a good idea to mention:
>
>   contrib/download_prerequisites
>
> on this page:
>
>   http://gcc.gnu.org/install/prerequisites.html
>
> and probably on this page:
>
>   http://gcc.gnu.org/install/download.html
>
> Sincerely,
> Michael Witten

Is this the page you are looking for?

http://gcc.gnu.org/wiki/InstallingGCC

See 'Support libraries' section.


Best regards,
jasonwucj