Re: Split Stack performance, signals

2015-09-13 Thread Keith Randall
I can tell you a lot about the Go stack-copy implementation, I worked
on a lot of it.  The big drawback is that you need to ensure that you
have accurate stack maps for all frames on the stack.  This is
something we can do in Go with some work but would be more problematic
for C.  Most of that work is necessary for precise garbage collection
(GC) anyway, so any language with precise GC gets this for free.  Some
smaller things:

1) You need to be able to find any pointers from heap->stack
efficiently.  For Go, these were things like defer records, panic
records, and channel queues.

2) You need to be able to find contiguous memory for the stacks.  We
haven't seen this to be a problem in real-world servers.

3) You need to figure out when to shrink a stack.  In Go we shrink
stacks at GC time by finding stacks less than 1/4 used and shrinking
them by half.  Shrinking logic is complicated by the fact that the
shrinkee may be in a syscall, in which case shrinking is not possible
because you might have passed a stack pointer into that syscall.
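The shrink policy above can be sketched as follows (a minimal illustration in C; the names and signature are mine, not the Go runtime's):

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical sketch of the shrink policy described above: at GC
 * time, a stack less than 1/4 used is halved, unless the goroutine is
 * in a syscall (a stack pointer may have been passed into it). */
static size_t shrink_target(size_t stack_size, size_t used, bool in_syscall)
{
    if (in_syscall)
        return stack_size;          /* shrinking is not safe here */
    if (used < stack_size / 4)
        return stack_size / 2;      /* copy into a half-sized stack */
    return stack_size;              /* keep the current stack */
}
```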

The big advantage is that you only have to copy once (amortized).  The
stack grows as big as you need it and then you don't need to do any
more work from then on.  This saves lots of morestack calls, but
perhaps more importantly it saves you from having to optimize
everything from morestack on down.  morestack is almost never called,
so you can write it in a high-level language, put lots of debugging
checks in it, etc.

Another advantage is runtime predictability.  Split stacks were not a
large performance problem, but they were a large source of unexplained
performance variance.  We'd often see apparently no-op changes make
runtime vary by +/- 5% because they change frame sizes by just a
little bit and trigger or untrigger the hot split.  So if you test a
code change and get a 4% improvement, is your code really better code,
or did you just get lucky?  Maybe your code change slows things down
by 1% but happens to avoid a hot split.

On Sat, Sep 12, 2015 at 11:38 PM, Anders Oleson  wrote:
> I have been experimenting with -fsplit-stack, in particular related to
> performance issues I have read about in rust and go relative to a "hot
> split" where a function call in a tight loop continuously crosses a
> split. Originally I was interested as an analogy to approximate
> performance issues with other similar stack manipulations. I think the
> original idea is that __morestack would be called extremely rarely.
> Unfortunately there is the edge case that a stack segment boundary
> falls right in the middle of some hot function call.
>
> I wrote a minimal routine called in a loop a few hundred million times
> and I adjusted the stack usages until I found the split. I was really
> surprised to see how bad the hot-split problem can be. Presumably the
> actual allocation is done on the first split. The subsequent ones
> should just link to the next segment "quickly" so the one-time
> allocation cost can be ignored. I'm using an older stock Ubuntu x64
> GCC 4.8.4 btw., but things don't appear to have changed recently.
>
> It is taking >4000 clocks per __morestack call (about 1us on 4GHz Haswell)!
>
> The difference between a non-split call and one with the split prolog
> in the optimistic case where __morestack is not called for comparison
> is < 1 clock. So the prolog is fairly efficient in execution time. It
> is unfortunately large at around 36 bytes per function. I was able to
> nibble away slightly at both the speed and size, but there isn't much
> to gain.
>
> From examining the __morestack code, I found that the sigprocmask
> system call is being called (twice?!) per __morestack, even when it
> should just need to switch to the next allocated segment. I did read
> the reason for that change: to allow signal handlers to be
> split-stack, (ignoring that detail for the moment). A quick experiment
> shows that removing the calls to __morestack_block_signals and
> __morestack_unblock_signals brings the overhead of the hot split down
> to around 60 clocks which is much more reasonable.
>
> However in concept simply switching stack segments *should* not be
> hugely expensive. I made a proof-of-concept that only does the very
> minimal work to switch from one segment to another. This is done in
> assembler (conceptually in __morestack) and eliminates the call out to
> "C" on the likely hot path where the boundary has already been crossed
> and the next segment is already big enough. If you cache the details
> of the next/prev segments (if present) in the space right below the
> bottom (limit) of each stack segment, you can shrink the time down to
> 5-6 clocks. This is probably close to the achievable floor, which was
> in part what I was trying to find out.
>
> Summary:
>   prolog overhead, no call to __morestack : < 1 clock
>   stock call to __morestack (hot): > 4000 clocks
>   without signal blocking: < 60 clocks
>   potential best case: < 6 clocks
>
> I have noticed that both Go and Rust have now abandoned the split
> stack approach due to performance issues. In Go, the idea to have
> zillions of tiny (go)co-routines or green threads is closer to my
> interest area than the Rust use. Even on x64, I think there are still
> reasons for wanting to break out of needing large linear stacks. Or it
> may be useful to other embedded applications. But in Go, apparently
> the fix is to copy the stack (all of it?) which seems pretty drastic,
> expensive and really tricky. At least it would only happen once. I was
> wondering if there was any thought into doing more work to optimize
> the -fsplit-stack? Does the Go stack-copy implementation have other
> issues?
>
> Another area I didn't explore was that certain leaf and small routines
> with known maximum stack usage could avoid needing the prolog.  This
> might ameliorate much of the size issue.
>
> Bottom line is that I don't know whether this is something anyone
> still has any interest in, but in theory at least the "hot-split"
> problem could be improved significantly. At least I learned what I was
> trying to, and I put this out in case it is of use/interest to anyone.

Replacing malloc with alloca.

2015-09-13 Thread Ajit Kumar Agarwal
All:

The replacement of malloc with alloca can be done based on the following
analysis: if the lifetime of an object does not stretch beyond its
immediate scope, then the malloc can be replaced with alloca.  This
increases performance to a great extent.

Inlining helps to a great extent here: after inlining, it is easier to
see when the lifetime of an object doesn't stretch beyond its immediate
scope, and the opportunities for replacing malloc with alloca can be
identified.
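A minimal before/after sketch of the proposed transformation (illustrative source-level C only; any real GCC pass would work on GIMPLE, not source):

```c
#include <stdlib.h>
#include <string.h>
#include <alloca.h>

/* Before: heap allocation whose lifetime ends inside the function. */
static size_t count_upper_heap(const char *s)
{
    size_t n = strlen(s), hits = 0;
    char *buf = malloc(n + 1);           /* object never escapes */
    if (!buf)
        return 0;
    memcpy(buf, s, n + 1);
    for (size_t i = 0; i < n; i++)
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            hits++;
    free(buf);                           /* freed before return */
    return hits;
}

/* After: the same object placed on the stack with alloca.  Only valid
 * because the buffer's lifetime does not outlive the call. */
static size_t count_upper_stack(const char *s)
{
    size_t n = strlen(s), hits = 0;
    char *buf = alloca(n + 1);           /* no free needed */
    memcpy(buf, s, n + 1);
    for (size_t i = 0; i < n; i++)
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            hits++;
    return hits;
}
```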

I am wondering in which phases of our optimization pipeline malloc is
replaced with alloca, and what analysis is done to drive the
transformation, since it greatly increases the performance of the
benchmarks.  Is the analysis done through escape analysis?

If yes, then what data structure is used for the abstract
interpretation?

Thanks & Regards
Ajit


Re: pie in the sky: multi threaded linker

2015-09-13 Thread Manuel López-Ibáñez

On 13/09/15 05:32, Ian Lance Taylor wrote:

On Sat, Sep 12, 2015 at 2:11 PM, David Kunsman  wrote:

Hello... I am thinking about starting to hack on something and I found
https://gcc.gnu.org/wiki/Speedup_areas and one of the projects is a
multi-threaded linker.  I am just wondering if this is still an area
of interest... if it is, I plan on starting to work on it.


The gold linker, in the GNU binutils, supports multi-threaded operation.


I have updated the wiki page to reflect this. It would be great if someone with 
more knowledge went over it and did a thorough update.


David, as I said already to you, if you want to update the wiki yourself, I can 
give you editor access.


Cheers,

Manuel.


Re: Replacing malloc with alloca.

2015-09-13 Thread Florian Weimer
* Ajit Kumar Agarwal:

> The replacement of malloc with alloca can be done on the following
> analysis.
>
> If the lifetime of an object does not stretch beyond the immediate
> scope. In such cases the malloc can be replaced with alloca.  This
> increases the performance to a great extent.

You also need to make sure that the object is small (less than a page)
and that there is no deep recursion going on.  Otherwise, the program
may no longer work after the transformation with real-world restricted
stack sizes.  It may even end up with additional security issues.
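Florian's size condition, applied by hand, might look like this (a sketch; the 4096-byte cutoff is an arbitrary placeholder, and deep recursion would still need a separate guard):

```c
#include <stdlib.h>
#include <alloca.h>

#define STACK_ALLOC_LIMIT 4096   /* placeholder for "small, under a page" */

/* Sum a scaled copy of the input.  The temporary lives only inside
 * this call, so alloca is legal -- but we only take the stack path
 * when the object is small, falling back to the heap otherwise. */
static double scaled_sum(const double *v, size_t n, double k)
{
    size_t bytes = n * sizeof(double);
    int on_heap = bytes > STACK_ALLOC_LIMIT;
    double *tmp;
    double s = 0.0;

    if (on_heap)
        tmp = malloc(bytes);
    else
        tmp = alloca(bytes);
    if (!tmp)
        return 0.0;

    for (size_t i = 0; i < n; i++)
        tmp[i] = v[i] * k;
    for (size_t i = 0; i < n; i++)
        s += tmp[i];

    if (on_heap)
        free(tmp);                /* only the heap path needs a free */
    return s;
}
```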


Re: Split Stack performance, signals

2015-09-13 Thread Ian Lance Taylor
On Sat, Sep 12, 2015 at 11:38 PM, Anders Oleson  wrote:
>
> From examining the __morestack code, I found that the sigprocmask
> system call is being called (twice?!) per __morestack, even when it
> should just need to switch to the next allocated segment. I did read
> the reason for that change: to allow signal handlers to be
> split-stack, (ignoring that detail for the moment). A quick experiment
> shows that removing the calls to __morestack_block_signals and
> __morestack_unblock_signals brings the overhead of the hot split down
> to around 60 clocks which is much more reasonable.

It's not so much allowing signal handlers to be split stack.  It's
handling the case of a signal occurring while the stack is being
split.  That is not so very unlikely, and it will crash your program,
as you get a SIGSEGV while trying to handle whatever signal just
arrived.

gccgo avoids this overhead by marking all signal handlers as
SA_ONSTACK, calling sigaltstack, and calling
__splitstack_block_signals to tell the morestack code that it does not
need to block signals while splitting the stack.


> However in concept simply switching stack segments *should* not be
> hugely expensive. I made a proof-of-concept that only does the very
> minimal work to switch from one segment to another. This is done in
> assembler (conceptually in __morestack) and eliminates the call out to
> "C" on the likely hot path where the boundary has already been crossed
> and the next segment is already big enough. If you cache the details
> of the next/prev segments (if present) in the space right below the
> bottom (limit) of each stack segment, you can shrink the time down to
> 5-6 clocks. This is probably close to the achievable floor, which was
> in part what I was trying to find out.
>
> Summary:
>   prolog overhead, no call to __morestack : < 1 clock
>   stock call to __morestack (hot): > 4000 clocks
>   without signal blocking: < 60 clocks
>   potential best case: < 6 clocks

This sounds great.


> I have noticed that both Go and Rust have now abandoned the split
> stack approach due to performance issues. In Go, the idea to have
> zillions of tiny (go)co-routines or green threads is closer to my
> interest area than the Rust use. Even on x64, I think there are still
> reasons for wanting to break out of needing large linear stacks. Or it
> may be useful to other embedded applications. But in Go, apparently
> the fix is to copy the stack (all of it?) which seems pretty drastic,
> expensive and really tricky. At least it would only happen once. I was
> wondering if there was any thought into doing more work to optimize
> the -fsplit-stack? Does the Go stack-copy implementation have other
> issues?
>
> Another area I didn't explore was that certain leaf and small routines
> with known maximum stack usage could avoid needing the prolog.  This
> might ameliorate much of the size issue.
>
> Bottom line is that I don't know whether this is something anyone
> still has any interest in, but in theory at least the "hot-split"
> problem could be improved significantly. At least I learned what I was
> trying to, and I put this out in case it is of use/interest to anyone.

I'm interested in this.  But I'm also interested in moving gccgo to
the stack copying approach.  Copying the stack should be cheaper than
splitting the stack for a long running program.  And, as Keith said,
it makes program performance more predictable, which is valuable in
itself.

Copying the stack requires knowing precisely every pointer on the
stack.  In Go this can be done, since Go has no unions.  In C/C++ it
requires moving all unions that combine pointers with non-pointers off
the stack.
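A tiny example of the kind of C union that defeats a stack map -- no static map can say whether the slot holds a pointer, since it depends on a runtime tag:

```c
#include <stdint.h>

/* The same stack slot is either a pointer or an integer depending on
 * the 'tag' field; a stack copier cannot know which without running
 * the program's own logic. */
struct value {
    int tag;                 /* 0 = integer, 1 = pointer */
    union {
        intptr_t      num;   /* non-pointer payload */
        struct value *ref;   /* pointer payload */
    } u;
};

static int holds_pointer(const struct value *v)
{
    return v->tag == 1;
}
```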

Implementing this in GCC will require adding stackmaps with
pointer/non-pointer bits to the stack unwind information.  We'll need
the information both for the stack frame and for callee-saved
registers.  This will mean that REG_POINTER has to be accurate; I
don't know whether it really is.

Ian


Re: Replacing malloc with alloca.

2015-09-13 Thread Jeff Law

On 09/13/2015 12:28 PM, Florian Weimer wrote:

* Ajit Kumar Agarwal:


The replacement of malloc with alloca can be done on the following
analysis.

If the lifetime of an object does not stretch beyond the immediate
scope. In such cases the malloc can be replaced with alloca.  This
increases the performance to a great extent.


You also need to make sure that the object is small (less than a page)
and that there is no deep recursion going on.  Otherwise, the program
may no longer work after the transformation with real-world restricted
stack sizes.  It may even end up with additional security issues.
You also have to make sure you're not inside a loop.  Even a small 
allocation inside a loop is problematical from a security standpoint.


You also need to look at what other objects might be on the stack and 
you have to look at the functional scope, not the immediate scope as 
alloca space isn't returned until the end of a function.
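Jeff's loop caveat in code form (a sketch; both variants compile cleanly, which is exactly the danger):

```c
#include <alloca.h>
#include <string.h>

/* BAD: alloca space is reclaimed only at function return, so every
 * iteration enlarges the frame -- enough items means stack overflow. */
static size_t per_item_bad(const char **items, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        char *copy = alloca(strlen(items[i]) + 1);   /* accumulates! */
        strcpy(copy, items[i]);
        total += strlen(copy);
    }
    return total;
}

/* Safer: hoist a single allocation of the maximum size out of the loop. */
static size_t per_item_ok(const char **items, size_t n, size_t max_len)
{
    char *copy = alloca(max_len + 1);    /* one allocation per call */
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        strcpy(copy, items[i]);
        total += strlen(copy);
    }
    return total;
}
```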


jeff


Re: Replacing malloc with alloca.

2015-09-13 Thread Florian Weimer
* Jeff Law:

> On 09/13/2015 12:28 PM, Florian Weimer wrote:
>> * Ajit Kumar Agarwal:
>>
>>> The replacement of malloc with alloca can be done on the following
>>> analysis.
>>>
>>> If the lifetime of an object does not stretch beyond the immediate
>>> scope. In such cases the malloc can be replaced with alloca.  This
>>> increases the performance to a great extent.
>>
>> You also need to make sure that the object is small (less than a page)
>> and that there is no deep recursion going on.  Otherwise, the program
>> may no longer work after the transformation with real-world restricted
>> stack sizes.  It may even end up with additional security issues.

> You also have to make sure you're not inside a loop.  Even a small
> allocation inside a loop is problematical from a security standpoint.
>
> You also need to look at what other objects might be on the stack and
> you have to look at the functional scope, not the immediate scope as
> alloca space isn't returned until the end of a function.

Ah, right, alloca is unscoped (except when there are variable-length
arrays).

Using a VLA might be the better approach (though the size concerns
remain).  Introducing VLAs could alter program behavior in the presence
of a pre-existing alloca call, leading to premature deallocation.
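A small illustration of the scoping difference: a VLA is released when its block exits, so a per-iteration VLA does not accumulate the way per-iteration alloca calls would (a sketch in C99):

```c
#include <stddef.h>

/* The VLA 'tmp' is deallocated at the end of each loop iteration, so
 * the frame stays bounded; an alloca in the same spot would persist
 * until function return and grow the frame every iteration. */
static long sum_blocks(const long *data, size_t n, size_t block)
{
    long total = 0;
    for (size_t off = 0; off + block <= n; off += block) {
        long tmp[block];                 /* VLA: freed each iteration */
        for (size_t i = 0; i < block; i++)
            tmp[i] = data[off + i];
        for (size_t i = 0; i < block; i++)
            total += tmp[i];
    }
    return total;
}
```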



Re: Split Stack performance, signals

2015-09-13 Thread Anders Oleson
> I can tell you a lot about the Go stack-copy implementation, I worked on a 
> lot of it.

Thanks, Keith! I appreciate the insight.

> The big advantage is that you only have to copy once (amortized).  The stack 
> grows as big as you need it and then you don't need to do any more work from 
> then on.  This saves lots of morestack calls, but perhaps more importantly it 
> saves you from having to optimize everything from morestack on down.  
> morestack is almost never called, so you can write it in a high-level 
> language, put lots of debugging checks in it, etc.

Note: in my experiment, I only simulated pushing the hottest path out
of generic_morestack in C into assembly, more or less by pre-staging
all the information necessary to flip from one segment to the next,
already-valid one with a simple assembler stub that just interprets it
or bails out to the generic morestack. I ended up with about 15
assembler instructions; the test is actually quite a bit smaller than
the current morestack.S. I definitely agree that the main work should
be in a HLL, perhaps even more so than in the current state.
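Conceptually, the pre-staged per-segment record might look like this (the names and layout are mine, purely to illustrate the idea; the prototype keeps such a record just below each segment's limit for the assembler stub to interpret):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record cached below the bottom (limit) of every stack
 * segment.  With next/prev details pre-staged here, the hot path in
 * __morestack can flip segments without calling out to C. */
struct seg_cache {
    uintptr_t next_sp;      /* initial %rsp to load in the next segment */
    uintptr_t next_limit;   /* new stack-guard value (fs:0x70 on x86-64) */
    uintptr_t prev_sp;      /* where to resume when the call returns */
    size_t    next_size;    /* bail to the slow path if too small */
};

/* Fast-path decision: can we switch without allocating? */
static int can_fast_switch(const struct seg_cache *c, size_t frame_needed)
{
    return c->next_sp != 0 && c->next_size >= frame_needed;
}
```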

> Another advantage is runtime predictability.  Split stacks were not a large 
> performance problem, but they were a large source of unexplained performance 
> variance.  We'd often see apparently no-op changes make runtime vary by +/- 
> 5% because they change frame sizes by just a little bit and trigger or 
> untrigger the hot split.  So if you test a code change and get a 4% 
> improvement, is your code really better code, or did you just get lucky?  
> Maybe your code change slows things down by 1% but happens to avoid a hot 
> split.

This is a really interesting insight. Clearly the problem would be the
same for all GCC programs using -fsplit-stack, not just Go. What this
means is that you could make a change to some unrelated part of the
program, perhaps even reduce stack usage in some other routine, and
the overhead of some non-inline call somewhere completely else goes
from < 1 clock to > 4000. I think that +/- 5% fluctuation observed in
the wild is just related to the hot-split not being pathologically bad
(i.e. a fast function called in a loop). The natural call graph just
happens to be spanning the split or not as it likes.

But in the worst case the overall program execution time could go up
over 100:1 (presuming the function was a minimal ~40 clocks). That's a
hang, not a performance issue. Even without the system calls, it could
cause the overall program time to double or triple in the pathological
worst case.

I was really surprised to find that what looks like a function call
sometimes resulted in a couple system calls. I think that the signal
issue might be able to be handled a different way. This would at least
reduce the overhead by two orders of magnitude. I just read Ian's
response now and I hadn't noticed the __splitstack_block_signals
trick.

Now, does Go still use the same function prologs? I presume it does
and just replaces __morestack. The other problem is that the
prolog itself is quite big. I looked at the average size of functions
for example in vmlinux and it was around 216 bytes. So 36 additional
bytes is around 16% overhead. I wouldn't worry about that too much
except for its effect on i-cache.

I think the mitigation for this is to avoid the prolog on leaf
functions. I notice that at the moment even leaf functions that use
the red-zone optimization and don't touch %rsp still get the prolog
applied, which I think is unnecessary.

-- Anders

On Sun, Sep 13, 2015 at 12:17 AM, Keith Randall  wrote:
> I can tell you a lot about the Go stack-copy implementation, I worked
> on a lot of it.  The big drawback is that you need to ensure that you
> have accurate stack maps for all frames on the stack.  This is
> something we can do in Go with some work but would be more problematic
> for C.  Most of that work is necessary for precise garbage collection
> (GC) anyway, so any language with precise GC gets this for free.  Some
> smaller things:
>
> 1) You need to be able to find any pointers from heap->stack
> efficiently.  For Go, these were things like defer records, panic
> records, and channel queues.
>
> 2) You need to be able to find contiguous memory for the stacks.  We
> haven't seen this to be a problem in real-world servers.
>
> 3) You need to figure out when to shrink a stack.  In Go we shrink
> stacks at GC time by finding stacks less than 1/4 used and shrinking
> them by half.  Shrinking logic is complicated by the fact that the
> shrinkee may be in a syscall, in which case shrinking is not possible
> because you might have passed a stack pointer into that syscall.
>
> The big advantage is that you only have to copy once (amortized).  The
> stack grows as big as you need it and then you don't need to do any
> more work from then on.  This saves lots of morestack calls, but
> perhaps more importantly it saves you from having to optimize
> everything from morestack on down.  morestack is almost never called,
> so you can write it in a high-level language, put lots of debugging
> checks in it, etc.

Re: Split Stack performance, signals

2015-09-13 Thread Anders Oleson
> It's not so much allowing signal handlers to be split stack.  It's handling 
> the case of a signal occurring while the stack is being split.  That is not 
> so very unlikely, and it will crash your program, as you get a SIGSEGV while 
> trying to handle whatever signal just arrived.

I see. In what I prototyped, it would work OK as long as signal
handlers used limited stack, because %rsp is always valid. It's no
different from a regular prolog that updates %rsp, except by an unusual
amount. However, this means that, first, signal handlers don't have
very much guaranteed slack to run in; you'd have to always keep some
minimum. The other problem is that I didn't consider that the signal
handler itself might want __morestack. There is a tiny window during
which fs:0x70, %rsp, and some links are out of sync. This would be
dangerous to re-enter.

Obviously during real allocation (slower path) code, the right thing
is to block signals. Below is all just thoughts on the fast paths.

If there was some way to hook the entry to any/all signal handlers,
this could be detected and fixed there. This would be very useful. Is
there some way to do it?

Regardless, would it be appropriate to either a) point out the big
performance benefit of using SA_ONSTACK and
__splitstack_block_signals, or b) make that even a default for
-fsplit-stack? If there were a way to hook sigaction,
__splitstack_block_signals could at least be defaulted unless some
user signal handler was installed. Are there any signals caught by
default in the large percentage of user programs that never call these
routines?

Another possibility is option #3, ending on a fixed boundary, from
https://gcc.gnu.org/wiki/SplitStacks. This encodes both the location
of the stack bottom and %rsp into one register, which fixes much of
the problem of updating %rsp atomically with respect to signals. The
__morestack would still have to be quite careful, but I think it would
work. I haven't tried this for performance yet.

I also tried option #5, calling a dedicated function to adjust %rsp
from the prolog, btw. I couldn't get it as fast or small as the
current approach.

> In C/C++ it requires moving all unions that combine pointers with 
> non-pointers off the stack.

Wouldn't it also mean knowing which globals, heap data, and casted
integers might hold references to the stack too? It sounds pretty
much impossible to do for C/C++. Presumably Go doesn't have these
constructs.

-- Anders


On Sun, Sep 13, 2015 at 11:49 AM, Ian Lance Taylor  wrote:
> On Sat, Sep 12, 2015 at 11:38 PM, Anders Oleson  wrote:
>>
>> From examining the __morestack code, I found that the sigprocmask
>> system call is being called (twice?!) per __morestack, even when it
>> should just need to switch to the next allocated segment. I did read
>> the reason for that change: to allow signal handlers to be
>> split-stack, (ignoring that detail for the moment). A quick experiment
>> shows that removing the calls to __morestack_block_signals and
>> __morestack_unblock_signals brings the overhead of the hot split down
>> to around 60 clocks which is much more reasonable.
>
> It's not so much allowing signal handlers to be split stack.  It's
> handling the case of a signal occurring while the stack is being
> split.  That is not so very unlikely, and it will crash your program,
> as you get a SIGSEGV while trying to handle whatever signal just
> arrived.
>
> gccgo avoids this overhead by marking all signal handlers as
> SA_ONSTACK, calling sigaltstack, and calling
> __splitstack_block_signals to tell the morestack code that it does not
> need to block signals while splitting the stack.
>
>
>> However in concept simply switching stack segments *should* not be
>> hugely expensive. I made a proof-of-concept that only does the very
>> minimal work to switch from one segment to another. This is done in
>> assembler (conceptually in __morestack) and eliminates the call out to
>> "C" on the likely hot path where the boundary has already been crossed
>> and the next segment is already big enough. If you cache the details
>> of the next/prev segments (if present) in the space right below the
>> bottom (limit) of each stack segment, you can shrink the time down to
>> 5-6 clocks. This is probably close to the achievable floor, which was
>> in part what I was trying to find out.
>>
>> Summary:
>>   prolog overhead, no call to __morestack : < 1 clock
>>   stock call to __morestack (hot): > 4000 clocks
>>   without signal blocking: < 60 clocks
>>   potential best case: < 6 clocks
>
> This sounds great.
>
>
>> I have noticed that both Go and Rust have now abandoned the split
>> stack approach due to performance issues. In Go, the idea to have
>> zillions of tiny (go)co-routines or green threads is closer to my
>> interest area than the Rust use. Even on x64, I think there are still
>> reasons for wanting to break out of needing large linear stacks. Or it
>> may be useful to other embedded applications. But in Go, apparently
>> the fix is to copy the stack (all of it?) which seems pretty drastic,
>> expensive and really tricky. At least it would only happen once. I was
>> wondering if there was any thought into doing more work to optimize
>> the -fsplit-stack? Does the Go stack-copy implementation have other
>> issues?

gcc-6-20150913 is now available

2015-09-13 Thread gccadmin
Snapshot gcc-6-20150913 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/6-20150913/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 6 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/trunk revision 227730

You'll find:

 gcc-6-20150913.tar.bz2   Complete GCC

  MD5=634fa73da0f35d556bc85dc207df003b
  SHA1=e1b8097816d8f788166f840d597971ec70657be4

Diffs from 6-20150906 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-6
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.