Split Stack performance, signals
I have been experimenting with -fsplit-stack, in particular the performance
issues I have read about in Rust and Go relating to a "hot split", where a
function call in a tight loop continuously crosses a split. Originally I was
interested in it as an analogy for estimating the cost of other, similar
stack manipulations. I think the original idea was that __morestack would be
called extremely rarely. Unfortunately there is the edge case where a stack
segment boundary falls right in the middle of some hot function call.

I wrote a minimal routine called in a loop a few hundred million times and
adjusted the stack usage until I found the split. I was really surprised to
see how bad the hot-split problem can be. Presumably the actual allocation is
done on the first split; the subsequent ones should just link to the next
segment "quickly", so the one-time allocation cost can be ignored. I'm using
an older stock Ubuntu x64 GCC 4.8.4, by the way, but this code doesn't appear
to have changed recently.

It is taking >4000 clocks per __morestack call (about 1us on a 4GHz Haswell)!

For comparison, the difference between a non-split call and one with the
split prolog, in the optimistic case where __morestack is not called, is
< 1 clock. So the prolog is fairly efficient in execution time. It is
unfortunately large, at around 36 bytes per function. I was able to nibble
away slightly at both the speed and the size, but there isn't much to gain.

From examining the __morestack code, I found that the sigprocmask system
call is being made (twice?!) per __morestack, even when it should just need
to switch to the next allocated segment. I did read the reason for that
change: to allow signal handlers to be split-stack (ignoring that detail for
the moment). A quick experiment shows that removing the calls to
__morestack_block_signals and __morestack_unblock_signals brings the overhead
of the hot split down to around 60 clocks, which is much more reasonable.

However, in concept, simply switching stack segments *should* not be hugely
expensive. I made a proof-of-concept that does only the very minimal work
needed to switch from one segment to another. It is written in assembler
(conceptually in __morestack) and eliminates the call out to "C" on the
likely hot path, where the boundary has already been crossed before and the
next segment is already big enough. If you cache the details of the next/prev
segments (if present) in the space right below the bottom (limit) of each
stack segment, you can shrink the time down to 5-6 clocks. This is probably
close to the achievable floor, which was in part what I was trying to find
out.

Summary:
  prolog overhead, no call to __morestack:  < 1 clock
  stock call to __morestack (hot):          > 4000 clocks
  without signal blocking:                  < 60 clocks
  potential best case:                      < 6 clocks

I have noticed that both Go and Rust have now abandoned the split-stack
approach due to performance issues. In Go, the idea of having zillions of
tiny (go)co-routines or green threads is closer to my area of interest than
the Rust use case. Even on x64, I think there are still reasons for wanting
to break away from needing large linear stacks, and it may also be useful
for embedded applications. But in Go, apparently the fix is to copy the
stack (all of it?), which seems pretty drastic, expensive and really tricky.
At least it would only happen once. I was wondering whether there has been
any thought about doing more work to optimize -fsplit-stack? Does the Go
stack-copy implementation have other issues?
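For illustration, a minimal sketch of the kind of hot-split microbenchmark
described above. This is not the original test code; the PAD macro, the
frame sizes and the iteration count are arbitrary, and PAD has to be swept
by hand until the call in the loop lands exactly on a segment boundary.

    /* Hot-split microbenchmark sketch (not the original test).
       Build with:  gcc -O2 -fsplit-stack -DPAD=<bytes> hotsplit.c
       and sweep PAD until cycles/call jumps, i.e. until the callee's
       split-stack check fails on every iteration.  */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc */

    #ifndef PAD
    #define PAD 4096                /* tunable caller frame size, in bytes */
    #endif

    __attribute__ ((noinline)) static int callee (int x)
    {
      volatile char frame[128];     /* small fixed frame in the callee */
      frame[0] = (char) x;
      return x + frame[0];
    }

    __attribute__ ((noinline)) static int hot_loop (long iters)
    {
      volatile char pad[PAD];       /* burn stack so the call below sits
                                       near a segment boundary */
      pad[0] = 0;
      int sink = pad[0];
      for (long i = 0; i < iters; i++)
        sink += callee ((int) i);
      return sink;
    }

    int main (void)
    {
      const long iters = 300000000L;    /* "a few hundred million" */
      uint64_t t0 = __rdtsc ();
      int sink = hot_loop (iters);
      uint64_t t1 = __rdtsc ();
      printf ("%.1f cycles/call (sink=%d)\n",
              (double) (t1 - t0) / (double) iters, sink);
      return 0;
    }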
Another area I didn't explore is that certain leaf and small routines with a
known maximum stack usage could avoid needing the prolog at all. That might
ameliorate much of the size issue.

Bottom line: I don't know whether this is something anyone still has any
interest in, but in theory at least the "hot-split" problem could be improved
significantly. At least I learned what I was trying to learn, and I am
putting this out in case it is of use or interest to anyone.

-- Anders
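As an aside on the leaf-routine idea above: GCC has a no_split_stack
function attribute that simply omits the split-stack prologue for one
function, which gives a manual approximation of that optimization. A
sketch, with the caveat that nothing verifies the callers actually leave
enough headroom below the limit:

    /* Manual approximation of the leaf-routine idea: the no_split_stack
       attribute suppresses the split-stack prologue for this function.
       This is only safe if every caller is known to leave enough slack
       below the stack limit for this function's small, fixed frame;
       the compiler does not check that for you.  */
    __attribute__ ((no_split_stack))
    static int leaf_sum (const int *v, int n)
    {
      int s = 0;                    /* tiny frame, no further calls */
      for (int i = 0; i < n; i++)
        s += v[i];
      return s;
    }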
Re: Split Stack performance, signals
> ... you don't need to do any more work from then on. This saves lots of
> morestack calls, but perhaps more importantly it saves you from having to
> optimize everything from morestack on down. morestack is almost never
> called, so you can write it in a high-level language, put lots of
> debugging checks in it, etc.
>
> Another advantage is runtime predictability. Split stacks were not a
> large performance problem, but they were a large source of unexplained
> performance variance. We'd often see apparently no-op changes make
> runtime vary by +/- 5% because they change frame sizes by just a little
> bit and trigger or untrigger the hot split. So if you test a code change
> and get a 4% improvement, is your code really better code, or did you
> just get lucky? Maybe your code change slows things down by 1% but
> happens to avoid a hot split.
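As a rough illustration of why the copying approach depends on accurate
pointer maps (and why it looks so hard for C/C++, as discussed further
down the thread), here is a toy model of growing a stack by copying. It
bears no resemblance to the real Go runtime; the structure and all names
are invented.

    /* Toy model of growing a stack by copying: the "stack" is an array
       of word-sized slots, and a pointer map records which slots hold
       pointers into the stack itself.  Growing copies the slots and then
       relocates exactly those slots by the distance the stack moved.
       Without an accurate map, the relocation step is impossible.  */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct toy_stack
    {
      uintptr_t *slots;             /* stack contents, one word per slot */
      size_t cap;                   /* capacity in slots */
      size_t used;                  /* slots currently in use */
      const unsigned char *ptr_map; /* 1 = slot points into this stack */
    };

    static int toy_stack_grow (struct toy_stack *s, size_t new_cap)
    {
      uintptr_t *old = s->slots;
      uintptr_t *new_slots = malloc (new_cap * sizeof *new_slots);
      if (new_slots == NULL)
        return -1;
      memcpy (new_slots, old, s->used * sizeof *new_slots);

      /* Relocate every slot the pointer map marks as an internal
         pointer, by the amount the stack moved (modular arithmetic on
         uintptr_t gives the right result in both directions).  */
      uintptr_t delta = (uintptr_t) new_slots - (uintptr_t) old;
      for (size_t i = 0; i < s->used; i++)
        if (s->ptr_map[i])
          new_slots[i] += delta;

      free (old);
      s->slots = new_slots;
      s->cap = new_cap;
      return 0;
    }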
Re: Split Stack performance, signals
On Sun, Sep 13, 2015 at 11:49 AM, Ian Lance Taylor wrote:
> It's not so much allowing signal handlers to be split stack. It's
> handling the case of a signal occurring while the stack is being split.
> That is not so very unlikely, and it will crash your program, as you get
> a SIGSEGV while trying to handle whatever signal just arrived.

I see. In what I prototyped, it would work OK if the signals used limited
stack, because %rsp is always valid. It's no different than a regular prolog
that updates %rsp, except by an unusual amount. However, this means, first,
that signal handlers don't have very much guaranteed slack to run in; you'd
always have to keep some minimum. The other problem is that I didn't
consider that the signal handler itself might want __morestack. There is a
tiny window when fs:0x70, %rsp, and some of the links are out of sync, and
that window would be dangerous to re-enter. Obviously during the real
allocation (slower path) code, the right thing is to block signals. The
points below are all just thoughts on the fast paths.

If there were some way to hook the entry to any/all signal handlers, this
could be detected and fixed there. That would be very useful. Is there some
way to do it?

> gccgo avoids this overhead by marking all signal handlers as SA_ONSTACK,
> calling sigaltstack, and calling __splitstack_block_signals to tell the
> morestack code that it does not need to block signals while splitting
> the stack.

Regardless, would it be appropriate to either a) point out the big
performance benefit of using SA_ONSTACK and calling
__splitstack_block_signals, or b) even make that the default for
-fsplit-stack? If there were a way to hook sigaction,
__splitstack_block_signals could at least be defaulted unless some user
signal handler was installed. Are there any signals caught by default in
the large percentage of user programs that never call these routines?

Another possibility is option #3, ending on a fixed boundary, from
https://gcc.gnu.org/wiki/SplitStacks. This encodes both the location of the
stack bottom and %rsp into one register, which fixes a lot of the problem
of atomically updating %rsp with respect to signals. The __morestack would
still have to be quite careful, but I think it would work. I haven't tried
this for performance yet. I also tried option #5, calling a dedicated
function to adjust %rsp from the prolog, by the way; I couldn't get it as
fast or small as the current approach.

> In C/C++ it requires moving all unions that combine pointers with
> non-pointers off the stack.

Wouldn't it also mean knowing which globals, heap objects, and integers
cast from pointers might hold references into the stack too? It sounds
pretty much impossible to do for C/C++. Presumably Go doesn't have these
constructs.

-- Anders
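For reference, a minimal sketch of the setup the quoted text describes
gccgo as doing: a dedicated signal stack via sigaltstack, handlers
installed with SA_ONSTACK, and a call telling the split-stack runtime it
no longer needs to block signals. This is not the libgo code; the
__splitstack_block_signals prototype and its argument convention are
assumptions based on libgcc's generic-morestack.c and should be checked
there.

    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>

    /* Provided by libgcc when building with -fsplit-stack; prototype is
       an assumption, see libgcc/generic-morestack.c.  */
    extern void __splitstack_block_signals (int *new_value, int *old_value);

    static void handler (int sig) { (void) sig; }

    static void setup_signals_for_split_stack (void)
    {
      /* 1. Dedicated signal stack, so handlers never run on a split
            stack segment.  */
      stack_t ss;
      memset (&ss, 0, sizeof ss);
      ss.ss_size = SIGSTKSZ;
      ss.ss_sp = malloc (ss.ss_size);
      if (ss.ss_sp == NULL || sigaltstack (&ss, NULL) != 0)
        abort ();

      /* 2. Install handlers with SA_ONSTACK so they use that stack.  */
      struct sigaction sa;
      memset (&sa, 0, sizeof sa);
      sa.sa_handler = handler;
      sa.sa_flags = SA_ONSTACK;
      sigemptyset (&sa.sa_mask);
      if (sigaction (SIGUSR1, &sa, NULL) != 0)
        abort ();

      /* 3. Tell __morestack it may skip the sigprocmask calls (assumed
            convention: nonzero means "don't block signals").  */
      int dont_block = 1;
      __splitstack_block_signals (&dont_block, NULL);
    }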
Re: Split Stack performance, signals
...
>> Summary:
>>   prolog overhead, no call to __morestack:  < 1 clock
>>   stock call to __morestack (hot):          > 4000 clocks
>>   without signal blocking:                  < 60 clocks
>>   potential best case:                      < 6 clocks
>
> This sounds great.

The data structure I was experimenting with ended up not being very
different from struct stack_segment, so I am adapting my standalone test to
morestack.S in libgcc. It may not achieve quite 6 clock cycles within the
existing framework, but it should be pretty close, and it will be useful
enough as a larger-scale test to be worth a little effort.

I also played with using the modulo page-size lower boundary (option #3)
instead. It would have solved one problem with the atomic updates, but not
all of them, and it would require very finicky book-keeping. FWIW, it made
the prolog just slightly slower but around 50% shorter. So using fs:0x70
still appears to give the best performance and a good balance overall.

How difficult is it to modify the prologs that get generated? I think I
found the code that does that in i386.c and i386.md, but it is pretty
cryptic to me. Any pointers? I know exactly what I want the assembler to
look like. If I can change it, I can reduce the overhead from 36 bytes to
27 for best performance, or 21 for best size.

I have not yet played with Go. Keith mentioned having seen issues with
performance variation - is there a representative Go project that I could
build with gccgo as a good full-scale test/benchmark?

I tried compiling GCC itself with the _stock_ -fsplit-stack by adding it to
BOOT_CFLAGS. It did not go well: one of the code generator programs bombed.
But I didn't expect it to work easily. Maybe a bit less *full*-scale of a
test than that ;)
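To make the cached-links idea concrete, here is a rough sketch of the kind
of record that could sit just below each segment's limit. This is not the
actual libgcc struct stack_segment, and all the field names are made up; it
only illustrates what the assembler fast path would consult.

    /* Sketch only -- not the libgcc structure.  A small record cached
       just below the bottom (limit) of each stack segment, so the hot
       path in __morestack can hop to an already-allocated neighbouring
       segment with a handful of loads and stores instead of calling out
       to C.  */
    #include <stddef.h>

    struct split_stack_seg_cache
    {
      void *next_limit;   /* value to store into the TCB slot (fs:0x70)
                             when crossing into the next segment */
      void *next_sp;      /* initial %rsp to use in the next segment */
      void *prev_limit;   /* same two fields for returning to the
                             previous segment */
      void *prev_sp;
      size_t next_size;   /* usable size of the next segment, so the
                             fast path can check it is big enough */
    };

    /* Conceptually located at (segment limit - sizeof cache), i.e.
       reachable directly from the limit value already held in fs:0x70:
         struct split_stack_seg_cache *c =
           (struct split_stack_seg_cache *) limit - 1;  */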