I have been experimenting with -fsplit-stack, in particular related to performance issues I have read about in rust and go relative to a "hot split" where a function call in a tight loop continuously crosses a split. Originally I was interested as an analogy to approximate performance issues with other similar stack manipulations. I think the original idea is that __morestack would be called extremely rarely. Unfortunately there is the edge case that a stack segment boundary falls right in the middle of some hot function call.
I wrote a minimal routine called in a loop a few hundred million times and I adjusted the stack usages until I found the split. I was really surprised to see how bad the hot-split problem can be. Presumably the actual allocation is done on the first split. The subsequent ones should just link to the next segment "quickly" so the one-time allocation cost can be ignored. I'm using an older stock Ubuntu x64 GCC 4.8.4 btw., but things don't appear to have changed recently. It is taking >4000 clocks per __morestack call (about 1us on 4GHz Haswell)! The difference between a non-split call and one with the split prolog in the optimistic case where __morestack is not called for comparison is < 1 clock. So the prolog is fairly efficient in execution time. It is unfortunately large at around 36 bytes per function. I was able to nibble away slightly at both the speed and size, but there isn't much to gain. >From examining the __morestack code, I found that the sigprocmask system call is being called (twice?!) per __morestack, even when it should just need to switch to the next allocated segment. I did read the reason for that change: to allow signal handlers to be split-stack, (ignoring that detail for the moment). A quick experiment shows that removing the calls to __morestack_block_signals and __morestack_unblock_signals brings the overhead of the hot split down to around 60 clocks which is much more reasonable. However in concept simply switching stack segments *should* not be hugely expensive. I made a proof-of-concept that only does the very minimal work to switch from one segment to another. This is done in assembler (conceptually in __morestack) and eliminates the call out to "C" on the likely hot path where the boundary has already been crossed and the next segment is already big enough. If you cache the details of the next/prev segments (if present) in the space right below the bottom (limit) of each stack segment, you can shrink the time down to 5-6 clocks. This is probably close to the achievable floor, which was in part what I was trying to find out. Summary: prolog overhead, no call to __morestack : < 1 clock stock call to __morestack (hot): > 4000 clocks without signal blocking: < 60 clocks potential best case: < 6 clocks I have noticed that both Go and Rust have now abandoned the split stack approach due to performance issues. In Go, the idea to have zillions of tiny (go)co-routines or green threads is closer to my interest area than the Rust use. Even on x64, I think there are still reasons for wanting to break out of needing large linear stacks. Or it may be useful to other embedded applications. But in Go, apparently the fix is to copy the stack (all of it?) which seems pretty drastic, expensive and really tricky. At least it would only happen once. I was wondering if there was any thought into doing more work to optimize the -fsplit-stack? Does the Go stack-copy implementation have other issues? Another area I didn't explore was that certain leaf and small routines with known maximum stack usage could avoid needing the prolog.This might ameliorate much of the size issue. Bottom line is that I don't know whether this is something anyone still has any interest in, but in theory at least the "hot-split" problem could be improved significantly. At least I learned what I was trying to, and I put this out in case it is of use/interest to anyone. -- Anders