Hello. A couple of years ago I got really excited about the gcc "split stacks"
feature that was being developed, and I recently noticed that it is ready to
use now. I have therefore spent the last few days trying it with one of
my side projects (currently just a toy, but something I hope to have in
production one day: an event mediator that makes use of lightweight
coroutine-based threads to implement various protocols).

Yesterday, I integrated support for the new -fsplit-stack libgcc
__splitstack_*context functions (the ones that were added to allow coroutine 
libraries to save/restore the splitstack context; I've linked to the relevant 
mailing list threads below this paragraph), and I noticed that using 
__splitstack_releasecontext didn't actually seem to cause anything to get 
deallocated (watching with strace: mmap, no munmap).

http://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg20517.html
http://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg21898.html

After staring at it for a while in gdb, I figured out why: a pointer is being
passed where a pointer to a pointer is expected. The compiler does not catch
this because the value is cast and marshaled through the void *[10] array that
is used to store split stack contexts (which, btw, might be better represented
internally as a struct, to avoid issues like this ;P).

    __thread struct stack_segment *__morestack_segments
      __attribute__ ((visibility ("default")));
    ...
    __splitstack_getcontext (void *context[NUMBER_OFFSETS])
    {
      memset (context, 0, NUMBER_OFFSETS * sizeof (void *));
      context[MORESTACK_SEGMENTS] = (void *) __morestack_segments;
    ...
    __splitstack_releasecontext (void *context[10])
    {
      __morestack_release_segments (context[MORESTACK_SEGMENTS], 1);
    ...
    struct dynamic_allocation_blocks *
    __morestack_release_segments (struct stack_segment **pp, int free_dynamic)

As demonstrated by these snippets, __morestack_segments is a pointer to a 
stack_segment; it is being stored in the context as a void *, but is being 
removed from the context and being passed directly to 
__morestack_release_segments, which in turn expects a pointer to a pointer to a 
stack_segment, not just a bare pointer to a stack_segment. This is probably
quite simple to fix (although it might be more complex than just "add a &").
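
The mismatch can be reproduced in isolation. The following is only a sketch:
struct segment, the slot constants, and every function name are hypothetical
stand-ins, and the prev-first field layout is an assumption made for the
demonstration rather than a statement about libgcc's struct. It shows why the
compiler stays silent (void * converts to any object pointer type in C) and
why nothing ends up being released.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for libgcc's struct stack_segment. The prev-first
   layout is an assumption made for this demonstration. */
struct segment {
    struct segment *prev;
    struct segment *next;
    int released;
};

enum { SEGMENTS_SLOT = 0, CONTEXT_SLOTS = 10 };

/* Modeled on the signature of __morestack_release_segments: takes the
   ADDRESS of the list head, releases every segment, clears the head. */
static int release_segments(struct segment **pp)
{
    int count = 0;
    struct segment *s;

    assert(pp != NULL);
    for (s = *pp; s != NULL; s = s->next) {
        s->released = 1;
        count++;
    }
    *pp = NULL;
    return count;
}

/* The reported pattern: the slot holds the head pointer itself, yet it is
   passed where the address of a head pointer is expected. No diagnostic is
   required, because void * converts implicitly to struct segment **. The
   callee then reads the first segment's prev field as if it were the head. */
static int release_buggy(void *context[CONTEXT_SLOTS])
{
    return release_segments(context[SEGMENTS_SLOT]);
}

/* One possible correction: give the callee a real head variable to update. */
static int release_fixed(void *context[CONTEXT_SLOTS])
{
    struct segment *head = context[SEGMENTS_SLOT];
    int count = release_segments(&head);

    context[SEGMENTS_SLOT] = head;   /* NULL after the release */
    return count;
}

/* Builds a three-segment list, saves its head in a context, and returns
   how many segments the chosen release path walked. */
int released_count(int use_fix)
{
    struct segment seg[3] = { { 0 } };
    void *context[CONTEXT_SLOTS] = { 0 };

    seg[0].next = &seg[1];
    seg[1].prev = &seg[0];
    seg[1].next = &seg[2];
    seg[2].prev = &seg[1];

    context[SEGMENTS_SLOT] = (void *) &seg[0];
    return use_fix ? release_fixed(context) : release_buggy(context);
}
```

Under these assumptions released_count(0) returns 0, because the first
segment's prev link is NULL and is mistaken for an empty list, while
released_count(1) returns 3.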


While I am sending an e-mail regarding -fsplit-stack, though, I figured I would 
also mention some design issues I've noticed while using it. Some of these may 
just be "me being stupid" (as I've only been looking at this in depth over the 
last few days), but I at least have had this idea "on the back burner" for a 
long time now, and am actually integrating and consuming the APIs that are 
resulting. Feel free to ignore me.


1) The current implementation (maybe this is intended to change?) uses mmap() 
to allocate stack segments, which means that every allocation involves a system 
call, a lock in the kernel on a slow data structure (anon_vma), and has some 
non-zero probability of ending up with a separate VMA (which is not only slow,
but in my understanding uses up a limited resource: by default you can only
have about 64k VMAs per process).

Is it possible to instead expose the functionality for allocating stack 
segments out of libgcc for easy replacement by coroutine runtimes? I would 
really love to be able to use my existing memory manager to allocate the stack 
segments. I realize that this allocation routine would need to be able to 
operate with almost no stack: that isn't a problem (I can always pivot to 
another stack if I need any stack).
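
To make the request concrete, here is a rough sketch of what such a
replaceable allocator could look like. Everything in it is hypothetical:
libgcc exports no such hook, and the names set_segment_allocator,
segment_allocate, segment_release, pool_alloc, and pool_free are invented
for illustration. The default path uses mmap() as described above; the
replacement serves fixed-size slots from a static pool without a system call.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* HYPOTHETICAL hook types: libgcc exports nothing like this today. */
typedef void *(*segment_alloc_fn)(size_t size);
typedef void (*segment_free_fn)(void *base, size_t size);

/* Default behaviour, mirroring the mmap()-per-segment approach. */
static void *mmap_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

static void mmap_free(void *base, size_t size)
{
    munmap(base, size);
}

static segment_alloc_fn current_alloc = mmap_alloc;
static segment_free_fn current_free = mmap_free;

/* A coroutine runtime would call this once, before creating coroutines. */
void set_segment_allocator(segment_alloc_fn alloc_fn, segment_free_fn free_fn)
{
    current_alloc = alloc_fn;
    current_free = free_fn;
}

void *segment_allocate(size_t size) { return current_alloc(size); }
void segment_release(void *base, size_t size) { current_free(base, size); }

/* A toy replacement: fixed-size slots carved from a static pool. */
enum { POOL_SLOTS = 4, SLOT_SIZE = 16 * 1024 };
static _Alignas(16) unsigned char pool[POOL_SLOTS][SLOT_SIZE];
static unsigned char slot_used[POOL_SLOTS];

void *pool_alloc(size_t size)
{
    int i;

    if (size > SLOT_SIZE)
        return NULL;
    for (i = 0; i < POOL_SLOTS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = 1;
            return pool[i];
        }
    }
    return NULL;   /* pool exhausted */
}

void pool_free(void *base, size_t size)
{
    int i, found = 0;

    (void) size;
    for (i = 0; i < POOL_SLOTS; i++) {
        if (base == (void *) pool[i]) {
            slot_used[i] = 0;
            found = 1;
        }
    }
    assert(found);   /* base must have come from pool_alloc */
}
```

Neither pool function needs more than a few words of stack, which is the
property the real allocation routine would have to guarantee.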


2) I had seen a discussion on the mailing list regarding argument copying, and 
I must say I'm somewhat confused as to why it is sufficient to memcpy the 
arguments from the old stack to the new one: if I have an argument with a 
non-POD type that has a non-trivial copy constructor, it would seem like I need 
a copy operation to be able to use the object from the new stack (maybe, for 
example, it has an internal pointer).


3) If I have either blocked signals on my thread or have set up an alternate
signal stack with sigaltstack, I can get away with super-tiny stacks. However,
allocate_segment has a minimum stack size of MINSIGSTKSZ (I presume to allow
for signals), which on some systems I use (such as Mac OS X) I've seen set
as high as 32kB. (Meanwhile, MINSIGSTKSZ on Linux is smaller than a page, so
mmap() can't even allocate a region that small.)


4) 10 64-bit words for the splitstack context is a really large amount of 
space. :( I don't even have that much CPU-state (there are only 8 registers 
that really need to be saved when switching between coroutines). Considering
that the stack segments form a doubly-linked list, it would seem like
MORESTACK_SEGMENTS and CURRENT_SEGMENT are redundant. I also feel like
CURRENT_STACK could be worked around fairly well.
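
A sketch of the redundancy argument, using a hypothetical two-field struct
segment rather than libgcc's real one: given only the current segment, the
first segment of the list can be recovered by walking the prev links, at the
cost of a traversal on each lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a doubly-linked stack segment. */
struct segment {
    struct segment *prev;
    struct segment *next;
};

/* Recovers the head of the list from any segment in it. */
struct segment *first_segment(struct segment *current)
{
    if (current == NULL)
        return NULL;
    while (current->prev != NULL) {
        /* Sanity check: the links must agree in both directions. */
        assert(current->prev->next == current);
        current = current->prev;
    }
    return current;
}
```

Whether the saved context can afford that walk is a separate question; the
sketch only shows that the head pointer carries no information the current
segment lacks.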


5) As currently implemented, the stack space check is added to every single 
function. However, some functions do not actually use the stack (or might even 
be avoiding memory accesses entirely). When I look at the disassembly of my 
project, I see many references to __morestack and "cmp    %fs:0x70,%rsp" in 
functions that would otherwise be just a few instructions long. Functions that
don't use the stack should be able to skip the check.
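
As a manual stopgap, GCC documents a no_split_stack function attribute that
omits the split-stack prologue from a single function. The sketch below
applies it to a hypothetical helper, rotl32, that needs essentially no stack.
This is not the automatic behaviour asked for above, and it is only safe when
the function really does stay within whatever stack is left.

```c
#include <stdint.h>

/* no_split_stack: compile this function without the stack-space check.
   rotl32 is a hypothetical example of a function small enough for that. */
__attribute__((no_split_stack))
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                                    /* keep the shift in range */
    return (x << n) | (x >> ((32 - n) & 31));   /* rotate left by n bits */
}
```

Marking functions by hand like this does not scale to a whole project, which
is why having the compiler make the decision would be preferable.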


6) I have noticed that the purpose of having split stacks seems largely hobbled 
by the way the linker enforces humungous stacks on outgoing calls to 
non-split-stack code, even if that code isn't called. As an incredibly painful 
example: __splitstack_getcontext is not compiled with split-stack support, 
which means that the function I have to switch coroutines (called from every 
coroutine) allocates stack.

To explain what I mean by "even if that code isn't called": my code hardly ever
throws exceptions, but because I support them I end up with _Unwind_Resume in
most of my functions; I therefore get burned with giant stacks. It would seem
more ideal (although I see how this would be much more difficult) if there were
some way to only allocate the larger stack as the call is made to the
non-split-stack function, not when entering the split-stack one.

A lot of these problems would be solved if libgcc (and whatever friends, such 
as libsupc++) were themselves compiled with -fsplit-stack. Of course, I can't 
imagine that anyone would want to pay the performance penalty for that globally 
;P. So, is there some plan to either do that for the entire build, or to 
provide alternative versions of those libraries that can be linked to while 
using -fsplit-stack in your own code?

That said, I don't think that that entirely does away with this "uncalled 
function drags in stack requirements" problem, as I want to say the core issue 
comes down to how this interacts with the inliner. In many of these cases, the 
call to the non-split-stack function is in some leaf function of a giant call 
graph that was flattened to a single massive function during the optimization 
pass.

The result is that if you ever interact with non-split-stack code anywhere, you 
really need to be quite explicit about __noinline__ to keep it from tainting 
the stack requirements of other functions. Part of me feels like there must be 
a better way of handling the stack expansions (such as by putting it at the 
call-site in situations like this), although I realize that might be difficult 
with the linker in charge of it.
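
A sketch of the noinline workaround, with hypothetical function names. The
rarely taken path that calls into non-split-stack library code (snprintf
here) is fenced off in its own function, so that the reference to the
library symbol lives there instead of being flattened into every caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Cold path: the only function here that references library code.
   noinline keeps the optimizer from merging it into checked_div. */
__attribute__((noinline))
static void format_error(char *err, size_t cap, int code)
{
    snprintf(err, cap, "error %d", code);
}

/* Hot path: plain arithmetic, with the library call kept out of line. */
int checked_div(int a, int b, int *q, char *err, size_t cap)
{
    assert(q != NULL);
    if (b == 0) {
        format_error(err, cap, 1);
        return -1;
    }
    *q = a / b;
    return 0;
}
```

How much stack each of the two functions ends up demanding still depends on
what the linker does with them; the sketch only shows where the boundary is
drawn.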

A specific idea that might help, however, is to set things up so that the PLT 
actually handles the stack increases when you are linking to functions that are 
in a dynamic library. That way, calls to open (for example) would not cause the 
function that called it to suddenly require a large stack, but instead only as 
control is transferred to open would the stack size increase. (This might be 
quite complex, though.)


7) Using the linker to handle the transition between split-stack and 
non-split-stack code seems like a good way to solve the problem of "we need 
large stacks when hitting external code", but in staring at the resulting code 
I have in my project I'm seeing that it isn't reliable: if you have a pointer 
to a function the linker will not know what you are calling. In my case, this 
is coming up often due to using std::function.

More awkwardly, split-stack functions that mention (but do not call)
non-split-stack functions (such as to return their address) are being
mis-flagged by the linker. Honestly, I question whether the linker
fundamentally has enough information about what is going on to be able to make
sufficiently accurate decisions with regard to stack constraints to warrant
the painful abstraction breakage that split-stack uses. :(
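
The two shapes in question, reduced to a sketch with hypothetical names. In
invoke() the call site names no symbol at all, so nothing at link time says
what will run. In pick_handler() a library function (abs is used here purely
as an example) is named but never called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef int (*int_fn)(int);

/* Indirect call: the target is known only at run time. */
int invoke(int_fn fn, int arg)
{
    assert(fn != NULL);
    return fn(arg);
}

/* Mentions abs by taking its address, but never calls it. */
int_fn pick_handler(void)
{
    return abs;
}
```

std::function in C++ ends up in the first shape, since its target is reached
through a stored pointer.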

That said, I don't have a better solution to suggest right now (I really want
to say that having attributes available to declare split-stack functionality in
the code would be better, but that has other ramifications). I do have
concerns, though, that because of attempts to keep the ABI fixed, decisions made
now (when there seems to be only a single major user, Go) will lock in how the
mechanism is capable of functioning in the future.


Well, if you did read any of that, thanks for taking the time to do so. I 
really appreciate this feature you've been working on, and have been excited by 
it for a while now. When I ran into the aforementioned bug in the splitstack
context implementation, I figured I'd send an e-mail that explained it, and I
hope that I didn't then waste too much of people's time with my other
split-stack related musings. ;P

If some of these things are in the "yeah, changing that would be interesting,
we just don't have many people working on the feature" category, I'd be happy
to throw some patches towards it. I hesitate to just start sending patches over
the wall, however, without first doing some kind of verification that I have
any clue what I'm doing; I am certainly not sure how things would be
prioritized, or even really who is working on it. ;P

Sincerely,
Jay Freeman (saurik)
sau...@saurik.com
