...
>> Summary:
>> prolog overhead, no call to __morestack: < 1 clock
>> stock call to __morestack (hot): > 4000 clocks
>> without signal blocking: < 60 clocks
>> potential best case: < 6 clocks
>
> This sounds great.

The data structure I was experimenting with ended up being not very different from struct stack_segment, so I am adapting my standalone test to morestack.S in libgcc. It may not achieve quite 6 clock cycles within the existing framework, but it should be pretty close, and it will be useful enough as a larger-scale test to be worth a little effort.

I also played with using the modulo page-size lower boundary (option #5) instead. It would have solved one problem with atomic updates but not all of them, and it would require very finicky book-keeping. FWIW, it caused the prolog to slow down just slightly, but made it around 50% shorter. So using fs:0x70 still appears to give the best performance and a good balance overall.

How difficult is it to modify the prologs that get generated? I think I found the code that does that in i386.c and i386.md, but it is pretty cryptic to me. Any pointers? I know exactly what I want the assembler to look like. If that is doable, I can reduce the overhead from 36 bytes to 27 for best performance and 21 for best size.

I have not yet played with Go. Keith mentioned having seen issues with performance variations. Is there a representative Go project that I could build as a good full-scale test/benchmark with gccgo?

I tried compiling GCC itself with the _stock_ -fsplit-stack by adding it to BOOT_CFLAGS. It did not go well. One of the code generator programs bombed, but I didn't expect it to work easily. Maybe a bit less *full* scale of a test than that ;)
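
P.S. For anyone following along, the test that the prolog performs can be modelled in C roughly as below. This is only a sketch of the idea, not code from libgcc or from my test: the function names and the addresses are made up for illustration, and stack_limit simply stands in for the boundary value read from fs:0x70.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the split-stack prolog test; not code from libgcc.
   stack_limit stands in for the per-thread boundary read from %fs:0x70. */
bool needs_morestack(uintptr_t sp, uintptr_t frame_size, uintptr_t stack_limit)
{
    /* A frame larger than sp would wrap around, so treat it as overflow. */
    if (frame_size > sp)
        return true;
    /* The stack grows down: the new frame must stay at or above the limit. */
    return sp - frame_size < stack_limit;
}

/* Small self-check with made-up addresses. */
void check_examples(void)
{
    assert(!needs_morestack(0x8000, 0x100, 0x7000)); /* plenty of room */
    assert(needs_morestack(0x7080, 0x100, 0x7000));  /* would cross the limit */
    assert(!needs_morestack(0x7100, 0x100, 0x7000)); /* lands exactly on it */
}
```

In the common case the comparison is false and the function body runs straight away, which is why the prolog cost without a call to __morestack is so small; only when it is true does the slow path get taken.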