...
>> Summary:
>>   prolog overhead, no call to __morestack : < 1 clock
>>   stock call to __morestack (hot): > 4000 clocks
>>   without signal blocking: < 60 clocks
>>   potential best case: < 6 clocks
>
> This sounds great.

The data structure I was experimenting with ended up being not very
different from struct stack_segment, so I am adapting my standalone
test to morestack.S in libgcc. It may not quite reach 6 clock cycles
within the existing framework, but it should be pretty close. Either
way, it will be useful enough as a larger-scale test to be worth a
little effort.

I also played with using the modulo page-size lower boundary (option
#5) instead. It would have solved one problem with atomic updates but
not all of them, and it would require very finicky book-keeping. FWIW,
it slowed the prolog down just slightly, but the prolog was actually
around 50% shorter. So using fs:0x70 still appears to give the best
performance and a good balance overall.
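
For anyone following along, here is roughly what the fs:0x70 check
amounts to, written as C rather than assembler. This is only a sketch:
stack_limit and needs_morestack are made-up names for illustration, and
a real prolog reads the TCB slot directly instead of going through a
variable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread limit that the prolog reads
   at %fs:0x70. The name is an assumption made for this sketch. */
_Thread_local uintptr_t stack_limit;

/* Returns true when a frame of frame_size bytes below sp would cross
   the limit, i.e. when the prolog would have to call __morestack. */
bool needs_morestack(uintptr_t sp, uintptr_t frame_size)
{
    /* Guard against unsigned wrap-around for absurdly large frames. */
    if (frame_size > sp)
        return true;
    return sp - frame_size < stack_limit;
}
```

The common case is a single compare and a not-taken branch, which is
why the prolog costs almost nothing when __morestack is not called.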

How difficult is it to modify the prologs that get generated? I think
I found the code that does that in i386.c and i386.md, but it is
pretty cryptic to me. Any pointers? I know exactly what I want the
assembler to look like. If I can get that generated, I can reduce the
overhead from 36 bytes to 27 for best performance and 21 for best size.

I have not yet played with Go. Keith mentioned having seen issues with
performance variations - is there a representative Go project that I
could build as a good full-scale test/benchmark with gccgo? I tried
compiling GCC itself with the _stock_ -fsplit-stack by adding it to
BOOT_CFLAGS. It did not go well. One of the code generator programs
bombed, but I didn't expect it to work easily. Maybe something a bit
less *full* scale than that ;)
