On Wed, Sep 16, 2015 at 1:18 AM, Anders Oleson <and...@openpuma.org> wrote:
>
> How difficult is it to modify the prologs that get generated? I think
> I found the code that does that in i386.c and i386.md, but it is
> pretty cryptic to me. Any pointers? I know exactly what I want the
> assembler to look like. If so I can reduce the overhead from 36 bytes
> to 27 for best performance and 21 for best size.

The prologue is generated by ix86_expand_split_stack_prologue in
gcc/config/i386/i386.c. Most of the instructions are produced by
calls to emit_insn, which generates RTL, a GCC intermediate
representation. The RTL then goes through the subsequent optimization
passes. That is good, because those passes do things like move the
unlikely call to __morestack out of line when optimizing (the jump
over the call is marked as likely by a REG_BR_PROB note).

The code is complicated because it has to handle many different
combinations of -mregparm options and ABIs. The key to understanding
what it is doing is probably understanding RTL, which is well
described in the GCC internals manual at
https://gcc.gnu.org/onlinedocs/gccint/RTL.html . The key to
understanding machine-specific RTL like this is that the RTL you
generate must be matched by the instruction patterns in i386.md.

> I have not yet played with Go. Keith mentioned having seen issues with
> performance variations - is there a representative Go project that I
> could build as a good full scale test/benchmark with gccgo? I tried
> compiling GCC itself with the _stock_ -fsplit-stack by adding it to
> BOOT_CFLAGS. It did not go well. One of the code generator programs
> bombed, but I didn't expect it to work easily. Maybe a bit less
> *full* scale of a test than that ;)

Go benchmarks tend to focus on networking performance, which is less
interesting for what you are trying to do. There are some benchmarks
in the test/bench directory
(https://github.com/golang/go/tree/master/test/bench). They are not
great but they may help.

Ian