On Wed, Sep 16, 2015 at 1:18 AM, Anders Oleson <and...@openpuma.org> wrote:
>
> How difficult is it to modify the prologs that get generated? I think
> I found the code that does that in i386.c and i386.md, but it is
> pretty cryptic to me. Any pointers? I know exactly what I want the
> assembler to look like. If so I can reduce the overhead from 36 bytes
> to 27 for best performance and 21 for best size.

The prologue is generated by ix86_expand_split_stack_prologue in
gcc/config/i386/i386.c.  Most of the instructions are produced by
calls to emit_insn, which generates RTL, a GCC intermediate
representation.  The RTL then goes through the subsequent optimization
passes, which is good because those passes do things like move the
unlikely call to morestack out of line when optimizing (the jump over
the call is marked as likely by using a REG_BR_PROB note).  The code is
complicated because it has to handle many different combinations of
-mregparm options and ABIs.
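
As a rough illustration of the check the prologue performs, here is a
toy model in plain C.  It is only a sketch: stack_limit, toy_morestack
and toy_prologue are hypothetical names invented for this example, and
the real prologue is emitted as RTL rather than written in C.  The
__builtin_expect hint plays roughly the role of the branch probability
note, telling the compiler that the slow path is unlikely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread stack limit that a real
   split-stack prologue would load from thread-local storage.  */
uintptr_t stack_limit;

/* Counts slow-path calls so the behavior can be observed.  */
int morestack_calls;

/* Hypothetical stand-in for the runtime routine that allocates a new
   stack segment.  Here it only records that it was called.  */
void toy_morestack(void)
{
    morestack_calls++;
}

/* Toy model of the prologue check.  The stack grows down, so the frame
   fits when sp - frame_size stays at or above the limit.  Returns 1 if
   the slow path was taken and 0 otherwise.  */
int toy_prologue(uintptr_t sp, uintptr_t frame_size)
{
    /* Precondition for this toy model: no unsigned wraparound.  */
    assert(frame_size <= sp);

    /* Mark the "need more stack" branch as unlikely so the compiler
       can keep the common path straight-line.  */
    if (__builtin_expect(sp - frame_size < stack_limit, 0)) {
        toy_morestack();
        return 1;
    }
    return 0;
}
```

With stack_limit set to 0x1000, toy_prologue(0x2000, 0x100) takes the
fast path, while toy_prologue(0x1080, 0x100) calls toy_morestack.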

The key to understanding what it is doing is probably understanding
RTL, which is well described in the GCC internals manual at
https://gcc.gnu.org/onlinedocs/gccint/RTL.html .  The key to
understanding machine-specific RTL like this is that the RTL you
generate must be matched by the instruction patterns in i386.md.


> I have not yet played with Go. Keith mentioned having seen issues with
> performance variations - is there a representative Go project that I
> could build as a good full scale test/benchmark with gccgo? I tried
> compiling GCC itself with the _stock_ -fsplit-stack by adding it to
> BOOT_CFLAGS. It did not go well. One of the code generator programs
> bombed, but I didn't expect it to work easily. Maybe a bit less
> *full* scale of a test than that ;)

Go benchmarks tend to focus on networking performance, which is
less interesting for what you are trying to do.  There are some
benchmarks in the test/bench directory
(https://github.com/golang/go/tree/master/test/bench).  They are not
great but they may help.

Ian
