https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408
Bug ID: 70408
Summary: reusing the same call-preserved register would give smaller code in some cases
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

    int foo(int);  // not inlineable
    int bar(int a) {
      return foo(a+2) + 5 * foo (a);
    }

gcc (and clang and icc) all make bigger code than necessary for x86. gcc uses two call-preserved registers to save `a` and `foo(a+2)`. Besides the extra push/pop, stack alignment requires a sub/add esp,8 pair.

Combining data-movement with arithmetic wherever possible is also a win (using lea), but gcc also misses out on that.

    # gcc6 snapshot 20160221 on godbolt (with -O3): http://goo.gl/dN5OXD
        pushq   %rbp
        pushq   %rbx
        movl    %edi, %ebx
        leal    2(%rdi), %edi      # why lea instead of add rdi,2?
        subq    $8, %rsp
        call    foo                # foo(a+2)
        movl    %ebx, %edi
        movl    %eax, %ebp
        call    foo                # foo(a)
        addq    $8, %rsp
        leal    (%rax,%rax,4), %eax
        popq    %rbx
        addl    %ebp, %eax
        popq    %rbp
        ret

clang 3.8 makes essentially the same code (but wastes an extra mov because it doesn't produce the result in %eax).

By hand, the best I can come up with is:

        push    %rbx
        lea     2(%rdi), %ebx          # stash ebx=a+2
        call    foo                    # foo(a)
        mov     %ebx, %edi
        lea     (%rax,%rax,4), %ebx    # reuse ebx to stash 5*foo(a)
        call    foo                    # foo(a+2)
        add     %ebx, %eax
        pop     %rbx
        ret

Note that I do the calls to foo() in the other order, which allows more folding of MOV into LEA. The savings from that are somewhat orthogonal to the savings from reusing the same call-preserved register.

Should I open a separate bug report for the failure to optimize by reordering the calls?

I haven't tried to look closely at ARM or PPC code to see if they succeed at combining data movement with math (prob. worth testing with `foo(a) * 4` since x86's shift+add LEA is not widely available).
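To make the two call orders concrete, here is a minimal C sketch, not part of the original test case. It assumes a stand-in definition of foo() (the real one is opaque to the compiler), and the names `call_log`, `call_count`, `bar_a_first`, `bar_a2_first`, `bar4` and `check_orders_agree` are hypothetical, introduced only for illustration. C leaves the evaluation order of the operands of `+` unspecified, so either order is a valid implementation of bar(). In both hand-ordered versions only one value is live across each call, which is why a single call-preserved register could in principle suffice. Writing the source this way is not claimed to change what gcc emits.

```c
#include <assert.h>

/* Hypothetical logging state, used only to observe the call order. */
int call_log[8];
int call_count;

/* Stand-in for the report's foo(); the arithmetic is arbitrary.
   noinline (a GCC/clang attribute) keeps it an opaque call. */
__attribute__((noinline)) int foo(int x)
{
    if (call_count < 8)
        call_log[call_count] = x;
    call_count++;
    return 3 * x + 1;
}

/* The function from the report: call order is unspecified in C. */
int bar(int a)
{
    return foo(a + 2) + 5 * foo(a);
}

/* Order used by the hand-written asm: foo(a) first.
   Live across call 1: a+2.  Live across call 2: 5*foo(a). */
int bar_a_first(int a)
{
    int t = a + 2;
    int r = 5 * foo(a);
    return foo(t) + r;
}

/* Order used by the gcc output: foo(a+2) first.
   Live across call 1: a.  Live across call 2: foo(a+2). */
int bar_a2_first(int a)
{
    int r = foo(a + 2);
    return r + 5 * foo(a);
}

/* Variant suggested for targets without a shift+add LEA. */
int bar4(int a)
{
    return foo(a + 2) + 4 * foo(a);
}

/* Both explicit orders must agree with bar() for a stateless foo(). */
void check_orders_agree(int a)
{
    assert(bar_a_first(a) == bar(a));
    assert(bar_a2_first(a) == bar(a));
}
```

Compiling this with -O2 or -O3 and -S, and comparing how many call-preserved registers each function pushes, is one way to check a given compiler version against the report.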
I didn't mark this as an i386/x86-64 bug, because the reuse of call-preserved registers affects all architectures. IDK if teaching gcc about either of these tricks would help with real code in many cases, or how hard it would be.