[Bug rtl-optimization/70408] New: reusing the same call-preserved register would give smaller code in some cases

peter at cordes dot ca Fri, 25 Mar 2016 00:36:13 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408


            Bug ID: 70408
           Summary: reusing the same call-preserved register would give
                    smaller code in some cases
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

int foo(int);  // not inlineable
int bar(int a) {
  return foo(a+2) + 5 * foo (a);
}

gcc, clang, and icc all generate bigger code than necessary for x86.  gcc uses
two call-preserved registers to save `a` and `foo(a+2)`.  Besides the extra
push/pop, the second push leaves the stack misaligned, so a sub/add rsp,8 pair
is needed as well.
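
To make the alignment arithmetic concrete, here is a minimal sketch, assuming the x86-64 System V convention that rsp is 8 bytes past a 16-byte boundary on function entry (the return address has just been pushed).  The helper name `pad_before_call` is hypothetical and only illustrates the bookkeeping; it is not anything gcc exposes.

```c
/* Hypothetical helper, for illustration only.
 * Assumption: x86-64 System V, so at function entry rsp % 16 == 8
 * (the caller's call instruction pushed an 8-byte return address).
 * Returns the extra bytes to subtract from rsp so that it is
 * 16-byte aligned again at the next call instruction. */
static unsigned pad_before_call(unsigned pushes)
{
    /* Bytes consumed below the caller's aligned stack pointer:
     * the return address plus one 8-byte slot per push. */
    unsigned used = 8 + 8 * pushes;
    return (16 - used % 16) % 16;
}
```

With two pushes (the gcc output) this gives 8, matching the sub/add pair; with a single push (one reused register) it gives 0, so the pair is not needed.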

Combining data-movement with arithmetic wherever possible is also a win (using
lea), but gcc also misses out on that.

    # gcc6 snapshot 20160221 on godbolt (with -O3): http://goo.gl/dN5OXD
    pushq   %rbp
    pushq   %rbx
    movl    %edi, %ebx
    leal    2(%rdi), %edi      # why lea instead of add edi,2?
    subq    $8, %rsp
    call    foo                # foo(a+2)
    movl    %ebx, %edi
    movl    %eax, %ebp
    call    foo                # foo(a)
    addq    $8, %rsp
    leal    (%rax,%rax,4), %eax
    popq    %rbx
    addl    %ebp, %eax
    popq    %rbp
    ret

clang 3.8 makes essentially the same code (but wastes an extra mov because it
doesn't produce the result in %eax).

By hand, the best I can come up with is:

    push    %rbx
    lea     2(%rdi), %ebx          # stash ebx=a+2
    call    foo                    # foo(a)
    mov     %ebx, %edi
    lea     (%rax,%rax,4), %ebx    # reuse ebx to stash 5*foo(a)
    call    foo                    # foo(a+2)
    add     %ebx, %eax
    pop     %rbx
    ret

Note that I do the calls to foo() in the other order, which allows more folding
of MOV into LEA.  The savings from that are somewhat orthogonal to the savings
from reusing the same call-preserved register.
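
The reordering is legal because C leaves the evaluation order of the two operands of `+` unspecified, so the compiler may call foo(a) first.  As a rough sketch, the hand-written asm corresponds to the source-level form below.  The body of `foo` is a stand-in I made up so the example is self-contained, and `__attribute__((noinline))` is a GNU extension used here to mimic "not inlineable"; `bar_reordered` is a hypothetical name.

```c
/* Stand-in for the report's foo(); the body is an assumption made
 * only so this example links.  noinline is a GNU C extension. */
__attribute__((noinline)) int foo(int x)
{
    return 3 * x + 1;
}

/* Original form from the report. */
int bar(int a)
{
    return foo(a + 2) + 5 * foo(a);
}

/* Hand-reordered form: foo(a) is called first, so only one value
 * has to stay live across each call (a+2 across the first call,
 * then 5*foo(a) across the second). */
int bar_reordered(int a)
{
    int a_plus_2 = a + 2;        /* can be computed by the lea into ebx */
    int scaled = 5 * foo(a);     /* lea (%rax,%rax,4) after the call */
    return foo(a_plus_2) + scaled;
}
```

For a foo() without side effects the two forms return the same value; whether a compiler actually reuses one register for the second form is something to check in the generated asm, not something this sketch guarantees.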

Should I open a separate bug report for the failure to optimize by reordering
the calls?

I haven't tried to look closely at ARM or PPC code to see if they succeed at
combining data movement with math (probably worth testing with `foo(a) * 4`,
since x86's shift+add LEA is not widely available).  I didn't mark this as an
i386/x86-64 bug, because the reuse of call-preserved registers affects all
architectures.
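
A possible test case for other targets, following the `foo(a) * 4` suggestion above, might look like the sketch below.  The name `bar_shift` is hypothetical, and the body of `foo` is again a made-up stand-in; when only inspecting compiler output, a bare `int foo(int);` declaration would do instead.

```c
/* Stand-in definition (an assumption); replace with the bare
 * declaration `int foo(int);` when only looking at generated asm. */
__attribute__((noinline)) int foo(int x)
{
    return x + 1;
}

/* Variant using a multiply by 4, which is a plain left shift and so
 * does not depend on x86's shift+add LEA. */
int bar_shift(int a)
{
    return foo(a + 2) + 4 * foo(a);
}
```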


IDK if teaching gcc about either of these tricks would help with real code in
many cases, or how hard it would be.
