[Bug target/82582] New: not quite optimal code for -2*x*y - 3*z: could use one less LEA for smaller code without increasing critical path latency for any input

peter at cordes dot ca Tue, 17 Oct 2017 09:20:06 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82582


            Bug ID: 82582
           Summary: not quite optimal code for -2*x*y - 3*z: could use one
                    less LEA for smaller code without increasing critical
                    path latency for any input
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

int foo32(int x, int y, int z) {
    return -2*x*y - 3*z;
}

gcc8.0.0 20171015 -O3   https://godbolt.org/g/tzBuHx

        imull   %esi, %edi            # x*y
        leal    0(,%rdx,4), %eax    # needs a disp32 = 0
        subl    %eax, %edx            # -3*z
        negl    %edi                  # -(x*y)
        leal    (%rdx,%rdi,2), %eax   # result

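LEA can run on fewer execution ports than simple ALU instructions like ADD/SUB,
and a scaled index with no base register needs a 4-byte disp32 = 0 in its
encoding.  The critical-path latencies from each input to the result, assuming
2-operand imul is 3 cycles like on Intel: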

x->res: imul, neg, lea = 5c
y->res: imul, neg, lea = 5c
z->res:  lea, sub, lea = 3c
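
The latency numbers above are just sums along each dependency chain.  A minimal
sketch of that arithmetic in C, assuming 3 cycles for 2-operand imul (as stated
above) and 1 cycle for each of lea/sub/neg.  The names LAT_IMUL, LAT_SIMPLE,
chain_latency and the gcc8_* helpers are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed latencies in cycles: 3 for 2-operand imul (the report's Intel
   assumption), 1 for each simple lea / sub / neg. */
enum { LAT_IMUL = 3, LAT_SIMPLE = 1 };

/* Latency of a serial dependency chain = sum of its instruction latencies. */
int chain_latency(const int *lat, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += lat[i];
    return total;
}

/* x->res (and y->res) in the gcc8 sequence: imul, neg, lea */
int gcc8_x_to_result(void) {
    const int chain[] = { LAT_IMUL, LAT_SIMPLE, LAT_SIMPLE };
    return chain_latency(chain, sizeof chain / sizeof chain[0]);
}

/* z->res in the gcc8 sequence: lea, sub, lea */
int gcc8_z_to_result(void) {
    const int chain[] = { LAT_SIMPLE, LAT_SIMPLE, LAT_SIMPLE };
    return chain_latency(chain, sizeof chain / sizeof chain[0]);
}

/* Hypothetical self-check against the numbers quoted in the report. */
void check_gcc8_latencies(void) {
    assert(gcc8_x_to_result() == 5);
    assert(gcc8_z_to_result() == 3);
}
```

This only models latency along one chain at a time; it says nothing about
port pressure or throughput.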

This is better than gcc6.3 / gcc7.2 (which use 3 LEAs and are generally worse).
It's also different from gcc4/gcc5 (6c from x to result, but only 2c from z to
result, so neither sequence is better for every input).


clang5.0 does better: same latencies, smaller code size, and trades one LEA for
an ADD:
        imull   %esi, %edi
        addl    %edi, %edi
        leal    (%rdx,%rdx,2), %eax
        negl    %eax
        subl    %edi, %eax

x->res: imul, add, sub = 5c
y->res: imul, add, sub = 5c
z->res:  lea, neg, sub = 3c
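
As a sanity check that the two sequences compute the same thing, here is a
rough C model of each one, instruction by instruction.  It's only a sketch:
registers are modelled as uint32_t so wraparound is well-defined (the real
function uses signed int), and the function names are invented for this
example.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: -2*x*y - 3*z with wrapping 32-bit arithmetic. */
uint32_t ref_expr(uint32_t x, uint32_t y, uint32_t z) {
    return 0u - 2u * x * y - 3u * z;
}

/* Model of the gcc8 sequence quoted above. */
uint32_t gcc8_seq(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t edi = x * y;        /* imull %esi, %edi          */
    uint32_t eax = z * 4u;       /* leal  0(,%rdx,4), %eax    */
    uint32_t edx = z - eax;      /* subl  %eax, %edx   : -3*z */
    edi = 0u - edi;              /* negl  %edi                */
    return edx + edi * 2u;       /* leal  (%rdx,%rdi,2), %eax */
}

/* Model of the clang5.0 sequence quoted above. */
uint32_t clang5_seq(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t edi = x * y;        /* imull %esi, %edi          */
    edi += edi;                  /* addl  %edi, %edi          */
    uint32_t eax = z + z * 2u;   /* leal  (%rdx,%rdx,2), %eax */
    eax = 0u - eax;              /* negl  %eax                */
    return eax - edi;            /* subl  %edi, %eax          */
}

/* Count inputs (from a small set of edge cases) where either model
   disagrees with the reference expression. */
int count_mismatches(void) {
    static const uint32_t v[] = {
        0u, 1u, 2u, 3u, 12345u, 0x7fffffffu, 0x80000000u, 0xffffffffu
    };
    const int n = (int)(sizeof v / sizeof v[0]);
    int bad = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                uint32_t want = ref_expr(v[i], v[j], v[k]);
                if (gcc8_seq(v[i], v[j], v[k]) != want) bad++;
                if (clang5_seq(v[i], v[j], v[k]) != want) bad++;
            }
    return bad;
}

void check_sequences(void) {
    assert(count_mismatches() == 0);
}
```

Calling check_sequences() from a test harness exercises both models on the
edge-case inputs; it doesn't prove equivalence for all inputs.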



related: poor code-gen in 32-bit mode (-m32) for a 64-bit version of this.  I
haven't checked other 32-bit architectures.

long long foo64(int x, int y, int z) {
    return -2LL*x*(long long)y - 3LL*(long long)z;
}
// also on the godbolt link

gcc -m32 uses a 3-operand imul-immediate for `-2`, but some clunky shifting for
`-3`.  There's also a mull in there.

clang5.0 -m32 makes very nice code, using a one-operand imul for -3 and just
shld/add + sub/sbb (plus some mov instructions).  One-operand mul/imul is 3
uops on Intel with 2 clock throughput, but ADC is 2 uops on Intel
pre-Broadwell, so it's nice to avoid that.

related: add %esi,%esi / sbb %edi,%edi  is an interesting way to sign-extend a
32-bit input into a pair of registers while doubling it.  However, if it starts
in eax,  cltd / add %eax,%eax is much better.  (sbb same,same is only
recognized as dep-breaking on AMD Bulldozer-family and Ryzen.  On Intel it has
a false dep on the old value of the register, not just CF).
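
To illustrate why add/sbb works as a combined double + sign-extend, here's a
small C model of both idioms.  It's a sketch under assumptions: CF is modelled
as the old bit 31 of the input, the helper names are hypothetical, and the
final unsigned-to-signed conversion relies on two's complement behaviour (as
on gcc/clang).

```c
#include <assert.h>
#include <stdint.h>

/* Model of: add %esi,%esi ; sbb %edi,%edi
   Returns the 64-bit value left in edi:esi. */
int64_t add_sbb_model(int32_t x) {
    uint32_t esi = (uint32_t)x;
    uint32_t cf  = esi >> 31;    /* carry out of the add = old sign bit */
    esi += esi;                  /* add %esi,%esi : low half of 2*x      */
    uint32_t edi = 0u - cf;      /* sbb %edi,%edi : edi - edi - CF = -CF */
    return (int64_t)(((uint64_t)edi << 32) | esi);
}

/* Model of: cltd ; add %eax,%eax
   Returns the 64-bit value left in edx:eax. */
int64_t cltd_add_model(int32_t x) {
    uint32_t eax = (uint32_t)x;
    uint32_t edx = (x < 0) ? 0xffffffffu : 0u;  /* cltd: sign of eax */
    eax += eax;                                 /* add %eax,%eax     */
    return (int64_t)(((uint64_t)edx << 32) | eax);
}

/* Both models should match 2 * (long long)x for any 32-bit x. */
void check_sign_extend_models(void) {
    static const int32_t v[] = { 0, 1, 5, -1, -5, INT32_MAX, INT32_MIN };
    for (unsigned i = 0; i < sizeof v / sizeof v[0]; i++) {
        int64_t want = 2 * (int64_t)v[i];
        assert(add_sbb_model(v[i]) == want);
        assert(cltd_add_model(v[i]) == want);
    }
}
```

The model only shows that the two idioms produce the same register pair; the
dependency-breaking difference described above isn't something C can express.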
