https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82582
Bug ID: 82582
Summary: not quite optimal code for -2*x*y - 3*z: could use one
less LEA for smaller code without increasing critical
path latency for any input
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
int foo32(int x, int y, int z) {
return -2*x*y - 3*z;
}
gcc8.0.0 20171015 -O3 https://godbolt.org/g/tzBuHx
imull %esi, %edi # x*y
leal 0(,%rdx,4), %eax # needs a disp32 = 0
subl %eax, %edx # -3*z
negl %edi # -(x*y)
leal (%rdx,%rdi,2), %eax # result
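LEA can run on fewer execution ports than simple ALU instructions like ADD/SUB, and a scaled index with no base register can only be encoded with a 4-byte disp32 (0 here), which makes the first LEA 4 bytes longer than one with a base register.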
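To make the size difference concrete, here is a rough C sketch of the ModRM/SIB encoding for the LEA forms that appear in this report. `encode_lea32` and `NO_BASE` are hypothetical names for illustration only; the helper assumes 64-bit mode, no displacement other than the forced disp32, and only the low eight registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers for the low eight registers (no REX prefix needed). */
enum { RAX = 0, RCX = 1, RDX = 2, RBX = 3, RSI = 6, RDI = 7 };
#define NO_BASE (-1)

/*
 * Hypothetical helper: encode `lea (base,index,scale), dst32` for 64-bit
 * mode.  Only handles the cases discussed here: registers 0..7,
 * index != rsp, base != rbp.  Returns the instruction length in bytes.
 */
size_t encode_lea32(uint8_t out[7], int dst, int base, int index, int scale_log2)
{
    size_t n = 0;
    assert(dst >= 0 && dst < 8 && index >= 0 && index < 8 && index != 4);
    assert(base == NO_BASE || (base >= 0 && base < 8 && base != 5));
    assert(scale_log2 >= 0 && scale_log2 <= 3);

    out[n++] = 0x8D;                          /* LEA opcode */
    out[n++] = (uint8_t)((dst << 3) | 4);     /* ModRM: mod=00, rm=100 -> SIB follows */
    if (base == NO_BASE) {
        /* SIB base=101 with mod=00 means "no base, disp32 follows". */
        out[n++] = (uint8_t)((scale_log2 << 6) | (index << 3) | 5);
        for (int i = 0; i < 4; i++)
            out[n++] = 0x00;                  /* mandatory disp32 = 0 */
    } else {
        out[n++] = (uint8_t)((scale_log2 << 6) | (index << 3) | base);
    }
    return n;
}
```

With this sketch, `encode_lea32(buf, RAX, NO_BASE, RDX, 2)` models `leal 0(,%rdx,4), %eax` and returns 7, while `encode_lea32(buf, RAX, RDX, RDX, 1)` models `leal (%rdx,%rdx,2), %eax` and returns 3.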
The critical-path latencies, assuming 2-operand imul is 3 cycles like on Intel and every other instruction here is 1 cycle:
x->res: imul, neg, lea = 5c
y->res: imul, neg, lea = 5c
z->res: lea, sub, lea = 3c
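As a sanity check on the instruction sequence above, the following C sketch models each gcc8 instruction as one statement. It uses unsigned 32-bit arithmetic so that wraparound matches the hardware without signed-overflow UB; the function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: -2*x*y - 3*z with wraparound (unsigned) arithmetic. */
uint32_t foo32_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (uint32_t)(0u - 2u * x * y - 3u * z);
}

/* One C statement per instruction of the gcc8 sequence. */
uint32_t foo32_gcc8(uint32_t edi, uint32_t esi, uint32_t edx)
{
    uint32_t eax;
    edi = edi * esi;       /* imull %esi, %edi          : x*y            */
    eax = edx * 4u;        /* leal  0(,%rdx,4), %eax    : 4*z            */
    edx = edx - eax;       /* subl  %eax, %edx          : z - 4*z = -3*z */
    edi = 0u - edi;        /* negl  %edi                : -(x*y)         */
    eax = edx + edi * 2u;  /* leal  (%rdx,%rdi,2), %eax : -3*z - 2*x*y   */
    return eax;
}

/* Compare the model against the reference on a small grid of inputs. */
void check_gcc8_model(void)
{
    static const uint32_t s[] = { 0u, 1u, 2u, 3u, 0x7FFFFFFFu,
                                  0x80000000u, 0xFFFFFFFFu };
    const int n = (int)(sizeof s / sizeof s[0]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                assert(foo32_gcc8(s[i], s[j], s[k]) ==
                       foo32_ref(s[i], s[j], s[k]));
}
```

Calling `check_gcc8_model()` exercises the model on a few edge values, including the sign-bit boundaries.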
This is better than gcc6.3 / gcc7.2 (which use 3 LEAs and are generally worse).
gcc4 / gcc5 make a different tradeoff: 6c from x to the result, but only 2c from
z to the result, so their code is neither better nor worse in all cases.
clang5.0 does better: same latencies, smaller code size, and trades one LEA for
an ADD:
imull %esi, %edi
addl %edi, %edi
leal (%rdx,%rdx,2), %eax
negl %eax
subl %edi, %eax
x->res: imul, add, sub = 5c
y->res: imul, add, sub = 5c
z->res: lea, neg, sub = 3c
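The per-input latencies for both sequences can be tallied mechanically. This is a minimal sketch, assuming 3 cycles for imul and 1 cycle for everything else; each instruction's result is ready at max(source ready times) + latency. All names (`op`, `ready_gcc8`, `ready_clang5`, `NOT_ON_PATH`) are hypothetical, and the model ignores port contention and front-end effects.

```c
#include <assert.h>

enum { LAT_IMUL = 3, LAT_ALU = 1, LAT_LEA = 1 };  /* assumed latencies (cycles) */
#define NOT_ON_PATH (-1000)  /* "ready long ago": keeps an input off the critical path */

static int max2(int a, int b) { return a > b ? a : b; }

/* Ready time of a result, given the ready times of its two sources. */
static int op(int src_a, int src_b, int latency)
{
    return max2(src_a, src_b) + latency;
}

/* gcc8 sequence; x, y, z are the cycles at which each input is ready. */
int ready_gcc8(int x, int y, int z)
{
    int xy  = op(x, y, LAT_IMUL);    /* imull %esi, %edi          */
    int z4  = op(z, z, LAT_LEA);     /* leal  0(,%rdx,4), %eax    */
    int m3z = op(z, z4, LAT_ALU);    /* subl  %eax, %edx          */
    int nxy = op(xy, xy, LAT_ALU);   /* negl  %edi                */
    return op(m3z, nxy, LAT_LEA);    /* leal  (%rdx,%rdi,2), %eax */
}

/* clang5.0 sequence, same conventions. */
int ready_clang5(int x, int y, int z)
{
    int xy  = op(x, y, LAT_IMUL);    /* imull %esi, %edi          */
    int xy2 = op(xy, xy, LAT_ALU);   /* addl  %edi, %edi          */
    int z3  = op(z, z, LAT_LEA);     /* leal  (%rdx,%rdx,2), %eax */
    int m3z = op(z3, z3, LAT_ALU);   /* negl  %eax                */
    return op(m3z, xy2, LAT_ALU);    /* subl  %edi, %eax          */
}

/* Both sequences should show the same latency from every input. */
void check_same_latencies(void)
{
    const int off = NOT_ON_PATH;
    assert(ready_gcc8(0, off, off) == ready_clang5(0, off, off));  /* x->res */
    assert(ready_gcc8(off, 0, off) == ready_clang5(off, 0, off));  /* y->res */
    assert(ready_gcc8(off, off, 0) == ready_clang5(off, off, 0));  /* z->res */
}
```

Making one input ready at cycle 0 and the others at `NOT_ON_PATH` gives the latency from that input alone, e.g. `ready_clang5(NOT_ON_PATH, NOT_ON_PATH, 0)` for z->res.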
related: gcc -m32 generates poor code for a 64-bit version of this expression
(foo64 below). I haven't checked other 32-bit architectures.
long long foo64(int x, int y, int z) {
return -2LL*x*(long long)y - 3LL*(long long)z;
}
// also on the godbolt link
gcc -m32 uses a 3-operand imul-immediate for `-2`, but some clunky shifting for
`-3`. There's also a mull in there.
clang5.0 -m32 makes very nice code, using a one-operand imul for -3 and just
shld/add + sub/sbb (plus some mov instructions). One-operand mul/imul is 3
uops on Intel with a throughput of one per 2 clocks, but ADC is 2 uops on Intel
pre-Broadwell, so it's nice to avoid extra ADC instructions.
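For readers unfamiliar with these building blocks, here is a C sketch of how a widening imul, shld/add, and sub/sbb can compose into foo64 using 32-bit halves. This is one plausible decomposition written for illustration, not a transcription of clang's actual output; `pair32` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t lo, hi; } pair32;  /* models a register pair like edx:eax */

/* One-operand imul: signed 32x32 -> 64-bit product in a register pair. */
pair32 imul_widening(int32_t a, int32_t b)
{
    uint64_t p = (uint64_t)((int64_t)a * (int64_t)b);
    pair32 r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}

/* shld $1, lo, hi ; add lo, lo : double a 64-bit value held in a pair. */
pair32 double_pair(pair32 v)
{
    pair32 r;
    r.hi = (v.hi << 1) | (v.lo >> 31);  /* shld shifts lo's top bit into hi */
    r.lo = v.lo + v.lo;                 /* add lo, lo */
    return r;
}

/* sub on the low halves, then sbb on the high halves. */
pair32 sub_pair(pair32 a, pair32 b)
{
    pair32 r;
    uint32_t borrow = a.lo < b.lo;      /* CF after the sub */
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - borrow;        /* sbb consumes the borrow */
    return r;
}

uint64_t foo64_model(int32_t x, int32_t y, int32_t z)
{
    pair32 xy2 = double_pair(imul_widening(x, y));  /* 2*x*y        */
    pair32 m3z = imul_widening(z, -3);              /* -3*z         */
    pair32 res = sub_pair(m3z, xy2);                /* -3*z - 2*x*y */
    return ((uint64_t)res.hi << 32) | res.lo;
}

/* Reference in wraparound 64-bit arithmetic (avoids signed-overflow UB). */
uint64_t foo64_ref(int32_t x, int32_t y, int32_t z)
{
    uint64_t xy  = (uint64_t)((int64_t)x * (int64_t)y);
    uint64_t z64 = (uint64_t)(int64_t)z;
    return UINT64_C(0) - 2u * xy - 3u * z64;
}

void check_foo64_model(void)
{
    static const int32_t s[] = { 0, 1, -1, 2, -3, 12345, -54321,
                                 INT32_MAX, INT32_MIN };
    const int n = (int)(sizeof s / sizeof s[0]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                assert(foo64_model(s[i], s[j], s[k]) ==
                       foo64_ref(s[i], s[j], s[k]));
}
```

The only carry-propagating step in this sketch is the final sub/sbb pair; the doubling uses shld instead of add/adc.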
related: add %esi,%esi / sbb %edi,%edi is an interesting way to sign-extend a
32-bit input into a pair of registers while doubling it. However, if it starts
in eax, cltd / add %eax,%eax is much better. (sbb same,same is only
recognized as dep-breaking on AMD Bulldozer-family and Ryzen. On Intel it has
a false dep on the old value of the register, not just CF).
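The two sign-extend-while-doubling sequences can be checked against each other with a small C model. This is only a sketch of the arithmetic (it says nothing about the dependency behaviour discussed above); `regpair` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t lo, hi; } regpair;  /* 64-bit value as two 32-bit registers */

/* add %esi,%esi ; sbb %edi,%edi */
regpair double_sext_add_sbb(int32_t x)
{
    uint32_t esi = (uint32_t)x;
    uint32_t cf  = esi >> 31;   /* add x,x carries out the old sign bit */
    regpair r;
    r.lo = esi + esi;           /* add %esi, %esi */
    r.hi = 0u - cf;             /* sbb %edi, %edi : edi - edi - CF = -CF */
    return r;
}

/* cltd ; add %eax,%eax */
regpair double_sext_cltd(int32_t x)
{
    uint32_t eax = (uint32_t)x;
    regpair r;
    r.hi = (eax >> 31) ? 0xFFFFFFFFu : 0u;  /* cltd: edx = copies of eax's sign bit */
    r.lo = eax + eax;                       /* add %eax, %eax */
    return r;
}

/* Both sequences should produce 2 * (int64_t)x. */
void check_double_sext(void)
{
    static const int32_t s[] = { 0, 1, -1, 5, -7, 0x40000000,
                                 INT32_MAX, INT32_MIN };
    for (int i = 0; i < (int)(sizeof s / sizeof s[0]); i++) {
        uint64_t want = (uint64_t)((int64_t)s[i] * 2);
        regpair a = double_sext_add_sbb(s[i]);
        regpair b = double_sext_cltd(s[i]);
        assert((((uint64_t)a.hi << 32) | a.lo) == want);
        assert((((uint64_t)b.hi << 32) | b.lo) == want);
    }
}
```

The high half is all-ones exactly when the input was negative, which is why `sbb same,same` after the doubling add can stand in for the sign extension.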