https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64396
Bug ID: 64396
Summary: Missed optimization in post-loop register handling
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: matt at godbolt dot org
I noticed a missed optimization in GCC that Clang and ICC both perform.
Every version of GCC I tested (up to 4.9.0) shows the same behaviour.
The following source (built for x86_64) demonstrates the problem:
-----
#include <stdint.h>
#define add_carry32(sum, v) __asm__("addl %1, %0 ;" \
                                    "adcl $0, %0 ;" \
                                    : "=r" (sum) \
                                    : "g" ((uint32_t) v), "0" (sum))
unsigned sorta_checksum(const void* src, int n, unsigned sum)
{
  const uint32_t *s4 = (const uint32_t*) src;
  const uint32_t *es4 = s4 + (n >> 2);
  while( s4 != es4 ) {
    add_carry32(sum, *s4++);
  }
  add_carry32(sum, *(const uint16_t*) s4);
  return sum;
}
-----
$ g++ -O3 path-to-file -c
$ objdump -d file.o
...
  10:   74 24                   je     36 <_Z14sorta_checksumPKvij+0x36>
  12:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  18:   03 11                   add    (%rcx),%edx
  1a:   83 d2 00                adc    $0x0,%edx
  1d:   48 83 c1 04             add    $0x4,%rcx
  21:   48 39 c8                cmp    %rcx,%rax
  24:   75 f2                   jne    18 <_Z14sorta_checksumPKvij+0x18>
  26:   48 8d 4f 04             lea    0x4(%rdi),%rcx
  2a:   48 29 c8                sub    %rcx,%rax
  2d:   48 c1 e8 02             shr    $0x2,%rax
  31:   48 8d 4c 87 04          lea    0x4(%rdi,%rax,4),%rcx
...
(The example is a contrived version of the original code, which comes
from Solarflare's OpenOnload project.)
GCC optimizes the loop but then recalculates the "s4" variable after the
loop (offsets 0x26 through 0x31 in the above listing) before the final
add_carry32. ICC and Clang both realise that the "s4" value left by the loop
can be reused. GCC emits four extra instructions to compute a value that is
already known to be in a register on loop exit.
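To make the redundancy concrete, here is a rough C model of what the four
post-loop instructions appear to compute. It assumes my reading of the
listing is right: %rdi holds "src" (first argument under the System V x86_64
ABI), %rax holds "es4" and %rcx holds "s4". The helper names recomputed_s4
and loop_exit_s4 are made up for illustration, and pointers are modelled as
plain integers. It only covers the path where the loop runs at least once
(n >= 4), since the je at offset 0x10 skips this code otherwise.

```c
#include <stdint.h>

/* Hypothetical model of offsets 0x26..0x31 (lea, sub, shr, lea).
   Pointers are modelled as integers; register mapping is an assumption. */
static uintptr_t recomputed_s4(uintptr_t src, uintptr_t es4)
{
    uintptr_t rcx = src + 4;          /* lea 0x4(%rdi),%rcx          */
    uintptr_t rax = es4 - rcx;        /* sub %rcx,%rax               */
    rax >>= 2;                        /* shr $0x2,%rax (iterations-1) */
    return src + 4 + rax * 4;         /* lea 0x4(%rdi,%rax,4),%rcx   */
}

/* Model of the value the loop itself leaves in the s4 register. */
static uintptr_t loop_exit_s4(uintptr_t src, int n)
{
    uintptr_t s4 = src;
    uintptr_t es4 = src + 4 * (uintptr_t)(n >> 2);
    while (s4 != es4) {
        s4 += 4;                      /* add $0x4,%rcx in the loop body */
    }
    return s4;
}
```

Under these assumptions both helpers return the same address for any n >= 4,
which is just "es4": the value the loop already holds in both %rcx and %rax
when it exits.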
Compiler explorer links:
GCC 4.9.0: http://goo.gl/fi3p2J
ICC 13.0.1: http://goo.gl/PRTTc6
Clang 3.4.1: http://goo.gl/95JEQc