https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64396
            Bug ID: 64396
           Summary: Missed optimization in post-loop register handling
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: matt at godbolt dot org

I noticed a missed opportunity in GCC that Clang and ICC seem to take
advantage of. All versions of GCC I tested (up to 4.9.0) seem to have the
same trouble. The following source (for x86_64) shows up the problem:

-----
#include <stdint.h>

#define add_carry32(sum, v)  __asm__("addl %1, %0 ;"          \
                                     "adcl $0, %0 ;"          \
                                     : "=r" (sum)             \
                                     : "g" ((uint32_t) v), "0" (sum))

unsigned sorta_checksum(const void* src, int n, unsigned sum)
{
  const uint32_t *s4 = (const uint32_t*) src;
  const uint32_t *es4 = s4 + (n >> 2);

  while( s4 != es4 ) {
    add_carry32(sum, *s4++);
  }

  add_carry32(sum, *(const uint16_t*) s4);
  return sum;
}
-----

$ g++ -O3 path-to-file -c
$ objdump -d file.o
...
  10:   74 24                   je     36 <_Z14sorta_checksumPKvij+0x36>
  12:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  18:   03 11                   add    (%rcx),%edx
  1a:   83 d2 00                adc    $0x0,%edx
  1d:   48 83 c1 04             add    $0x4,%rcx
  21:   48 39 c8                cmp    %rcx,%rax
  24:   75 f2                   jne    18 <_Z14sorta_checksumPKvij+0x18>
  26:   48 8d 4f 04             lea    0x4(%rdi),%rcx
  2a:   48 29 c8                sub    %rcx,%rax
  2d:   48 c1 e8 02             shr    $0x2,%rax
  31:   48 8d 4c 87 04          lea    0x4(%rdi,%rax,4),%rcx
...

(The example is a contrived version of the original code, which comes from
Solarflare's OpenOnload project.)

GCC optimizes the loop but then re-calculates the "s4" variable outside of
the loop (offsets 26 through 31 in the above code) before the last
add_carry32. ICC and Clang both realise that the "s4" value in the loop is
fine to re-use. GCC emits four extra instructions to calculate a value that
is already known to be in a register (%rcx) upon loop exit.

Compiler explorer links:
GCC 4.9.0: http://goo.gl/fi3p2J
ICC 13.0.1: http://goo.gl/PRTTc6
Clang 3.4.1: http://goo.gl/95JEQc
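
To make the redundancy concrete, the four post-loop instructions (offsets 26 through 31) can be modelled with plain integer arithmetic on addresses. This is only a sketch of what the listing appears to compute, not GCC's internal reasoning. It assumes %rdi holds the original src pointer and %rax holds es4 on loop exit, as the listing suggests; the helper names recomputed_s4 and check_recomputation are hypothetical and do not appear in the report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mirrors the four post-loop instructions,
   treating pointers as integer addresses. */
static uintptr_t recomputed_s4(uintptr_t src, uintptr_t es4)
{
    uintptr_t rcx = src + 4;      /* lea 0x4(%rdi),%rcx        */
    uintptr_t rax = es4 - rcx;    /* sub %rcx,%rax             */
    rax >>= 2;                    /* shr $0x2,%rax             */
    return src + rax * 4 + 4;     /* lea 0x4(%rdi,%rax,4),%rcx */
}

/* Hypothetical check: for every trip count from 1 to max_words, the
   recomputed address equals es4, which is also the value s4 holds when
   the loop exits. The count starts at 1 because this code is only
   reached after at least one iteration (the je at offset 10 skips it
   when s4 == es4 on entry). */
static void check_recomputation(uintptr_t src, unsigned max_words)
{
    for (unsigned words = 1; words <= max_words; ++words) {
        uintptr_t es4 = src + 4u * (uintptr_t) words;
        assert(recomputed_s4(src, es4) == es4);
    }
}
```

Under those assumptions, for k >= 1 loop iterations the sequence evaluates to src + 4*(k-1) + 4 = src + 4*k, which is es4. That is the value %rcx already holds when the jne at offset 24 falls through, which is why the recomputation is redundant.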