https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64396

            Bug ID: 64396
           Summary: Missed optimization in post-loop register handling
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: matt at godbolt dot org

I noticed a missed optimization in GCC that Clang and ICC both perform.
Every version of GCC I tested (up to 4.9.0) shows the same behaviour.
The following source (for x86_64) demonstrates the problem:

-----
#include <stdint.h>

#define add_carry32(sum, v)  __asm__("addl %1, %0 ;"  \
"adcl $0, %0 ;"  \
: "=r" (sum)  \
: "g" ((uint32_t) v), "0" (sum))

unsigned sorta_checksum(const void* src, int n, unsigned sum)
{
  const uint32_t *s4 = (const uint32_t*) src;
  const uint32_t *es4 = s4 + (n >> 2);

  while( s4 != es4 ) {
    add_carry32(sum, *s4++);
  }

  add_carry32(sum, *(const uint16_t*) s4);
  return sum;
}
-----

$ g++ -O3 path-to-file -c
$ objdump -d file.o
...
  10:    74 24                    je     36 <_Z14sorta_checksumPKvij+0x36>
  12:    66 0f 1f 44 00 00        nopw   0x0(%rax,%rax,1)
  18:    03 11                    add    (%rcx),%edx
  1a:    83 d2 00                 adc    $0x0,%edx
  1d:    48 83 c1 04              add    $0x4,%rcx
  21:    48 39 c8                 cmp    %rcx,%rax
  24:    75 f2                    jne    18 <_Z14sorta_checksumPKvij+0x18>
  26:    48 8d 4f 04              lea    0x4(%rdi),%rcx
  2a:    48 29 c8                 sub    %rcx,%rax
  2d:    48 c1 e8 02              shr    $0x2,%rax
  31:    48 8d 4c 87 04           lea    0x4(%rdi,%rax,4),%rcx
...

(The example is a contrived version of the original code, which comes
from Solarflare's OpenOnload project.)

GCC optimizes the loop but then recalculates the "s4" variable after the loop
(offsets 26 through 31 in the above disassembly) before the last add_carry32.
ICC and Clang both realise that the value "s4" holds at loop exit can be
reused. GCC emits four extra instructions to compute a value that is already
in a register (%rcx) on loop exit.
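
To make the redundancy concrete, here is a rough C model of the arithmetic at
offsets 26 through 31, set beside the value the loop leaves in its pointer
register. This is only a sketch of my reading of the disassembly: it assumes
%rdi holds "src" and %rax holds "es4", and that the je at offset 10 skips the
recomputation when the loop body never runs. The function names are made up
for illustration and do not appear in the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer value the loop itself holds when it exits (modelled as an integer). */
static uintptr_t loop_exit_pointer(uintptr_t src, size_t words)
{
    uintptr_t s4 = src;
    uintptr_t es4 = src + 4 * words;
    while (s4 != es4)
        s4 += 4;                        /* add  $0x4,%rcx */
    return s4;
}

/* Model of offsets 26..31; assumed to be reached only when words >= 1. */
static uintptr_t recomputed_pointer(uintptr_t src, size_t words)
{
    uintptr_t es4 = src + 4 * words;    /* assumed contents of %rax */
    uintptr_t tmp = src + 4;            /* lea  0x4(%rdi),%rcx */
    uintptr_t diff = es4 - tmp;         /* sub  %rcx,%rax */
    diff >>= 2;                         /* shr  $0x2,%rax */
    return src + 4 + 4 * diff;          /* lea  0x4(%rdi,%rax,4),%rcx */
}

/* Both routes should yield the same address for any non-empty loop. */
static void check_equivalence(uintptr_t src, size_t words)
{
    assert(words >= 1);
    assert(recomputed_pointer(src, words) == loop_exit_pointer(src, words));
}
```

Calling check_equivalence(0x1000, n) for any n >= 1 should pass, which is the
sense in which the four post-loop instructions look redundant: under the
assumptions above they reproduce "es4", the value the loop pointer already
equals on exit.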

Compiler explorer links:
GCC 4.9.0: http://goo.gl/fi3p2J
ICC 13.0.1: http://goo.gl/PRTTc6
Clang 3.4.1: http://goo.gl/95JEQc
