https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #18 from Mason <slash.tmp at free dot fr> ---
Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code
(optimal in terms of instruction count; I'm not yet convinced about scheduling).

#include <x86intrin.h>
typedef unsigned long long u64;
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}


On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret

I'll run a few benchmarks to test optimal scheduling.
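
Before timing anything, it is probably worth cross-checking that the
intrinsic version computes the right sum. A minimal sketch of a portable
reference is below. It assumes only standard C; add4_ref is a hypothetical
helper name that does not appear in the code above, and unlike add4i it
returns the final carry-out so a test can inspect it.

```c
typedef unsigned long long u64;

/* Portable 4-limb (256-bit) addition, least significant limb first.
   Hypothetical reference helper, not part of the original comment.
   Returns the carry out of the top limb (0 or 1). */
static unsigned char add4_ref(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  for (int i = 0; i < 4; i++) {
    u64 s = A[i] + c;          /* add the incoming carry */
    unsigned char c1 = s < c;  /* wrapped only if A[i] was all ones and c == 1 */
    u64 t = s + B[i];          /* add the second operand */
    unsigned char c2 = t < s;  /* unsigned wrap-around means carry out */
    dst[i] = t;
    c = c1 | c2;               /* at most one of the two adds can wrap */
  }
  return c;
}
```

Feeding the same inputs to add4i and add4_ref and comparing the four output
limbs should catch a broken carry chain; operands made of all-ones limbs are
a good case because the carry has to ripple through every limb.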
