https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #18 from Mason <slash.tmp at free dot fr> ---

Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code
(in terms of instruction count, not convinced about scheduling).

    #include <x86intrin.h>
    typedef unsigned long long u64;
    void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
    {
      unsigned char c = 0;
      c = _addcarry_u64(c, A[0], B[0], dst+0);
      c = _addcarry_u64(c, A[1], B[1], dst+1);
      c = _addcarry_u64(c, A[2], B[2], dst+2);
      c = _addcarry_u64(c, A[3], B[3], dst+3);
    }

On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:

    add4i:
            movq    (%rdx), %rax
            addq    (%rsi), %rax
            movq    %rax, (%rdi)
            movq    8(%rsi), %rax
            adcq    8(%rdx), %rax
            movq    %rax, 8(%rdi)
            movq    16(%rsi), %rax
            adcq    16(%rdx), %rax
            movq    %rax, 16(%rdi)
            movq    24(%rdx), %rax
            adcq    24(%rsi), %rax
            movq    %rax, 24(%rdi)
            ret

I'll run a few benchmarks to test optimal scheduling.
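
Before benchmarking, it may help to have a portable reference to check the intrinsic version against. The following is only a sketch: add4_ref and add4_selftest are hypothetical names that do not appear in the comment above. It propagates the carry by hand with unsigned wrap-around comparisons instead of _addcarry_u64, so it is not expected to compile to the same add/adc chain.

```c
#include <assert.h>

typedef unsigned long long u64;

/* Hypothetical portable reference: 4-limb (256-bit) addition with the
   carry propagated explicitly. Returns the final carry-out (0 or 1). */
static unsigned add4_ref(u64 dst[4], const u64 A[4], const u64 B[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        u64 s = A[i] + B[i];       /* may wrap around */
        unsigned c1 = s < A[i];    /* carry out of A[i] + B[i] */
        u64 t = s + carry;         /* add the incoming carry */
        unsigned c2 = t < s;       /* carry out of adding the carry-in */
        dst[i] = t;
        carry = c1 | c2;           /* at most one of c1, c2 can be set */
    }
    return carry;
}

/* Hypothetical self-test: exercises a carry rippling through all limbs
   and a case with no carry at all. Returns 1 on success. */
static int add4_selftest(void)
{
    const u64 ones[4] = { ~0ULL, ~0ULL, ~0ULL, ~0ULL };
    const u64 one[4]  = { 1, 0, 0, 0 };
    const u64 x[4]    = { 1, 2, 3, 4 };
    const u64 y[4]    = { 10, 20, 30, 40 };
    u64 d[4];

    /* (2^256 - 1) + 1 wraps to zero with carry-out 1 */
    assert(add4_ref(d, ones, one) == 1);
    assert(d[0] == 0 && d[1] == 0 && d[2] == 0 && d[3] == 0);

    /* small limbs: no carry anywhere */
    assert(add4_ref(d, x, y) == 0);
    assert(d[0] == 11 && d[1] == 22 && d[2] == 33 && d[3] == 44);

    return 1;
}
```

On an x86-64 build, the limbs written by add4i for the same inputs could be compared against those written by add4_ref. Note that add4i as posted discards the final carry, so only dst is comparable.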