https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #19 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #18)
> Hello Michael_S,
>
> As far as I can see, massaging the source helps GCC generate optimal code
> (in terms of instruction count, not convinced about scheduling).
>
> #include <x86intrin.h>
> typedef unsigned long long u64;
> void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
> {
>   unsigned char c = 0;
>   c = _addcarry_u64(c, A[0], B[0], dst+0);
>   c = _addcarry_u64(c, A[1], B[1], dst+1);
>   c = _addcarry_u64(c, A[2], B[2], dst+2);
>   c = _addcarry_u64(c, A[3], B[3], dst+3);
> }
>
>
> On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
> the expected:
>
> add4i:
>         movq    (%rdx), %rax
>         addq    (%rsi), %rax
>         movq    %rax, (%rdi)
>         movq    8(%rsi), %rax
>         adcq    8(%rdx), %rax
>         movq    %rax, 8(%rdi)
>         movq    16(%rsi), %rax
>         adcq    16(%rdx), %rax
>         movq    %rax, 16(%rdi)
>         movq    24(%rdx), %rax
>         adcq    24(%rsi), %rax
>         movq    %rax, 24(%rdi)
>         ret
>
> I'll run a few benchmarks to test optimal scheduling.

That's not merely "massaging the source". That's changing semantics.
Think about what happens when dst points to the middle of A or of B:
each store to dst can then modify an input limb that has not been read yet.
The change of semantics effectively prevented the vectorizer from doing harm.

And yes, for the common non-aliasing case the scheduling is problematic, too.
It would probably not cause a slowdown on the latest and greatest cores, but
it could be slow on less great cores, including your default Zen1.
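To make the aliasing point concrete, here is a minimal sketch in portable C. It assumes that the earlier variant discussed in the thread (not quoted in this comment) computed every limb before storing any of them; `add4i_temps` is a hypothetical stand-in for that shape, and `addc` is a hypothetical portable substitute for `_addcarry_u64` so the example does not depend on x86 intrinsics. `add4i_direct` mirrors the store-as-you-go version quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t u64;

/* Hypothetical portable stand-in for _addcarry_u64:
   *out = a + b + c, returns the carry out (0 or 1). */
static unsigned char addc(unsigned char c, u64 a, u64 b, u64 *out)
{
    u64 s = a + b;
    unsigned char c1 = (unsigned char)(s < a);  /* carry from a + b   */
    u64 r = s + c;
    unsigned char c2 = (unsigned char)(r < s);  /* carry from adding c */
    *out = r;
    return (unsigned char)(c1 | c2);
}

/* Store-as-you-go, like the version quoted in the comment:
   limb i is written to dst before limb i+1 of A and B is read. */
static void add4i_direct(u64 *dst, const u64 *A, const u64 *B)
{
    unsigned char c = 0;
    for (int i = 0; i < 4; i++)
        c = addc(c, A[i], B[i], dst + i);
}

/* Assumed shape of the earlier variant: all inputs are read and all
   sums computed into temporaries before anything is stored to dst. */
static void add4i_temps(u64 *dst, const u64 *A, const u64 *B)
{
    u64 t[4];
    unsigned char c = 0;
    for (int i = 0; i < 4; i++)
        c = addc(c, A[i], B[i], &t[i]);
    for (int i = 0; i < 4; i++)
        dst[i] = t[i];
}

/* Returns 1 if the two variants produce different dst contents.
   With overlap != 0, dst points one limb into A ("the middle of A"). */
static int results_differ(int overlap)
{
    u64 a1[5] = {1, 2, 3, 4, 0}, a2[5] = {1, 2, 3, 4, 0};
    u64 d1[4] = {0}, d2[4] = {0};
    const u64 B[4] = {10, 20, 30, 40};
    u64 *dst1 = overlap ? a1 + 1 : d1;
    u64 *dst2 = overlap ? a2 + 1 : d2;

    add4i_direct(dst1, a1, B);
    add4i_temps(dst2, a2, B);
    return memcmp(dst1, dst2, 4 * sizeof(u64)) != 0;
}

/* Small self-check: the variants agree without aliasing, disagree with it. */
void self_check(void)
{
    assert(results_differ(0) == 0);
    assert(results_differ(1) == 1);
}
```

With `dst == A + 1`, the direct version reads back its own freshly stored sums as inputs (each store to `dst[i]` lands on `A[i+1]`), while the temporaries version still sees the original `A`. Because the signature carries no `restrict`, a compiler has to preserve whichever of these behaviours the source specifies, which is why the two sources are not interchangeable.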