https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42612
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Dmitry Baksheev from comment #6)
> Please consider fixing this issue. Here is another example where not using
> post-increment for loops produces suboptimal code on AArch64. The code is 4x
> slower than LLVM-generated code for a dot-product function:
>
> double dotprod(std::size_t n,
>                const double* __restrict__ a,
>                const double* __restrict__ b)
> {
>     double ans = 0;
> #if __clang__
> #pragma clang loop vectorize(assume_safety)
> #else
> #pragma GCC ivdep
> #endif
>     for (std::size_t i = 0; i < n; ++i) {
>         ans += a[i] * b[i];
>     }
>     return ans;
> }
>
> Compile with: $(CXX) -march=armv8.2-a -O3 dp.cpp
>
> The GCC-generated loop does not use post-increment loads:
>
> .L3:
>         ldr     d2, [x1, x3, lsl 3]
>         ldr     d1, [x2, x3, lsl 3]
>         add     x3, x3, 1
>         fmadd   d0, d2, d1, d0
>         cmp     x0, x3
>         bne     .L3
>
> Clang emits this:
>
> .LBB0_4:
>         ldr     d1, [x10], #8
>         ldr     d2, [x8], #8
>         subs    x9, x9, #1
>         fmadd   d0, d1, d2, d0
>         b.ne    .LBB0_4

I suspect that is a different issue, and likely a target cost issue that
depends on the core: on some cores the separate add is better.
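
[Editor's note: a possible workaround sketch, not from the report. Rewriting
the loop to bump the pointers themselves removes the scaled-index address
(base + index << 3), which is the form that maps naturally onto post-increment
loads. The name dotprod_ptr below is hypothetical, and, per Andrew's point,
whether GCC actually selects post-increment here still depends on the
per-core target cost model.]

    #include <cstddef>

    // Hypothetical pointer-bumping variant of the reported dotprod().
    // There is no index variable, so each iteration only advances the two
    // pointers by one element, the access pattern that corresponds to
    // "ldr dN, [xN], #8" on AArch64.
    double dotprod_ptr(std::size_t n,
                       const double* __restrict__ a,
                       const double* __restrict__ b)
    {
        double ans = 0;
        for (const double* end = a + n; a != end; ++a, ++b) {
            ans += *a * *b;
        }
        return ans;
    }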