https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42612

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Dmitry Baksheev from comment #6)
> Please consider fixing this issue. Here is another example where not using
> post-increment for loops produces suboptimal code on AArch64. The code is 4x
> slower than LLVM-generated code for dot-product function:
> 
>     #include <cstddef>
>
>     double dotprod(std::size_t n,
>                    const double* __restrict__ a,
>                    const double* __restrict__ b)
>     {
>         double ans = 0;
>         #if __clang__
>         #pragma clang loop vectorize(assume_safety)
>         #else
>         #pragma GCC ivdep
>         #endif
>         for (std::size_t i = 0; i < n; ++i) {
>             ans += a[i] * b[i];
>         }
>         return ans;
>     }
> 
> 
> Compile with: $(CXX) -march=armv8.2-a -O3 dp.cpp
> 
> The GCC-generated loop does not use post-increment loads:
>     .L3:
>         ldr     d2, [x1, x3, lsl 3]
>         ldr     d1, [x2, x3, lsl 3]
>         add     x3, x3, 1
>         fmadd   d0, d2, d1, d0
>         cmp     x0, x3
>         bne     .L3
> 
> Clang emits this:
>     .LBB0_4:
>         ldr     d1, [x10], #8
>         ldr     d2, [x8], #8
>         subs    x9, x9, #1
>         fmadd   d0, d1, d2, d0
>         b.ne    .LBB0_4

I suspect that is a different issue. And I suspect it is a target cost issue
which really depends on the core, because on some cores the separate add is
better.
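
For illustration, here is a hand-annotated sketch of the two addressing
styles from the listings above (register names are taken from those
listings; this is not compiler output):

        // index-register form (what GCC emits): both loads share the
        // single induction variable x3, but a separate add is needed
        ldr     d2, [x1, x3, lsl 3]     // d2 = a[i]
        add     x3, x3, 1               // i = i + 1

        // post-increment form (what Clang emits): the pointer bump is
        // folded into the load's writeback, so no separate add, at the
        // cost of one live pointer register per array
        ldr     d1, [x10], #8           // d1 = *p; then p = p + 8

Whether the folded writeback is cheaper than the separate add depends on
how a given core implements address writeback, which is why this reads
like a per-core cost-model question.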
