https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #15 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro Menezes from comment #14)
> Compiling the test-case above with just -O2, I can reproduce the code I
> mentioned initially and easily measure the cycle count to run it on target
> using perf.
> 
> The binary created by GCC runs in about 447000 user cycles and the one
> created by LLVM, in about 499000 user cycles.  IOW, fused multiply-add is a
> win on A57.
> 
> Looking further why Geekbench's {D,S}GEMM performs worse with GCC than with
> LLVM, both using "-Ofast", GCC fails to vectorize the loop in
> "gemm_block_kernel", while LLVM does.
>   
> I should've done a more detailed analysis in this issue before submitting
> this bug, sorry.

Using -Ofast is no different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows, both loops are vectorized; however, LLVM
unrolls twice and uses multiple accumulators, while GCC doesn't.
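
To illustrate what "unrolls twice and uses multiple accumulators" means, here
is a minimal scalar sketch in C. It is not the gemm_block_kernel code, and the
names dot_single and dot_unrolled2 are made up for this example. With a single
accumulator each multiply-add has to wait for the previous one; with two
accumulators there are two independent dependency chains, which is likely where
the difference in cycle count comes from.

```c
#include <stddef.h>

/* Hypothetical example, not taken from Geekbench.
   Single accumulator: each multiply-add depends on the previous
   iteration's result, so the loop forms one serial dependency chain. */
double dot_single(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled twice with two accumulators: sum0 and sum1 form two
   independent chains that the core can overlap. */
double dot_unrolled2(const double *a, const double *b, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    /* Handle the leftover element when n is odd. */
    if (i < n)
        sum0 += a[i] * b[i];

    /* Combining the partial sums changes the order of the additions. */
    return sum0 + sum1;
}
```

Splitting the sum reorders floating-point additions, so a compiler can only do
this transformation on its own when reassociation is allowed, e.g. with
-ffast-math (which -Ofast implies).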

I still don't see what this has to do with A57. You should open a generic bug
about GCC not applying basic loop optimizations with -O3 (in fact, limited
unrolling is useful even for -O2).
