https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #15 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro Menezes from comment #14)
> Compiling the test-case above with just -O2, I can reproduce the code I
> mentioned initially and easily measure the cycle count to run it on target
> using perf.
>
> The binary created by GCC runs in about 447000 user cycles and the one
> created by LLVM, in about 499000 user cycles. IOW, fused multiply-add is a
> win on A57.
>
> Looking further why Geekbench's {D,S}GEMM performs worse with GCC than with
> LLVM, both using "-Ofast", GCC fails to vectorize the loop in
> "gemm_block_kernel", while LLVM does.
>
> I should've done a more detailed analysis in this issue before submitting
> this bug, sorry.

Using -Ofast is not any different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows, both loops are vectorized, however LLVM
unrolls twice and uses multiple accumulators while GCC doesn't.

I still don't see what this has to do with A57. You should open a generic bug
about GCC not applying basic loop optimizations with -O3 (in fact limited
unrolling is useful even for -O2).
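
For illustration, unrolling twice with multiple accumulators looks roughly like this when written by hand at the source level. This is only a sketch: it is not the code from gemm_block_kernel and not the output of either compiler. The function names dot_single and dot_unrolled2 are hypothetical, and a plain dot product stands in for the GEMM inner loop.

```c
#include <stddef.h>

/* Baseline: a single accumulator. Each multiply-add depends on the
   result of the previous one, so the loop is bound by the latency
   of that dependency chain. */
double dot_single(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Unrolled by two with two accumulators. The two multiply-adds in
   the loop body are independent of each other, so they can overlap
   in the pipeline. */
double dot_unrolled2(const double *a, const double *b, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }

    /* Leftover element when n is odd. */
    if (i < n)
        acc0 += a[i] * b[i];

    /* Combine the partial sums once, after the loop. */
    return acc0 + acc1;
}
```

Splitting the sum changes the order of the floating-point additions, so a compiler can only do this on its own when reassociation is allowed, e.g. with -ffast-math (which -Ofast enables).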