https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #14 from Evandro Menezes <e.menezes at samsung dot com> ---
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on target
using perf.

The binary created by GCC runs in about 447000 user cycles and the one created
by LLVM in about 499000 user cycles, i.e. the GCC binary needs roughly 10%
fewer cycles.  In other words, fused multiply-add is a win on A57.
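
For illustration only, here is a minimal sketch of the kind of loop where
this choice shows up.  It is not the test-case attached to this bug, and the
function name `dot` is hypothetical.  When floating-point contraction is
allowed (GCC's default outside strict ISO mode is -ffp-contract=fast), the
multiply and add in the loop body may be emitted as a single fused
multiply-add; with -ffp-contract=off they stay separate instructions.

```c
#include <stddef.h>

/* Hypothetical example, not the test-case from this bug.
   Each iteration computes acc + a[i] * b[i], which the compiler may
   contract into one fused multiply-add when contraction is allowed. */
double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```

Comparing the assembly from "gcc -O2 -S" against "gcc -O2 -ffp-contract=off
-S" should show whether the fused form was chosen, and something like
"perf stat -e cycles:u ./a.out" gives the user cycle count for each build.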

Looking further into why Geekbench's {D,S}GEMM performs worse with GCC than
with LLVM, both using "-Ofast", I found that GCC fails to vectorize the loop
in "gemm_block_kernel", while LLVM does.
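
As a rough sketch of what such a kernel can look like, assuming a simple
row-major n x n block update: this is not Geekbench's actual
"gemm_block_kernel", and the name `gemm_block` and its signature are
hypothetical.  The innermost loop is unit-stride and is the natural
candidate for vectorization.

```c
#include <stddef.h>

/* Hypothetical blocked-GEMM inner kernel: C += A * B on an n x n block,
   row-major layout.  Not Geekbench's actual gemm_block_kernel. */
void gemm_block(const float *restrict a, const float *restrict b,
                float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            float aik = a[i * n + k];
            /* Unit-stride over j: the loop a vectorizer would target. */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

To see which loops each compiler vectorizes, GCC can report missed loops with
-fopt-info-vec-missed and Clang with -Rpass-missed=loop-vectorize.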

I should've done a more detailed analysis of this issue before submitting this
bug, sorry.
