https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
Bug ID: 63503
Summary: [AArch64] A57 executes fused multiply-add poorly in some situations
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: e.menezes at samsung dot com
CC: spop at gcc dot gnu.org
Target: aarch64-*

Curious why Geekbench's {D,S}GEMM by GCC were 8-9% slower than by LLVM, I was baffled to find that the code emitted by GCC for the innermost loop in the algorithm core is actually very good:

    .L8:
        ldr     d2, [x8, w5, uxtw 3]
        ldr     d1, [x7, w5, uxtw 3]
        add     w5, w5, 1
        cmp     w5, w6
        fmadd   d0, d2, d1, d0
        bne     .L8

LLVM's code is not so neat:

    .LBB0_10:
        ldr     d1, [x27, x22, lsl #3]
        ldr     d2, [x9, x22, lsl #3]
        fmul    d1, d1, d2
        fadd    d0, d0, d1
        add     w21, w21, #1
        add     x22, x22, #1
        cmp     w21, w24, uxtw
        b.ne    .LBB0_10

However, it runs faster. Methinks that the A57 microarchitecture is performing tricks for discrete FP operations but not for fused multiply-add, since both code sequences are semantically the same.

Whatever it is, it seems that fused multiply-add, and perhaps its cousins, is actually a performance hit only when one depends on the results of a previous one, as in this case on the results of the fused operation in the previous loop iteration.

I'll try to create a simple test-case, but, in the meantime, please chime in about your thoughts.
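Until a proper test-case is attached, a rough sketch of the kind of kernel in question may help the discussion. This is not the Geekbench source; the function name `dot` and its signature are assumptions made purely for illustration. It is a plain dot-product reduction in which each iteration's multiply-add depends on the accumulator from the previous iteration, the same loop-carried dependence as in the inner loop described in the report.

```c
#include <stddef.h>

/*
 * Hypothetical reduction kernel (not taken from Geekbench).
 * The accumulator `acc` carries a dependence from one iteration
 * to the next, so each multiply-add must wait for the previous one.
 */
double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;

    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];   /* candidate for contraction into fmadd */

    return acc;
}
```

With GCC's default `-ffp-contract=fast` (GNU mode), the multiply and add in the loop body may be contracted into a single `fmadd` on AArch64; building the same file with `-ffp-contract=off` should keep them as separate `fmul` and `fadd`. Timing the two builds over a large `n` on an A57 would be one way to check whether the dependent fused form is really the slower one. Note that the two builds are not guaranteed to be bit-identical in general, since the fused form rounds once and the discrete form rounds twice.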