https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

            Bug ID: 63503
           Summary: [AArch64] A57 executes fused multiply-add poorly in
                    some situations
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com
                CC: spop at gcc dot gnu.org
            Target: aarch64-*

Curious why Geekbench's {D,S}GEMM built with GCC was 8-9% slower than when
built with LLVM, I was baffled to find that the code GCC emits for the
innermost loop of the algorithm core is actually very good:

.L8:
    ldr d2, [x8, w5, uxtw 3]
    ldr d1, [x7, w5, uxtw 3]
    add w5, w5, 1
    cmp w5, w6
    fmadd   d0, d2, d1, d0
    bne .L8

LLVM's code is not so neat:

.LBB0_10:
    ldr d1, [x27, x22, lsl #3]
    ldr d2, [x9, x22, lsl #3]
    fmul    d1, d1, d2
    fadd    d0, d0, d1
    add w21, w21, #1
    add x22, x22, #1
    cmp w21, w24, uxtw
    b.ne .LBB0_10

However, it runs faster.

Methinks that the A57 microarchitecture is performing tricks for discrete FP
operations but not for fused multiply-add, since both code sequences compute
the same reduction (the single rounding of the fused form aside).  Whatever it
is, it seems that fused multiply-add, and perhaps its cousins, is a performance
hit only when it depends on the result of a previous one, as in this case,
where each fmadd accumulates into the d0 written by the fmadd of the previous
loop iteration.
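
For reference, a minimal sketch of what a reduced test-case might look like,
assuming the hot loop is a plain dot-product reduction.  The function name
dot and its signature are made up for illustration; this is not the Geekbench
source.

```c
#include <stddef.h>

/* Hypothetical reduction kernel (name and signature are illustrative only).
 * The accumulator acc is carried from one iteration to the next, so every
 * multiply-add needs the result of the previous one, which is the
 * dependency pattern described above. */
double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];   /* candidate for contraction into fmadd */
    return acc;
}
```

Building this for aarch64 once with -O2 and once with -O2 -ffp-contract=off
should give the fmadd form and the fmul/fadd form of the loop respectively,
so the two could be timed against each other on an A57.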

I'll try to create a simple test-case, but, in the meantime, please chime in
with your thoughts.
