[Bug middle-end/99638] s132 and s281 benchmarks of TSVC on zen3 benefits from -mno-fma

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 18 Mar 2021 02:17:16 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638


--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
.L4:
        vmovups b(%rax), %ymm0
        addq    $32, %rax
        vfmadd213ps     aa+988(%rax), %ymm1, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

vs.

.L4:
        vmulps  b(%rax), %ymm2, %ymm0
        addq    $32, %rax
        vaddps  aa+988(%rax), %ymm0, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

I'm not sure we can explain the difference, can we?
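
For reference, here is a minimal C sketch of the kind of loop that produces the two listings above. It assumes the usual TSVC definition of s132 (LEN_2D = 256, j = 0, k = 1), which is not quoted in this comment; the function name and the c1 parameter (standing in for the loop-invariant c[1]) are illustrative only. Whether the multiply and add are contracted into vfmadd213ps depends on -mfma and -ffp-contract, not on the source.

```c
/* Sketch of the TSVC s132 kernel, assuming the usual definition
   (LEN_2D = 256); names and the c1 parameter are illustrative. */
#define LEN_2D 256

float aa[LEN_2D][LEN_2D];
float b[LEN_2D];

/* Row j is computed from row k shifted by one element.  Each iteration
   loads b[i] and aa[k][i-1] and stores aa[j][i], matching the two loads
   and one store per vector in the listings. */
void s132(int j, int k, float c1)
{
    for (int i = 1; i < LEN_2D; i++)
        aa[j][i] = aa[k][i - 1] + b[i] * c1;
}
```

With j = 0 and k = 1 the load from aa[1][i-1] sits 1020 bytes past the store to aa[0][i], which is consistent with the aa+988(%rax) and aa-32(%rax) operands once the addq $32 is accounted for.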

On Zen2, -mfma makes no difference, by the way (and Zen3 should even have
FMA with one cycle less latency...).

The 2nd testcase (s281) has one more load uop in the loop.  Both Zen2 and
Zen3 should be able to issue two load uops per cycle.  The 2nd testcase is
not vectorized; on Zen2 the difference between -mno-fma and -mfma is in the
noise (-mfma looks slightly faster).
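
A sketch of the second loop, again assuming the usual TSVC definition of s281 (not quoted in this comment); the pointer/length signature is a hypothetical wrapper added so the snippet stands alone. It shows the three loads per iteration (a, b, c) against two in s132, and the crossing read/write on a[] that is the likely reason the loop stays scalar.

```c
/* Sketch of the TSVC s281 kernel, assuming the usual definition;
   the signature is illustrative, TSVC itself uses global arrays. */
void s281(float *a, float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++) {
        /* Reads a[] from the far end while writing it from the near end,
           so the second half of the loop reads values written earlier. */
        float x = a[n - i - 1] + b[i] * c[i];
        a[i] = x - 1.0f;
        b[i] = x;
    }
}
```

The x = a[...] + b[i] * c[i] statement is the only FMA candidate here, so any -mfma vs. -mno-fma difference on this loop comes from that single scalar operation.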
