https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
.L4:
        vmovups b(%rax), %ymm0
        addq    $32, %rax
        vfmadd213ps     aa+988(%rax), %ymm1, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

vs.

.L4:
        vmulps  b(%rax), %ymm2, %ymm0
        addq    $32, %rax
        vaddps  aa+988(%rax), %ymm0, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

I'm not sure we can explain the difference, can we?
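
For reference, a loop of roughly the following shape would produce the two
variants above. This is only a sketch: the function name, the parameter names
and the use of restrict are assumptions made for illustration, not the actual
testcase from the report. Built with something like "gcc -O3 -mavx2 -mfma",
GCC may contract the multiply and add into a vfmadd; with -mno-fma it has to
emit a separate vmulps and vaddps.

```c
#include <stddef.h>

/*
 * Hypothetical kernel: dst[i] = b[i] * c + src[i].
 * The names and the restrict qualifiers are assumptions for illustration;
 * the loops quoted above read and write the same global array aa instead.
 */
void scale_add(float *restrict dst, const float *restrict src,
               const float *restrict b, float c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* One multiply and one add per element; whether they are fused
           into a single FMA is left to the compiler and its flags. */
        dst[i] = b[i] * c + src[i];
    }
}
```

Comparing the output of -S for both flag sets should show whether the inner
loop differs only in the fused vs. separate multiply/add.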

On Zen2, -mfma doesn't make a difference, btw (though Zen3 should even have
FMA with one cycle less latency...).

The 2nd testcase has one more load uop in the loop.  Both Zen2 and Zen3
should be able to issue two load uops per cycle.  The 2nd testcase is not
vectorized; on Zen2, -mno-fma vs. -mfma is in the noise (-mfma looks slightly
faster).
