https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99638
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
.L4:
        vmovups b(%rax), %ymm0
        addq    $32, %rax
        vfmadd213ps     aa+988(%rax), %ymm1, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

vs.

.L4:
        vmulps  b(%rax), %ymm2, %ymm0
        addq    $32, %rax
        vaddps  aa+988(%rax), %ymm0, %ymm0
        vmovups %ymm0, aa-32(%rax)
        cmpq    $996, %rax
        jne     .L4

I'm not sure we can explain the difference, can we?  On Zen2 -mfma doesn't
make a difference btw. (but Zen3 should have FMA with one cycle less latency
even...)

The 2nd testcase has one more load uop in the loop.  Both Zen2 and Zen3
should be able to issue two load uops per cycle.

The 2nd testcase is not vectorized, on Zen2 -mno-fma vs. -mfma is in the
noise (-mfma looks slightly faster).
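
For readers who want to reproduce the two loop bodies quoted above, here is a minimal sketch of a loop with the same shape: an element of one array plus an element of `b` times a loop-invariant scalar, stored back. It is not the testcase from this bug. The function name `mul_add_kernel`, its parameters, and the use of separate pointers are assumptions made for illustration. The real code operates on the globals `aa` and `b`.

```c
#include <stddef.h>

/* Hypothetical stand-in for the loop in the listings above (not the
 * testcase from the bug report).  Each iteration loads src[i] and b[i],
 * multiplies b[i] by the loop-invariant scalar s, adds, and stores. */
void mul_add_kernel(float *restrict dst, const float *restrict src,
                    const float *restrict b, float s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* With FMA enabled and floating-point contraction allowed, the
         * compiler may fuse this into a single vfmadd instruction;
         * otherwise it stays a separate multiply and add. */
        dst[i] = src[i] + b[i] * s;
    }
}
```

Compiling this with something like `gcc -O3 -mavx2 -mfma -S` versus `gcc -O3 -mavx2 -mno-fma -S` should show whether the fused or the separate multiply/add form is emitted. The exact instruction selection and addressing will depend on the GCC version and tuning flags, so the output need not match the listings above.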