https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
     Ever confirmed|0                             |1
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
   Last reconfirmed|                              |2021-07-02
             Status|UNCONFIRMED                   |ASSIGNED

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I will have a look next week.  A quick look shows FMAs being used and addsub
can break FMA detection until we get general optab support for fmaddsub and
friends.  So it might be { fma, fms } + blend compared to addsub + mul where
the former maybe has lower latency though Agner says
FMA (5c) + blend (1c) vs ADDSUB (3c) + MUL (3c).

As said, I have to look into this in more detail.

double a[4], b[4], c[4];

void foo ()
{
  c[0] = a[0] - b[0] * c[0];
  c[1] = a[1] + b[1] * c[1];
  c[2] = a[2] - b[2] * c[2];
  c[3] = a[3] + b[3] * c[3];
}

        vmovapd a(%rip), %ymm2
        vmovapd b(%rip), %ymm1
        vmovapd b(%rip), %ymm0
        vfmadd132pd     c(%rip), %ymm2, %ymm1
        vfnmadd132pd    c(%rip), %ymm2, %ymm0
        vshufpd $10, %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, c(%rip)

vs.

        vmovapd b(%rip), %ymm1
        vmovapd a(%rip), %ymm2
        vmulpd  c(%rip), %ymm1, %ymm0
        vaddsubpd       %ymm0, %ymm2, %ymm0
        vmovapd %ymm0, c(%rip)
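
For readers who want to check that the two instruction sequences above compute the same thing lane by lane, here is a rough scalar model in C. It is only a sketch, not GCC output. The helper names (ref_foo, fma_blend, mul_addsub, lowerings_agree) are made up for illustration. It models vshufpd $10 as taking even lanes from the vfnmadd132pd result and odd lanes from the vfmadd132pd result, and vaddsubpd as subtracting in even lanes and adding in odd lanes. It ignores rounding: a real FMA rounds once where mul + addsub rounds twice, so the inputs are small integers for which every intermediate is exact.

```c
#include <stddef.h>

#define LANES 4

/* Reference: the four scalar statements from foo(), on local arrays. */
static void ref_foo(const double a[LANES], const double b[LANES],
                    double c[LANES])
{
    for (size_t i = 0; i < LANES; i++)
        c[i] = (i % 2 == 0) ? a[i] - b[i] * c[i]
                            : a[i] + b[i] * c[i];
}

/* Model of the first sequence: vfmadd132pd + vfnmadd132pd + vshufpd $10. */
static void fma_blend(const double a[LANES], const double b[LANES],
                      double c[LANES])
{
    double fmadd[LANES], fnmadd[LANES];

    for (size_t i = 0; i < LANES; i++) {
        fmadd[i]  =   b[i] * c[i]  + a[i];   /* ymm1 =  (b * c) + a */
        fnmadd[i] = -(b[i] * c[i]) + a[i];   /* ymm0 = -(b * c) + a */
    }
    /* Immediate 10 = 0b1010: even lanes from ymm0, odd lanes from ymm1. */
    for (size_t i = 0; i < LANES; i++)
        c[i] = (i % 2 == 0) ? fnmadd[i] : fmadd[i];
}

/* Model of the second sequence: vmulpd + vaddsubpd. */
static void mul_addsub(const double a[LANES], const double b[LANES],
                       double c[LANES])
{
    double prod[LANES];

    for (size_t i = 0; i < LANES; i++)
        prod[i] = b[i] * c[i];               /* ymm0 = b * c */
    /* vaddsubpd: subtract in even lanes, add in odd lanes. */
    for (size_t i = 0; i < LANES; i++)
        c[i] = (i % 2 == 0) ? a[i] - prod[i] : a[i] + prod[i];
}

/* Returns 1 if both models match the reference on exact test inputs. */
int lowerings_agree(void)
{
    const double a[LANES] = { 1.0, 2.0, 3.0, 4.0 };
    const double b[LANES] = { 5.0, 6.0, 7.0, 8.0 };
    double c_ref[LANES] = { 9.0, 10.0, 11.0, 12.0 };
    double c_fma[LANES] = { 9.0, 10.0, 11.0, 12.0 };
    double c_mul[LANES] = { 9.0, 10.0, 11.0, 12.0 };

    ref_foo(a, b, c_ref);
    fma_blend(a, b, c_fma);
    mul_addsub(a, b, c_mul);

    for (size_t i = 0; i < LANES; i++)
        if (c_ref[i] != c_fma[i] || c_ref[i] != c_mul[i])
            return 0;
    return 1;
}
```

Calling lowerings_agree() from a small main() should return 1 for these inputs. The model says nothing about the latency question raised in the comment; that still has to be measured or read from instruction tables.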