https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What           |Removed                      |Added
----------------------------------------------------------------------------
       Component          |middle-end                   |tree-optimization
        Assignee          |unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
          Blocks          |                             |53947
        Keywords          |                             |missed-optimization
  Ever confirmed          |0                            |1
      Depends on          |                             |97832
          Status          |UNCONFIRMED                  |ASSIGNED
Last reconfirmed          |                             |2021-03-08
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
With -fno-tree-reassoc we detect the reduction chain and produce:
.L3:
vmovaps b(%rax), %ymm5
vmovaps b+32(%rax), %ymm6
addq $160, %rax
vfmadd231ps a-160(%rax), %ymm5, %ymm1
vmovaps b-96(%rax), %ymm7
vfmadd231ps a-128(%rax), %ymm6, %ymm0
vmovaps b-64(%rax), %ymm5
vmovaps b-32(%rax), %ymm6
vfmadd231ps a-96(%rax), %ymm7, %ymm2
vfmadd231ps a-64(%rax), %ymm5, %ymm3
vfmadd231ps a-32(%rax), %ymm6, %ymm4
cmpq $128000, %rax
jne .L3
vaddps %ymm1, %ymm0, %ymm0
vaddps %ymm2, %ymm0, %ymm0
vaddps %ymm3, %ymm0, %ymm0
vaddps %ymm4, %ymm0, %ymm0
vextractf128 $0x1, %ymm0, %xmm1
vaddps %xmm0, %xmm1, %xmm1
vmovhlps %xmm1, %xmm1, %xmm0
vaddps %xmm1, %xmm0, %xmm0
vshufps $85, %xmm0, %xmm0, %xmm1
vaddps %xmm0, %xmm1, %xmm0
decl %edx
jne .L2
we're not re-rolling the chain and thus are forced to use a VF of 4 here.
Note that LLVM doesn't seem to vectorize the loop but instead vectorizes
the basic block, which isn't what TSVC looks for (but that would work for
non-fast-math).
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
[Bug 97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3