https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-checking today we reject AVX vectorization via the costmodel but do
SSE vectorization. With versioning for alias we could also SLP vectorize this,
keeping the loop body smaller and avoiding an epilogue. Esp. since we're
ending up without any vector load or store anyway.
Of course SLP analysis requires a grouped store, which we do not have since
we do not identify XPQKL(MPQ,MKL) and XPQKL(MRS,MKL) as such (they are not a
group when MPQ == MRS, but the runtime alias check ensures that's not the case).
That is, we miss "strided group" detection or, more generally, SLP forming via
a different mechanism.
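To make the missing group concrete, here is a rough C sketch. It is not the
testcase: the names update_plain/update_versioned, the column-major layout with
leading dimension ld, and the simplified right-hand sides are all made up for
illustration. The two stores hit rows mpq and mrs of the same column, and only
under the runtime check mpq != mrs may both loads be moved ahead of both
stores, which is roughly what treating them as one store group would need.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two XPQKL updates: x is column-major with
   leading dimension ld, one column per iteration.  The stores stay in
   source order, so this is correct even when mpq == mrs.  */
void
update_plain (double *x, size_t ld, size_t ncols, size_t mpq, size_t mrs,
              const double *c1, const double *c2, double v1, double v3)
{
  for (size_t k = 0; k < ncols; ++k)
    {
      x[mpq + k * ld] += v1 * c1[k];
      x[mrs + k * ld] += v3 * c2[k];
    }
}

/* Hand-versioned variant (assumption: this mimics what versioning for
   alias would permit).  Under mpq != mrs both loads are done before both
   stores, so the pair can be handled as one two-element group.  */
void
update_versioned (double *x, size_t ld, size_t ncols, size_t mpq, size_t mrs,
                  const double *c1, const double *c2, double v1, double v3)
{
  if (mpq != mrs)
    for (size_t k = 0; k < ncols; ++k)
      {
        double *col = x + k * ld;
        double a = col[mpq];
        double b = col[mrs];
        col[mpq] = a + v1 * c1[k];
        col[mrs] = b + v3 * c2[k];
      }
  else
    /* Rows coincide: fall back to the ordered scalar form.  */
    update_plain (x, ld, ncols, mpq, mrs, c1, c2, v1, v3);
}
```

Both variants give the same result for any mpq/mrs; the difference is only in
which reorderings the first branch makes legal.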
That said, I have a hard time thinking of a heuristic aligning with reality
(it's of course possible to come up with a hack).
Generally we'd need to work towards doing the versioning / cost model checks
on outer loops but the better versioning condition thing would be a
prerequisite for this.
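As a sketch of why the placement of the check matters with a low inner trip
count, consider the following C. Everything in it is hypothetical: the
triangular nest, the names, and the assumption that the versioning condition
(r0 != r1) is invariant in the outer loop. The counter exists only to show how
often the check runs in each placement.

```c
#include <stddef.h>

/* Illustration only: counts evaluations of the runtime check.  */
unsigned long checks;

static int
rows_distinct (size_t r0, size_t r1)
{
  ++checks;
  return r0 != r1;
}

/* Inner loop with loads hoisted above stores; valid only if r0 != r1.  */
static void
inner_fast (double *x, size_t ld, size_t col, size_t n,
            size_t r0, size_t r1, double v0, double v1)
{
  for (size_t i = 0; i < n; ++i)
    {
      double *c = x + (col + i) * ld;
      double a = c[r0], b = c[r1];
      c[r0] = a + v0;
      c[r1] = b + v1;
    }
}

/* Inner loop keeping the stores in source order; always valid.  */
static void
inner_safe (double *x, size_t ld, size_t col, size_t n,
            size_t r0, size_t r1, double v0, double v1)
{
  for (size_t i = 0; i < n; ++i)
    {
      double *c = x + (col + i) * ld;
      c[r0] += v0;
      c[r1] += v1;
    }
}

/* Check attached to the inner loop: evaluated once per outer iteration,
   even though the inner trip count is only mk + 1.  */
void
nest_check_inner (double *x, size_t ld, size_t noc,
                  size_t r0, size_t r1, double v0, double v1)
{
  size_t col = 0;
  for (size_t mk = 0; mk < noc; ++mk)
    {
      size_t n = mk + 1;
      if (rows_distinct (r0, r1))
        inner_fast (x, ld, col, n, r0, r1, v0, v1);
      else
        inner_safe (x, ld, col, n, r0, r1, v0, v1);
      col += n;
    }
}

/* Check hoisted in front of the whole nest: evaluated once.  */
void
nest_check_outer (double *x, size_t ld, size_t noc,
                  size_t r0, size_t r1, double v0, double v1)
{
  void (*inner) (double *, size_t, size_t, size_t, size_t, size_t,
                 double, double)
    = rows_distinct (r0, r1) ? inner_fast : inner_safe;
  size_t col = 0;
  for (size_t mk = 0; mk < noc; ++mk)
    {
      size_t n = mk + 1;
      inner (x, ld, col, n, r0, r1, v0, v1);
      col += n;
    }
}
```

With noc outer iterations the first form pays for the check noc times, the
second once, and both compute the same values.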
I'm out of ideas suitable for GCC 9 (besides reverting the patch, reverting
to bogus state).
Scalar inner loop assembly:
.L8:
vmulsd (%rax,%rdi,8), %xmm3, %xmm0
incl %ecx
vfmadd231sd (%rax), %xmm4, %xmm0
vfmadd213sd (%rdx), %xmm6, %xmm0
vmovsd %xmm0, (%rdx)
vmulsd (%rax,%r8,8), %xmm1, %xmm0
vfmadd231sd (%rax,%r10,8), %xmm2, %xmm0
addq %r15, %rax
vfmadd213sd (%rdx,%rsi,8), %xmm5, %xmm0
vmovsd %xmm0, (%rdx,%rsi,8)
addq %rbp, %rdx
cmpl %r9d, %ecx
jne .L8
Vectorized inner loop assembly:
.L9:
vmovsd (%r10,%rcx), %xmm13
vmovsd (%rdx), %xmm0
incl %r14d
vmovhpd (%r10,%rsi), %xmm13, %xmm13
vmovhpd (%rdx,%r13), %xmm0, %xmm14
vmovsd (%rdi,%rcx), %xmm0
vmulpd %xmm9, %xmm13, %xmm13
vmovhpd (%rdi,%rsi), %xmm0, %xmm0
vfmadd132pd %xmm10, %xmm13, %xmm0
vfmadd132pd %xmm12, %xmm14, %xmm0
vmovlpd %xmm0, (%rdx)
vmovhpd %xmm0, (%rdx,%r13)
vmovsd (%r8,%rcx), %xmm13
vmovsd (%rax), %xmm0
addq %r11, %rdx
vmovhpd (%r8,%rsi), %xmm13, %xmm13
vmovhpd (%rax,%r13), %xmm0, %xmm14
vmovsd (%r9,%rcx), %xmm0
addq %rbx, %rcx
vmulpd %xmm7, %xmm13, %xmm13
vmovhpd (%r9,%rsi), %xmm0, %xmm0
addq %rbx, %rsi
vfmadd132pd %xmm8, %xmm13, %xmm0
vfmadd132pd %xmm11, %xmm14, %xmm0
vmovlpd %xmm0, (%rax)
vmovhpd %xmm0, (%rax,%r13)
addq %r11, %rax
cmpl %r14d, %r15d
jne .L9
Only the outer loop context and knowledge of the low trip count make this bad.
The cost model doesn't know that the scalar loop can execute as if it were
vectorized given the CPU's plentiful resources (speculating
non-dependence), whereas the vector variant introduces more constraints
on pipelining due to data dependences from using vectors. But
even IACA doesn't tell us the differences are big.