https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                      |Added
----------------------------------------------------------------------------
             Status|NEW                          |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
While strided stores are now implemented, this case is still not handled
because single-element interleaving takes precedence (and single-element
interleaving isn't supported for stores, as that always produces gaps).
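
For reference, a loop of roughly the following shape would match the assembly
and dump output in this comment (stride-2 int accesses, a `+ 7`, and a store
back to the same location). This is only a sketch inferred from that output;
the function name and signature are hypothetical and the PR's actual testcase
may differ.

```c
#include <stddef.h>

/* Hypothetical loop shape: only every second element is read and written,
   so both the load and the store are strided (stride 2) and the elements
   in between form the "gaps" mentioned above.  */
void add7_stride2(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[2 * i] += 7;
}
```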
I have a patch that produces
.L2:
        movdqu  16(%rax), %xmm1
        addq    $32, %rax
        movdqu  -32(%rax), %xmm0
        shufps  $136, %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)
        cmpq    %rdx, %rax
        jne     .L2
when you disable the cost model (e.g. with -fno-vect-cost-model). Otherwise
it's deemed not profitable. Using scatters for AVX could in theory make it
profitable (not sure).
t.c:5:3: note: Cost model analysis:
Vector inside of loop cost: 13
Vector prologue cost: 1
Vector epilogue cost: 12
Scalar iteration cost: 3
Scalar outside cost: 0
Vector outside cost: 13
prologue iterations: 0
epilogue iterations: 4
t.c:5:3: note: cost model: the vector iteration cost = 13 divided by the scalar
iteration cost = 3 is greater or equal to the vectorization factor = 4.
t.c:5:3: note: not vectorized: vectorization not profitable.
t.c:5:3: note: not vectorized: vector version will never be profitable.
t.c:5:3: note: ==> examining statement: *_8 = _10;
t.c:5:3: note: vect_is_simple_use: operand _10
t.c:5:3: note: def_stmt: _10 = _9 + 7;
t.c:5:3: note: type of def: internal
t.c:5:3: note: vect_model_store_cost: inside_cost = 8, prologue_cost = 0 .
So the strided store has cost 8: 4 extracts plus 4 scalar stores.
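
The arithmetic behind the dump can be sketched as follows. This is a
simplified model, not the vectorizer's actual code: the helper names are
hypothetical, the unit cost of 1 per extract and per scalar store is an
assumption that merely reproduces the reported total of 8, and the
profitability test is the comparison the dump message describes (vector
iteration cost versus scalar iteration cost times the vectorization factor).

```c
#include <stdbool.h>

/* Assumed model: each of the nelems lanes needs one extract and one
   scalar store.  */
static int strided_store_cost(int nelems, int extract_cost, int store_cost)
{
    return nelems * (extract_cost + store_cost);
}

/* Simplified check: one vector iteration replaces vf scalar iterations,
   so it can only pay off if it is cheaper than vf scalar iterations.  */
static bool vector_loop_can_pay_off(int scalar_iter_cost, int vf,
                                    int vec_inside_cost)
{
    return scalar_iter_cost * vf > vec_inside_cost;
}
```

With the numbers from the dump, strided_store_cost(4, 1, 1) gives 8, and
vector_loop_can_pay_off(3, 4, 13) is false because 3 * 4 = 12 is not greater
than 13, which is why the loop is rejected as never profitable.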
With AVX we generate
        vmovd   %xmm0, -32(%rax)
        vpextrd $1, %xmm0, -24(%rax)
        vpextrd $2, %xmm0, -16(%rax)
        vpextrd $3, %xmm0, -8(%rax)
so it can combine the extract and the store. With SSE2 we get
        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)
which is even worse than expected ;)
As usual, the cost model isn't target-aware enough here (and it errs on the
conservative side).