https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72517
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2016-07-26
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that the patch should strictly allow more vectorization - _especially_
for the case we see strided group stores.

For the strided group store case we extract sub-vectors that might be not
natively supported (like V2SI) and then store them.

gcc.dg/vect/slp-45.c contains a parametric testcase that lets you explore
things.  I wonder what the difference is for -fopt-info-vec before/after the
revision and which of the now vectorized loops are in hot areas of the
benchmark.

On x86_64 I see that for example extracting V2QI from V16QI works just fine:

        pextrw  $1, %xmm0, %ecx
        movw    %cx, (%rdx)

but for some reason extracting QImode from V16QI causes us to spill to the
stack (in an awkward way due to char aliasing?):

        movdqa  (%rdi), %xmm0
        movslq  %edx, %rdx
        movaps  %xmm0, 392(%rsp)
        movzbl  392(%rsp), %eax
        movaps  %xmm0, 616(%rsp)
        movaps  %xmm0, 600(%rsp)
        movaps  %xmm0, 584(%rsp)
        movb    %al, (%rsi)
        movzbl  617(%rsp), %eax
        movzbl  587(%rsp), %ecx
        movdqa  16(%rdi), %xmm1
        movb    %al, 1(%rsi)
        ...

(in the slp-45.c testcase this is the foo_char_3 function).  Extracting V4QI
from V16QI also results in awkward code.

I suspect that both the expander and the target are to blame (and the
vectorizer for its coarse cost model with just a single cost for
vector-to-element).  Similar to vec_construct vec_extract is designed for
element extracts, not sub-vector extracts.  Expansion does some tricks with
punning the source vector but only to supported modes.

That said, slp-45.c contains plenty examples that create non-profitable
vectorizations but knowing the case in question for cactusADM would be nice.
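For readers without the testsuite at hand, the loop shape under discussion (a strided group store of three chars per iteration) can be sketched roughly as below. This is only an illustration written for this note, not the actual source of gcc.dg/vect/slp-45.c: the names copy_strided_groups, check_copy_strided_groups, GROUP and ITERS are made up, and the precondition that groups do not overlap is an assumption of the sketch.

```c
#include <assert.h>
#include <string.h>

#define GROUP 3    /* elements stored per iteration (the "group"); hypothetical */
#define ITERS 16   /* number of groups copied; hypothetical */

/* Hypothetical illustration: copy ITERS groups of GROUP chars from a
   contiguous input to an output whose groups start `stride` elements
   apart.  The GROUP stores of one iteration form a group, and since
   the distance between groups is only known at run time, the store is
   a strided group store.  */
void
copy_strided_groups (const char *restrict in, char *restrict out, int stride)
{
  /* Assumption of this sketch: groups in the output do not overlap.  */
  assert (stride >= GROUP);

  for (int i = 0; i < ITERS; i++)
    {
      for (int j = 0; j < GROUP; j++)
        out[j] = in[j];
      in += GROUP;
      out += stride;
    }
}

/* Small self-check with stride 5: returns 1 if every group landed where
   expected and the gaps between groups were left untouched, else 0.  */
int
check_copy_strided_groups (void)
{
  enum { STRIDE = 5 };
  char in[ITERS * GROUP];
  char out[ITERS * STRIDE];

  for (int i = 0; i < ITERS * GROUP; i++)
    in[i] = (char) i;
  memset (out, -1, sizeof out);

  copy_strided_groups (in, out, STRIDE);

  for (int i = 0; i < ITERS; i++)
    for (int j = 0; j < STRIDE; j++)
      {
        char expected = j < GROUP ? (char) (i * GROUP + j) : (char) -1;
        if (out[i * STRIDE + j] != expected)
          return 0;
      }
  return 1;
}
```

Compiling a loop like this with optimization and -fopt-info-vec should show whether the vectorizer picks it up on a given revision. What the generated code looks like depends on the target and compiler version, so the sketch makes no claim about that.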