https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
We now also apply SLP when vectorizing the loop, but as said the high VF is
probably prohibitive and causes quite a lot of spilling:
.L7:
vmovdqu (%r14), %ymm2
vmovdqu 32(%r14), %ymm1
subq $-128, %r14
subq $-128, %rdx
vmovups -128(%rdx), %ymm10
vmovdqu -64(%r14), %ymm0
vpshufb .LC7(%rip), %ymm2, %ymm4
vmovups -96(%rdx), %ymm9
vmovups -64(%rdx), %ymm8
vpshufb .LC8(%rip), %ymm1, %ymm3
vpermq $78, %ymm4, %ymm4
vpermq $78, %ymm3, %ymm3
...
vmulps %ymm7, %ymm0, %ymm0
vaddps 136(%rsp), %ymm0, %ymm7
vaddps %ymm3, %ymm15, %ymm15
vmovaps %ymm4, 168(%rsp)
vmovaps %ymm7, 136(%rsp)
cmpq %r13, %r14
jne .L7
Maybe we should more aggressively reject vectorization when the VF is
equal to the number of vector lanes for the smallest element type. When we
then also detect SLP this usually means BB-level SLP can do something.
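To make the "high VF" point concrete, here is a rough sketch of the arithmetic, assuming a simplified model in which the VF is just the lane count of the smallest element type and wider types need several vectors per iteration. The helper names are hypothetical and are not GCC internals; the real vectorizer also accounts for SLP group sizes and unrolling.

```c
/* Simplified model, not GCC code: all names here are illustrative. */

/* Number of lanes a vector of vector_bits holds for elem_bits elements. */
static unsigned lanes(unsigned vector_bits, unsigned elem_bits)
{
    return vector_bits / elem_bits;
}

/* Assumption: VF equals the lane count of the smallest element type. */
static unsigned simple_vf(unsigned vector_bits, unsigned smallest_elem_bits)
{
    return lanes(vector_bits, smallest_elem_bits);
}

/* Vectors of a wider element type needed to cover one vector iteration. */
static unsigned vectors_per_iteration(unsigned vf, unsigned vector_bits,
                                      unsigned elem_bits)
{
    return vf / lanes(vector_bits, elem_bits);
}
```

Under this model, 256-bit vectors with 8-bit elements give simple_vf(256, 8) == 32, and each 32-bit float operand then needs vectors_per_iteration(32, 256, 32) == 4 registers per iteration, which is one way register pressure adds up.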
Note we fail to support V2SF -> V2QI now, not sure what changed here.
vectorizable_conversion doesn't support float->int->short->char, only
float->char, float->int->char or float->short->char, and at least for
2-element vectors we don't support any of these (the vectorizer could
support extra intermediate steps as well).
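For reference, the conversion chains named above can be written out as scalar C. This is only an illustration of the step sequences, with hypothetical function names; it says nothing about how vectorizable_conversion is implemented, and it assumes inputs whose truncated value fits in signed char, since out-of-range float to integer conversions are undefined in C.

```c
/* Scalar illustration only: function names are hypothetical. */

/* float -> char in one step. */
static signed char cvt_direct(float x)
{
    return (signed char)x;
}

/* float -> int -> char. */
static signed char cvt_via_int(float x)
{
    return (signed char)(int)x;
}

/* float -> short -> char. */
static signed char cvt_via_short(float x)
{
    return (signed char)(short)x;
}

/* float -> int -> short -> char, the chain with two intermediate steps. */
static signed char cvt_via_int_short(float x)
{
    return (signed char)(short)(int)x;
}

/* Two-element case, corresponding in shape to V2SF -> V2QI. */
static void cvt2(signed char *dst, const float *src)
{
    for (int i = 0; i < 2; ++i)
        dst[i] = cvt_direct(src[i]);
}
```

For in-range inputs all four chains produce the same value, so the choice of intermediate steps only affects which vector conversion patterns are needed.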