https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
We now also SLP-vectorize the loop, but, as said, the high VF is probably
prohibitive and causes considerable spilling:

.L7:
        vmovdqu (%r14), %ymm2
        vmovdqu 32(%r14), %ymm1
        subq    $-128, %r14
        subq    $-128, %rdx
        vmovups -128(%rdx), %ymm10
        vmovdqu -64(%r14), %ymm0
        vpshufb .LC7(%rip), %ymm2, %ymm4
        vmovups -96(%rdx), %ymm9
        vmovups -64(%rdx), %ymm8
        vpshufb .LC8(%rip), %ymm1, %ymm3
        vpermq  $78, %ymm4, %ymm4
        vpermq  $78, %ymm3, %ymm3
...
        vmulps  %ymm7, %ymm0, %ymm0
        vaddps  136(%rsp), %ymm0, %ymm7
        vaddps  %ymm3, %ymm15, %ymm15
        vmovaps %ymm4, 168(%rsp)
        vmovaps %ymm7, 136(%rsp)
        cmpq    %r13, %r14
        jne     .L7

Maybe we should reject vectorization more aggressively when the VF is
equal to the number of vector lanes for the smallest element type.  If we
then also detect SLP, this usually means BB-level SLP can do something
useful instead.
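To make the "high VF" point concrete, here is a rough arithmetic sketch. It uses a simplified model that is an assumption, not GCC's cost model: the VF is set by the lane count of the narrowest element type, so every wider type needs VF / lanes vectors per vector iteration. The 32-byte vector size (AVX2 ymm, as in the listing) and the float/char element types (suggested by the V2SF -> V2QI case) are also assumptions; the function names are hypothetical.

```python
# Simplified model (assumption, not GCC's cost model): the VF equals the lane
# count of the narrowest element type; wider types need several vectors per
# vector iteration to cover the same number of scalar iterations.

def lanes(vector_bytes, elem_bytes):
    """Number of elements of size elem_bytes that fit in one vector."""
    return vector_bytes // elem_bytes

def vectors_per_iteration(vf, vector_bytes, elem_bytes):
    """Vectors of this element type needed per vector iteration."""
    return vf // lanes(vector_bytes, elem_bytes)

VECTOR_BYTES = 32                # assumption: 256-bit AVX2 ymm registers
ELEMS = {"float": 4, "char": 1}  # assumption: element types of the loop

vf = lanes(VECTOR_BYTES, min(ELEMS.values()))
print("VF =", vf)
for name, size in ELEMS.items():
    print(f"{name}: {vectors_per_iteration(vf, VECTOR_BYTES, size)} vector(s) per iteration")
```

Under these assumptions each float value stream occupies four ymm registers per vector iteration, and AVX2 on x86-64 provides only 16 ymm registers, which is consistent with the stack spills (136(%rsp), 168(%rsp)) visible in the listing.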
Note that we now fail to support V2SF -> V2QI; I'm not sure what changed
here.  vectorizable_conversion doesn't support float->int->short->char,
only float->char, float->int->char or float->short->char, and at least
for 2-element vectors we don't support even these (the vectorizer could
support extra intermediate steps as well).
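The conversion paths listed above can be tabulated for 2-lane vectors using GCC's mode naming (SF = float, SI = 32-bit int, HI = 16-bit, QI = 8-bit). This is a hypothetical illustration: the "at most one intermediate type" check only restates the comment and is not code taken from vectorizable_conversion, and the helper names are made up.

```python
# Hypothetical illustration of the float -> char conversion chains discussed
# above, using GCC-style vector mode names. The one-intermediate rule restates
# the comment; it is not the actual logic of vectorizable_conversion.
MODES = {"float": ("SF", 4), "int": ("SI", 4), "short": ("HI", 2), "char": ("QI", 1)}

def chain_modes(chain, lanes):
    """(mode name, vector size in bytes) for each type in a conversion chain."""
    return [(f"V{lanes}{MODES[t][0]}", lanes * MODES[t][1]) for t in chain]

def has_at_most_one_intermediate(chain):
    """True if the chain has at most one type between source and destination."""
    return len(chain) - 2 <= 1

CHAINS = [
    ("float", "char"),
    ("float", "int", "char"),
    ("float", "short", "char"),
    ("float", "int", "short", "char"),
]

for chain in CHAINS:
    steps = " -> ".join(f"{m} ({b} bytes)" for m, b in chain_modes(chain, 2))
    print(steps, "| at most one intermediate:", has_at_most_one_intermediate(chain))
```

For two lanes the narrowing steps end in 4-byte V2HI and 2-byte V2QI vectors, far below the 32-byte ymm width used in the listing.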
