https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112384

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|aarch64                     |aarch64, x86_64-*-*
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2023-11-06

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  Note for f2 the target needs to support .VEC_EXTRACT with variable
index.

OTOH we fail to transform

  i_4 = VIEW_CONVERT_EXPR<int[4]>(t)[i_2];
  tt_5 = {i_4, i_4, i_4, i_4};

into

  tt_3 = {i_2, i_2, i_2, i_2};
  r_6 = VEC_PERM_EXPR <t_4(D), t_4(D), tt_3>;

but the complication is that 't' isn't in SSA form (which is also why
it goes through memory here).
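
For reference, a minimal sketch of what the two functions might look like
at the source level.  This is an assumption reconstructed from the GIMPLE
and the assembly in this comment, not the testcase from the PR; the typedef
name v4si and the function bodies are hypothetical.  It uses GCC's generic
vector extensions and __builtin_shuffle:

```c
/* Hypothetical reconstruction, not the PR's actual testcase.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Broadcast lane (i & 3) of t using a variable permute; this is the
   form that becomes a VEC_PERM_EXPR with a splatted index vector.  */
v4si
f1 (v4si t, int i)
{
  i &= 3;
  v4si tt = { i, i, i, i };
  return __builtin_shuffle (t, tt);
}

/* Same result via a variable-index element extract followed by a
   splat; this is the form that goes through memory unless the target
   supports .VEC_EXTRACT with a variable index.  */
v4si
f2 (v4si t, int i)
{
  i &= 3;
  int e = t[i];
  v4si tt = { e, e, e, e };
  return tt;
}
```

Both forms should return the same vector for any i, which is why
rewriting the f2 pattern into the f1 pattern (or the reverse) is a
candidate transform in the first place.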

On x86_64 with SSE4.1 we get

f1:
.LFB0:
        .cfi_startproc
        andl    $3, %edi
        movd    %edi, %xmm2
        pshufd  $0, %xmm2, %xmm1
        pslld   $2, %xmm1
        pshufb  .LC1(%rip), %xmm1
        paddb   .LC2(%rip), %xmm1
        pshufb  %xmm1, %xmm0
        ret

f2:
.LFB1:
        .cfi_startproc
        andl    $3, %edi
        movaps  %xmm0, -24(%rsp)
        movd    -24(%rsp,%rdi,4), %xmm1
        pshufd  $0, %xmm1, %xmm0
        ret

I suspect the memory case is actually faster.  With AVX512VL this
improves to

f1:
.LFB0:
        .cfi_startproc
        andl    $3, %edi
        vmovdqa %xmm0, %xmm1
        vpbroadcastd    %edi, %xmm0
        vpermi2d        %xmm1, %xmm1, %xmm0
        ret

f2:
.LFB1:
        .cfi_startproc
        andl    $3, %edi
        vmovdqa %xmm0, -24(%rsp)
        vpbroadcastd    -24(%rsp,%rdi,4), %xmm0
        ret

AVX2 has the odd

f1:
.LFB0:
        .cfi_startproc
        andl    $3, %edi
        vinserti128     $1, %xmm0, %ymm0, %ymm0
        vmovd   %edi, %xmm2
        vpbroadcastd    %xmm2, %xmm1
        vinserti128     $1, %xmm1, %ymm1, %ymm1
        vpermd  %ymm0, %ymm1, %ymm0
        vzeroupper
        ret

where something feels wrong - f2 is similar to AVX512.  It's not clear whether
the f1 IL is better in the end.
