https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101846

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
expand_vec_perm_1 is supposed to generate 1 instruction, but it doesn't
consider the load of the const_vector, if we handle

(In reply to Hongtao.liu from comment #2)
> For foo, vmovdqa is avx_vec_concatv16si/2, and we can add
> define_insn_and_split to combine avx_vec_concatv16si/2 and
> avx512f_zero_extendv16hiv16si2_1, similar for other modes in
> pmovzx{bw,wd,dq}.
> 
> For bar, we need to match pmov{wb,dw,qd} in ix86_vectorize_vec_perm_const
> when only one operand is used and selector are truncate index, just like we
> did for pmovzx.
> 
> I'll take this.

For bar, when the upper bits are actually used, as in:
v32hi
foo_dw_512 (v32hi x)
{
  return __builtin_shufflevector (x, x,
                                  0, 2, 4, 6, 8, 10, 12, 14,
                                  16, 18, 20, 22, 24, 26, 28, 30,
                                  16, 17, 18, 19, 20, 21, 22, 23,
                                  24, 25, 26, 27, 28, 29, 30, 31);
}

The vpmovdw version still seems better:

-       vmovdqa64       %zmm0, %zmm1
-       vmovdqa64       .LC0(%rip), %zmm0
-       vpermi2w        %zmm1, %zmm1, %zmm0
+       vpmovdw %zmm0, %ymm1
+       vinserti64x4    $0x0, %ymm1, %zmm0, %zmm0
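
As a sanity check on why the two-instruction sequence can stand in for the
generic permute, here is a minimal scalar sketch in C. It is not GCC code:
the helper names (permute_ref, truncate_insert_model, models_agree) are
hypothetical, and the vector is modelled as a plain array of 32 16-bit
words, with dword i assumed to be built from words 2i (low) and 2i+1 (high),
as on little-endian x86.

```c
#include <stdint.h>
#include <string.h>

enum { NWORDS = 32 };  /* v32hi: 32 x 16-bit elements (512 bits) */

/* Selector copied from foo_dw_512. */
static const int sel[NWORDS] = {
   0,  2,  4,  6,  8, 10, 12, 14,
  16, 18, 20, 22, 24, 26, 28, 30,
  16, 17, 18, 19, 20, 21, 22, 23,
  24, 25, 26, 27, 28, 29, 30, 31
};

/* Reference semantics of the one-operand shuffle: out[i] = x[sel[i]]. */
void permute_ref(const uint16_t x[NWORDS], uint16_t out[NWORDS])
{
  for (int i = 0; i < NWORDS; i++)
    out[i] = x[sel[i]];
}

/* Scalar model of "vpmovdw %zmm0, %ymm1" followed by
   "vinserti64x4 $0x0, %ymm1, %zmm0, %zmm0".
   out must not alias x.  */
void truncate_insert_model(const uint16_t x[NWORDS], uint16_t out[NWORDS])
{
  /* Start from x itself: the upper 256 bits are left untouched. */
  memcpy(out, x, NWORDS * sizeof out[0]);

  /* Truncate each of the 16 dwords to its low 16 bits and place the
     results in the low 256 bits.  */
  for (int i = 0; i < NWORDS / 2; i++) {
    uint32_t dword = (uint32_t) x[2 * i] | ((uint32_t) x[2 * i + 1] << 16);
    out[i] = (uint16_t) dword;
  }
}

/* Returns 1 if both models agree on a sample input, 0 otherwise. */
int models_agree(void)
{
  uint16_t x[NWORDS], a[NWORDS], b[NWORDS];
  for (int i = 0; i < NWORDS; i++)
    x[i] = (uint16_t) (0x1000 + 7 * i);
  permute_ref(x, a);
  truncate_insert_model(x, b);
  return memcmp(a, b, sizeof a) == 0;
}
```

The low half of the selector picks the even words, which is exactly dword
truncation, and the high half is the identity on elements 16..31, which is
why inserting the truncated result into the low 256 bits of the original
register reproduces the shuffle.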

The conclusion holds for other 256/512-bit modes, but not for 128-bit modes:

-       vpshufb .LC2(%rip), %xmm0, %xmm0
+       vpmovdw %xmm0, %xmm1
+       vmovq   %xmm1, %rax
+       vpinsrq $0, %rax, %xmm0, %xmm0
