https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> So for set with T == int and N == 32 we could generate
> 
>         vmovd   %edi, %xmm1
>         vpbroadcastd    %xmm1, %ymm1
>         vpcmpeqd        .LC0(%rip), %ymm1, %ymm2
>         vpblendvb       %ymm2, %ymm1, %ymm0, %ymm0
>         ret
> 
> .LC0:
>         .long   0
>         .long   1
>         .long   2
>         .long   3
>         .long   4
>         .long   5
>         .long   6
>         .long   7
> 
> aka, with GCC generic vectors
> 
> V setg (V v, int idx, T val)
> {
>   V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
>   V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv);
>   v = (v & ~mask) | (valv & mask);
>   return v;
> }

I botched this up; the corrected version is

V setg (V v, int idx, T val)
{
  V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
  V valv = (V){val, val, val, val, val, val, val, val};
  V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);
  v = (v & ~mask) | (valv & mask);
  return v;
}

which produces

        vmovd   %edi, %xmm1
        vmovd   %esi, %xmm2
        vpbroadcastd    %xmm1, %ymm1
        vpbroadcastd    %xmm2, %ymm2
        vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
        vpblendvb       %ymm1, %ymm2, %ymm0, %ymm0

with AVX2, so one more vmovd/vpbroadcastd (as expected).  With -mavx512vl
this even becomes

        vpbroadcastd    %edi, %ymm1
        vpcmpd  $0, .LC0(%rip), %ymm1, %k1
        vpbroadcastd    %esi, %ymm0{%k1}
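
To reproduce the above, here is a minimal self-contained sketch of the
corrected setg. It assumes T == int and a 32-byte V, matching the N == 32
case; the typedefs, the -Wpsabi pragma and the helpers set_ref and
setg_matches_ref are illustrative additions, not part of the original
testcase.

```c
/* Sketch only: T, V and the helper names are assumptions (T == int, 32 bytes). */
#pragma GCC diagnostic ignored "-Wpsabi"  /* quiet the ABI note without -mavx2 */

typedef int T;
typedef T V __attribute__((vector_size(32)));  /* 8 x int */

/* Branch-free variable-index store: compare-and-blend instead of spilling. */
V setg (V v, int idx, T val)
{
  V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
  V valv = (V){val, val, val, val, val, val, val, val};
  V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);  /* -1 in lane idx, else 0 */
  v = (v & ~mask) | (valv & mask);
  return v;
}

/* Reference: the plain variable-index element store. */
V set_ref (V v, int idx, T val)
{
  v[idx] = val;
  return v;
}

/* Returns 1 if setg and set_ref agree for every in-range index. */
int setg_matches_ref (void)
{
  V v = {10, 11, 12, 13, 14, 15, 16, 17};
  for (int idx = 0; idx < 8; idx++)
    {
      V a = setg (v, idx, -42);
      V b = set_ref (v, idx, -42);
      for (int i = 0; i < 8; i++)
        if (a[i] != b[i])
          return 0;
    }
  return 1;
}
```

Note that the two only agree for 0 <= idx < 8: with an out-of-range idx,
setg leaves v unchanged, while v[idx] = val is undefined.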

For the extract case we really need to compute a variable permute mask,
which looks harder and possibly more expensive than the spill/load,
so the set case looks more important to tackle.  (Tackling it will still
eventually improve initial RTL generation by avoiding stack assignments
for locals.)
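
One possible source-level reading of the "variable permute mask" idea for
extract, sketched with GCC's __builtin_shuffle: broadcast idx into the
permute mask so that every lane selects v[idx], then read lane 0. This is
an assumption about what such a sequence could look like, not something the
comment spells out; extractg and extractg_matches_ref are hypothetical
names, and T == int with a 32-byte V is assumed as before.

```c
/* Sketch only: GCC-specific (__builtin_shuffle); names are hypothetical. */
#pragma GCC diagnostic ignored "-Wpsabi"  /* quiet the ABI note without -mavx2 */

typedef int T;
typedef T V __attribute__((vector_size(32)));  /* 8 x int */

/* Variable-index extract via a permute whose mask is idx in every lane. */
T extractg (V v, int idx)
{
  V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
  V perm = __builtin_shuffle (v, idxv);  /* every lane becomes v[idx] */
  return perm[0];
}

/* Returns 1 if extractg agrees with v[idx] for every in-range index. */
int extractg_matches_ref (void)
{
  V v = {10, 11, 12, 13, 14, 15, 16, 17};
  for (int idx = 0; idx < 8; idx++)
    if (extractg (v, idx) != v[idx])
      return 0;
  return 1;
}
```

Whether this beats the spill/load depends on how the target expands the
variable permute, which is exactly the cost concern raised above.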

> There's ongoing patch iteration on the ml adding variable index vec_set
> expanders for powerpc (and the related middle-end changes).  The question
> is whether optabs can try many things or the target should have the choice
> (probably better).
> 
> Eventually there's a more efficient way to generate {0, 1, 2, 3...}.
