On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
<[email protected]> wrote:
>
> Hi!
>
> On Thu, Sep 24, 2020 at 04:55:21PM +0200, Richard Biener wrote:
> > Btw, on x86_64 the following produces sth reasonable:
> >
> > #define N 32
> > typedef int T;
> > typedef T V __attribute__((vector_size(N)));
> > V setg (V v, int idx, T val)
> > {
> > V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
> > V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv);
> > v = (v & ~mask) | (valv & mask);
> > return v;
> > }
> >
> > vmovd %edi, %xmm1
> > vpbroadcastd %xmm1, %ymm1
> > vpcmpeqd .LC0(%rip), %ymm1, %ymm2
> > vpblendvb %ymm2, %ymm1, %ymm0, %ymm0
> > ret
> >
> > I'm quite sure you could do sth similar on power?
>
> This only allows inserting aligned elements. Which is probably fine
> of course (we don't allow elements that straddle vector boundaries
> either, anyway).
>
> And yes, we can do that :-)
>
> That should be
> #define N 32
> typedef int T;
> typedef T V __attribute__((vector_size(N)));
> V setg (V v, int idx, T val)
> {
> V valv = (V){val, val, val, val, val, val, val, val};
> V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
> V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);
> v = (v & ~mask) | (valv & mask);
> return v;
> }
Whoops, yeah - simplified it a bit too much ;)
> after which I get (-march=znver2)
>
> setg:
> vmovd %edi, %xmm1
> vmovd %esi, %xmm2
> vpbroadcastd %xmm1, %ymm1
> vpbroadcastd %xmm2, %ymm2
> vpcmpeqd .LC0(%rip), %ymm1, %ymm1
> vpandn %ymm0, %ymm1, %ymm0
> vpand %ymm2, %ymm1, %ymm1
> vpor %ymm0, %ymm1, %ymm0
> ret
With -march=znver2 -O2 I get
vmovd %edi, %xmm1
vmovd %esi, %xmm2
vpbroadcastd %xmm1, %ymm1
vpbroadcastd %xmm2, %ymm2
vpcmpeqd .LC0(%rip), %ymm1, %ymm1
vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
and with -mavx512vl
vpbroadcastd %edi, %ymm1
vpcmpd $0, .LC0(%rip), %ymm1, %k1
vpbroadcastd %esi, %ymm0{%k1}
Broadcast-with-mask - heh. It would be interesting if we managed
to combine v[idx1] = val; v[idx2] = val; into one of those ;)
A sketch of such a combined form is below.
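Sth like this (just a sketch, untested; setg2 and the idx1/idx2
parameters are my invention, N/T/V as in the corrected example above) -
two compares ORed into one mask, still a single blend:

V setg2 (V v, int idx1, int idx2, T val)
{
  V valv = (V){val, val, val, val, val, val, val, val};
  V idxv1 = (V){idx1, idx1, idx1, idx1, idx1, idx1, idx1, idx1};
  V idxv2 = (V){idx2, idx2, idx2, idx2, idx2, idx2, idx2, idx2};
  V iota = (V){0, 1, 2, 3, 4, 5, 6, 7};
  /* OR the two equality masks so both selected lanes get val.  */
  V mask = (iota == idxv1) | (iota == idxv2);
  v = (v & ~mask) | (valv & mask);
  return v;
}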
Now, with SSE4.2 the 16-byte case compiles to
setg:
.LFB0:
.cfi_startproc
movd %edi, %xmm3
movdqa %xmm0, %xmm1
movd %esi, %xmm4
pshufd $0, %xmm3, %xmm0
pcmpeqd .LC0(%rip), %xmm0
movdqa %xmm0, %xmm2
pandn %xmm1, %xmm2
pshufd $0, %xmm4, %xmm1
pand %xmm1, %xmm0
por %xmm2, %xmm0
ret
since there's no blend with a variable mask used here, IIRC
(the non-VEX pblendvb wants its mask in %xmm0).
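(The 16-byte source is sth like the following - a sketch, not
necessarily the exact source, just the same pattern with four lanes:)

#define N 16
typedef int T;
typedef T V __attribute__((vector_size(N)));
V setg (V v, int idx, T val)
{
  V valv = (V){val, val, val, val};
  V idxv = (V){idx, idx, idx, idx};
  V mask = ((V){0, 1, 2, 3} == idxv);
  v = (v & ~mask) | (valv & mask);
  return v;
}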
With aarch64 and SVE the 32-byte case isn't handled at all;
the 16-byte case compiles to
setg:
.LFB0:
.cfi_startproc
adrp x2, .LC0
dup v1.4s, w0
dup v2.4s, w1
ldr q3, [x2, #:lo12:.LC0]
cmeq v1.4s, v1.4s, v3.4s
bit v0.16b, v2.16b, v1.16b
which looks equivalent to the AVX2 code.
For all of those, varying the vector element type may also
cause "issues", I guess.
> .LC0:
> .long 0
> .long 1
> .long 2
> .long 3
> .long 4
> .long 5
> .long 6
> .long 7
>
> and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
>
> setg:
> addis 9,2,.LC0@toc@ha
> mtvsrws 32,5
> mtvsrws 33,6
> addi 9,9,.LC0@toc@l
> lxv 45,0(9)
> vcmpequw 0,0,13
> xxsel 34,34,33,32
> blr
>
> .LC0:
> .long 0
> .long 1
> .long 2
> .long 3
>
> (We can generate that 0..3 vector without doing loads; I guess x86 can
> do that as well? But it takes more than one insn to do (of course we
> have to set up the memory address first *with* the load, heh).)
>
> For power8 it becomes (we need to splat in separate insns):
>
> setg:
> addis 9,2,.LC0@toc@ha
> mtvsrwz 32,5
> mtvsrwz 33,6
> addi 9,9,.LC0@toc@l
> lxvw4x 45,0,9
> xxspltw 32,32,1
> xxspltw 33,33,1
> vcmpequw 0,0,13
> xxsel 34,34,33,32
> blr
>
>
> Segher