On Fri, Sep 25, 2020 at 08:58:35AM +0200, Richard Biener wrote:
> On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
> <[email protected]> wrote:
> > after which I get (-march=znver2)
> >
> > setg:
> > vmovd %edi, %xmm1
> > vmovd %esi, %xmm2
> > vpbroadcastd %xmm1, %ymm1
> > vpbroadcastd %xmm2, %ymm2
> > vpcmpeqd .LC0(%rip), %ymm1, %ymm1
> > vpandn %ymm0, %ymm1, %ymm0
> > vpand %ymm2, %ymm1, %ymm1
> > vpor %ymm0, %ymm1, %ymm0
> > ret
>
> I get with -march=znver2 -O2
>
> vmovd %edi, %xmm1
> vmovd %esi, %xmm2
> vpbroadcastd %xmm1, %ymm1
> vpbroadcastd %xmm2, %ymm2
> vpcmpeqd .LC0(%rip), %ymm1, %ymm1
> vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
Ah, maybe my x86 compiler is too old...
x86_64-linux-gcc (GCC) 10.0.0 20190919 (experimental)
Not exactly old, huh. I wonder what I am doing wrong then.
> Now, with SSE4.2 the 16byte case compiles to
>
> setg:
> .LFB0:
> .cfi_startproc
> movd %edi, %xmm3
> movdqa %xmm0, %xmm1
> movd %esi, %xmm4
> pshufd $0, %xmm3, %xmm0
> pcmpeqd .LC0(%rip), %xmm0
> movdqa %xmm0, %xmm2
> pandn %xmm1, %xmm2
> pshufd $0, %xmm4, %xmm1
> pand %xmm1, %xmm0
> por %xmm2, %xmm0
> ret
>
> since there's no blend with a variable mask IIRC.
PowerPC got at least *that* right since time immemorial :-)
> with aarch64 and SVE it doesn't handle the 32byte case at all,
> the 16byte case compiles to
>
> setg:
> .LFB0:
> .cfi_startproc
> adrp x2, .LC0
> dup v1.4s, w0
> dup v2.4s, w1
> ldr q3, [x2, #:lo12:.LC0]
> cmeq v1.4s, v1.4s, v3.4s
> bit v0.16b, v2.16b, v1.16b
>
> which looks equivalent to the AVX2 code.
Yes, and we can do pretty much the same on Power, too.
> For all of those varying the vector element type may also
> cause "issues" I guess.
For us, as long as it stays 16B vectors, all should be fine. There may
be issues in the compiler, but at least the hardware has no problem with
it ;-)
> > and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
> >
> > setg:
> > addis 9,2,.LC0@toc@ha
> > mtvsrws 32,5
> > mtvsrws 33,6
> > addi 9,9,.LC0@toc@l
> > lxv 45,0(9)
> > vcmpequw 0,0,13
> > xxsel 34,34,33,32
> > blr
The -mcpu=power10 code right now is just
plxv 45,.LC0@pcrel
mtvsrws 32,5
mtvsrws 33,6
vcmpequw 0,0,13
xxsel 34,34,33,32
blr
(exactly the same, but less memory address setup cost), so doing
something like this as a generic version would work quite well pretty
much everywhere I think!
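
As a rough illustration of the "generic version" idea, here is a minimal sketch in C using GCC's vector extensions. It follows the pattern all of the listings above share: broadcast the index and the value, compare the index against a constant vector of lane numbers, then select. The names setg_generic, v4si and lanes are hypothetical, and the original testcase is not shown in this mail, so its exact shape is an assumption.

```c
/* 16-byte vector of four 32-bit ints (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical generic form of setg: return v with lane idx set to val.
   An idx outside 0..3 matches no lane, so v comes back unchanged. */
static v4si setg_generic(v4si v, int idx, int val)
{
    const v4si lanes = {0, 1, 2, 3};     /* plays the role of .LC0 */
    v4si vidx = {idx, idx, idx, idx};    /* broadcast the index */
    v4si vval = {val, val, val, val};    /* broadcast the new value */
    v4si mask = (vidx == lanes);         /* -1 in the matching lane, 0 elsewhere */
    return (vval & mask) | (v & ~mask);  /* bitwise select */
}
```

The final and/andnot/or is what a target can turn into a single select instruction (vpblendvb, bit, xxsel) when it has one, and it is already the SSE fallback sequence shown above.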
Segher