https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Jakub Jelinek from comment #1)
> Created attachment 42296 [details]
> gcc8-pr82370.patch
>
> If VPAND is exactly as fast as VPANDQ except for different encodings, then
> maybe we can do something like this patch, where we'd use the suffixes only
> for 512-bit vectors, or when any of the operands is %[xy]mm16+, or when
> masking.
> If VPAND is slower, then we could do it for -Os at least.

They're exactly as fast on Skylake-AVX512, and there's no reason to ever
expect them to be slower on any future CPU.  VEX is well-designed and
future-compatible because it zeroes out to VLMAX, whatever that is on the
current CPU, so a VEX VPAND can always be decoded to exactly the same
internal uop as a VPANDQ with no masking.

There's no penalty for mixing VEX and EVEX in general, and no reason to
expect one (https://stackoverflow.com/q/46080327/224132).  Assemblers
already use the VEX encoding whenever possible for FP instructions like
vandps 15(%rsi), %xmm2, %xmm1, so AVX512VL code will typically contain a
mix of VEX and EVEX anyway.  (A rough size comparison of the two encodings
is sketched at the end of this comment.)

Related: vpxor %xmm0,%xmm0,%xmm0 is the best way to zero a ZMM register,
saving bytes (and potentially uops on some future AMD-style CPU).  See
PR 80636.

KNL doesn't even support AVX512VL, so it can only encode the ZMM version of
VPANDQ.  But according to Agner Fog's testing, VEX VPAND xmm/ymm has the
same throughput and latency as EVEX VPANDD/Q.

---

BTW, I was thinking about this again: it might be even better to use

    vmovdqu8  15(%rsi), %xmm1{%k1}{z}      # zero-masking load

Or not: IACA says it uses an ALU uop (port 0/1/5) as well as a load-port
uop, so it's break-even vs. VPAND except for setting up the mask constant
(probably movabs $0xaaaaaaaaaaaaaaaa, %rax; kmovq %rax, %k1, or maybe load
the mask from memory in one instruction, vs. a broadcast-load of a vector
constant; both versions are sketched at the end of this comment).  It might
save power, or it might use more.

A masked load won't fault on masked elements, so it would actually be safe
to do vmovdqu8 -1(%rsi), %zmm1{%k1}{z}, but performance-wise that's
probably not a win.  It probably still "counts" as crossing a cache-line
boundary on most CPUs, and it's probably quite slow if the CPU has to
suppress a fault from an unmapped page based on the mask.  (At least
VMASKMOVPS is like that.)

For stores, IACA says vmovdqu8 uses an extra ALU uop even with no masking.
gcc unfortunately uses it when auto-vectorizing the pure C version:
https://godbolt.org/g/f4bJKd
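
To make the encoding-size point concrete, here's a rough sketch of the same
AND written both ways (register choices are just illustrative; the bytes
are what I'd expect GAS to emit for these particular low registers):

    # VEX: 2-byte prefix is possible (0F map, low regs, no W1) -> 4 bytes
    vpand   %xmm1, %xmm2, %xmm0        # c5 e9 db c1
    # EVEX: prefix is always 4 bytes -> 6 bytes total; only needed for zmm,
    # for %[xyz]mm16+, or for masking
    vpandq  %xmm1, %xmm2, %xmm0        # 62 f1 ed 08 db c1

So preferring the VEX form whenever it's encodable saves 2 bytes per
instruction with no change in the uop that gets executed.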
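
And for the masked-load idea above, a minimal sketch of the two
alternatives.  The 0xaaaaaaaaaaaaaaaa value and the .Lmask label are just
placeholders for whatever mask the real code needs, and the register
assignments are illustrative:

    # zero-masking load: the mask lives in a k register (AVX512BW+VL);
    # only the low 16 mask bits matter for an xmm-width byte load
    movabs   $0xaaaaaaaaaaaaaaaa, %rax     # placeholder mask pattern
    kmovq    %rax, %k1                     # or kmovq from memory instead
    vmovdqu8 15(%rsi), %xmm1{%k1}{z}       # 1 load uop + 1 ALU uop per IACA

    # VPAND alternative: the mask lives in a vector register instead
    vpbroadcastq .Lmask(%rip), %xmm2       # hypothetical 8-byte mask constant
    vpand    15(%rsi), %xmm2, %xmm1        # also 1 load uop + 1 ALU uop

Either way it's one setup instruction outside the loop plus a 2-uop
load+ALU operation inside it, which is why IACA calls it break-even.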