https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Jakub Jelinek from comment #1)
> Created attachment 42296 [details]
> gcc8-pr82370.patch
>
> If VPAND is exactly as fast as VPANDQ except for different encodings, then
> maybe we can do something like this patch, where we'd use the suffixes only
> for 512-bit vectors, or when any of the operands is %[xy]mm16+, or when
> masking.
> If VPAND is slower, then we could do it for -Os at least.

They're exactly as fast on Skylake-AVX512, and there's no reason to ever
expect them to be slower on any future CPU.  VEX is well-designed and
future-compatible because it zeroes out to VLMAX, whatever that is on the
current CPU, so a VEX VPAND can always be decoded to exactly the same
internal uop as a VPANDQ with no masking.

There's no penalty for mixing VEX and EVEX in general, and no reason to
expect one (https://stackoverflow.com/q/46080327/224132).  Assemblers
already use the VEX encoding whenever possible for FP instructions like
vandps 15(%rsi), %xmm2, %xmm1, so AVX512VL code will typically contain a
mix of VEX and EVEX anyway.  (A rough size comparison of the two encodings
is sketched at the end of this comment.)

Related: vpxor %xmm0,%xmm0,%xmm0 is the best way to zero a ZMM register,
saving bytes (and potentially uops on some future AMD-style CPU).  See
PR 80636.

KNL doesn't even support AVX512VL, so it can only encode the ZMM version of
VPANDQ.  But according to Agner Fog's testing, VEX VPAND xmm/ymm has the
same throughput and latency as EVEX VPANDD/Q.

---

BTW, I was thinking about this again: it might be even better to use

    vmovdqu8  15(%rsi), %xmm1{%k1}{z}      # zero-masking load

Or not: IACA says it uses an ALU uop (port 0/1/5) as well as a load-port
uop, so it's break-even vs. VPAND except for setting up the mask constant
(probably movabs $0xaaaaaaaaaaaaaaaa, %rax; kmovq %rax, %k1, or maybe load
the mask from memory in one instruction, vs. a broadcast-load of a vector
constant; both versions are sketched at the end of this comment).  It might
save power, or it might use more.

A masked load won't fault on masked elements, so it would actually be safe
to do vmovdqu8 -1(%rsi), %zmm1{%k1}{z}, but performance-wise that's
probably not a win.  It probably still "counts" as crossing a cache-line
boundary on most CPUs, and it's probably quite slow if the CPU has to
suppress a fault from an unmapped page based on the mask.  (At least
VMASKMOVPS is like that.)

For stores, IACA says vmovdqu8 uses an extra ALU uop even with no masking.
gcc unfortunately uses it when auto-vectorizing the pure C version:
https://godbolt.org/g/f4bJKd
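
To make the encoding-size point concrete, here's a rough sketch of the same
AND written both ways (register choices are just illustrative; the bytes
are what I'd expect GAS to emit for these particular low registers):

    # VEX: 2-byte prefix is possible (0F map, low regs, no W1) -> 4 bytes
    vpand   %xmm1, %xmm2, %xmm0        # c5 e9 db c1
    # EVEX: prefix is always 4 bytes -> 6 bytes total; only needed for zmm,
    # for %[xyz]mm16+, or for masking
    vpandq  %xmm1, %xmm2, %xmm0        # 62 f1 ed 08 db c1

So preferring the VEX form whenever it's encodable saves 2 bytes per
instruction with no change in the uop that gets executed.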
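
And for the masked-load idea above, a minimal sketch of the two
alternatives.  The 0xaaaaaaaaaaaaaaaa value and the .Lmask label are just
placeholders for whatever mask the real code needs, and the register
assignments are illustrative:

    # zero-masking load: the mask lives in a k register (AVX512BW+VL);
    # only the low 16 mask bits matter for an xmm-width byte load
    movabs   $0xaaaaaaaaaaaaaaaa, %rax     # placeholder mask pattern
    kmovq    %rax, %k1                     # or kmovq from memory instead
    vmovdqu8 15(%rsi), %xmm1{%k1}{z}       # 1 load uop + 1 ALU uop per IACA

    # VPAND alternative: the mask lives in a vector register instead
    vpbroadcastq .Lmask(%rip), %xmm2       # hypothetical 8-byte mask constant
    vpand    15(%rsi), %xmm2, %xmm1        # also 1 load uop + 1 ALU uop

Either way it's one setup instruction outside the loop plus a 2-uop
load+ALU operation inside it, which is why IACA calls it break-even.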