https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
            Bug ID: 82137
           Summary: auto-vectorizing shuffles way too much to avoid duplicate work
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

(Same code as bug 82136, but with aligned pointers, and discussing the overall vectorization strategy.)

static const int aligned = 1;
void pairs_double(double blocks[]) {
    if (aligned)
        blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0; i < 10240; i += 2) {
        double x = blocks[i];
        double y = blocks[i+1];
        blocks[i]   = x*y;
        blocks[i+1] = x+y;
    }
}

https://godbolt.org/g/pTY5iU has this plus a uint64_t version (with ^ and &), i.e. load a pair of 64-bit elements and replace them with 2 different combinations of the pair. See
https://stackoverflow.com/questions/46038401/unpack-m128i-m256i-to-m64-mmx-sse2-avx2.

gcc auto-vectorizes this with *way* too much shuffling when AVX / AVX2 are available, and the integer version is just as bad. It's especially bad with AVX1 only, but even the AVX2 code is a mess and will totally bottleneck on shuffle throughput on Intel and AMD CPUs. (Ryzen has good shuffle throughput, but its lane-crossing shuffles are much more expensive than in-lane ones.)

gcc 8.0.8 -O3 -march=haswell:

pairs_double:
        leaq    81920(%rdi), %rax
.L2:
        vmovapd (%rdi), %ymm1
        vmovapd 32(%rdi), %ymm0
        addq    $64, %rdi
        vunpcklpd       %ymm0, %ymm1, %ymm2
        vunpckhpd       %ymm0, %ymm1, %ymm0
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vpermpd $68, %ymm0, %ymm3
        vpermpd $238, %ymm0, %ymm0
        vpermpd $68, %ymm1, %ymm2
        vpermpd $238, %ymm1, %ymm1
        vshufpd $12, %ymm3, %ymm2, %ymm2
        vshufpd $12, %ymm0, %ymm1, %ymm0
        vmovapd %ymm2, -64(%rdi)
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret

I haven't worked through the shuffle constants to see exactly how it's shuffling, but the same function with uint64_t (and operators ^ and &) uses 4 shuffles per 2 vectors after the actual work:

        # vpand / vpxor results in ymm0 / ymm1
        vpunpcklqdq     %ymm0, %ymm1, %ymm2
        vpunpckhqdq     %ymm0, %ymm1, %ymm0
        vperm2i128      $32, %ymm0, %ymm2, %ymm1
        vperm2i128      $49, %ymm0, %ymm2, %ymm0
        vmovdqa %ymm1, -64(%rdi)
        vmovdqa %ymm0, -32(%rdi)

It uses equivalent shuffles to prepare for vpand/vpxor, so vunpckl/hpd + 2x vperm2f128 should do the same shuffle for the FP version.

Using split stores would help a lot on Intel and AMD: 2x vmovdqa xmm, then 2x vextractf128 xmm. On SnB/HSW/SKL, vextractf128 to memory is 2 fused-domain uops (Agner Fog's tables), but there's no ALU uop: it's just a non-micro-fused store. Since gcc's code horribly bottlenecks on shuffles, split stores would be a win here.

It's a huge win on Ryzen, where vperm2f128 is 8 uops with one per 3c throughput. (-march=znver1 enables -mprefer-avx128, but -march=haswell binaries running on Ryzen are probably a relevant use-case. When all else is equal for Intel, do something that doesn't suck on Ryzen.)

----

However, a different strategy entirely is *MUCH* better for most CPUs, and as a bonus it needs only AVX1 and no lane-crossing shuffles: do far less shuffling in exchange for doing more mul/add (or whatever the operation is), by allowing duplicates. Doing the same operation twice is allowed, even when it can raise FP exceptions (in C and C++).

  b0      b1     |   b2      b3        # load 256b
  b1      b0     |   b3      b2        # vshufpd or vpermilpd for in-lane swap
 b0*b1   b1*b0   |  b2*b3   b3*b2      # vmulpd
 b0+b1   b1+b0   |  b2+b3   b3+b2      # vaddpd
 b0*b1   b0+b1   |  b2*b3   b2+b3      # vblendpd
                                       # store 256b

Per two 128b pairs, that's 1 load, 1 shuffle, 2x FP math, 1 immediate-blend, and 1 store.
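
As a concrete illustration, here's roughly what that strategy looks like written with AVX1 intrinsics. This is a hand-written, untested sketch, not anything gcc currently emits; the name pairs_double_manual and the hard-coded trip count just mirror the testcase above.

#include <immintrin.h>

/* Hand-written sketch of the low-shuffle strategy (illustration only, not gcc
 * output).  Needs only AVX1: per two pairs it does one 256b load, one in-lane
 * shuffle, one mul, one add, one immediate blend, and one 256b store. */
void pairs_double_manual(double blocks[])
{
    blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0; i < 10240; i += 4) {
        __m256d v    = _mm256_load_pd(&blocks[i]);      /* b0    b1    | b2    b3    */
        __m256d swap = _mm256_permute_pd(v, 0x5);       /* b1    b0    | b3    b2    (vpermilpd) */
        __m256d mul  = _mm256_mul_pd(v, swap);          /* b0*b1 b1*b0 | b2*b3 b3*b2 */
        __m256d add  = _mm256_add_pd(v, swap);          /* b0+b1 b1+b0 | b2+b3 b3+b2 */
        __m256d res  = _mm256_blend_pd(mul, add, 0xA);  /* b0*b1 b0+b1 | b2*b3 b2+b3 (vblendpd) */
        _mm256_store_pd(&blocks[i], res);
    }
}

The blend immediate 0xA takes the add result for the odd elements and keeps the mul result for the even ones, matching the scalar store pattern.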
Having no lane-crossing shuffles at all is a huge plus for Ryzen (vperm2f128 is very slow there).

That's 6 fused-domain uops on Intel, and should easily run at 1 iteration per 1.5 cycles (+ loop overhead) without bottlenecking on any port; the bottleneck is the front-end, so unrolling helps. That's 0.75c per 128b pair. (Slower on SnB/IvB, which bottleneck on load/store throughput.)

Immediate-blend can run on any vector ALU port on HSW+, or p0/p5 on SnB (and has good throughput on BD/Ryzen), so it's much more throughput-friendly than using vunpcklpd to mix the mul and add result vectors. If the operation weren't commutative (e.g. sub instead of add), you might have to use vunpckl/hpd instead.

This same strategy is also optimal for integer booleans with AVX2. (Or possibly even with AVX1, but using VXORPS / VANDPS will bottleneck on port 5 on SnB through Broadwell, since it wasn't until Skylake that FP booleans can run on p015.)
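
For completeness, a similar hypothetical sketch of the uint64_t ^/& version with AVX2 integer instructions (untested; I'm assuming the scalar loop stores x&y in the even element and x^y in the odd one, mirroring the mul/add placement above -- swap the blend operands if the godbolt testcase has them the other way around):

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical AVX2 sketch of the same strategy for the uint64_t testcase.
 * Assumes blocks[i] = x&y and blocks[i+1] = x^y. */
void pairs_u64_manual(uint64_t blocks[])
{
    blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0; i < 10240; i += 4) {
        __m256i v    = _mm256_load_si256((const __m256i *)&blocks[i]);
        /* vpshufd: swap the two qwords within each 128b lane */
        __m256i swap = _mm256_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
        __m256i andv = _mm256_and_si256(v, swap);    /* vpand */
        __m256i xorv = _mm256_xor_si256(v, swap);    /* vpxor */
        /* vpblendd: qwords 0,2 from the AND result, qwords 1,3 from the XOR result */
        __m256i res  = _mm256_blend_epi32(andv, xorv, 0xCC);
        _mm256_store_si256((__m256i *)&blocks[i], res);
    }
}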