https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68928
            Bug ID: 68928
           Summary: AVX loops on unaligned arrays could generate more
                    efficient startup/cleanup code when peeling
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86-64-*-*

I have some suggestions for better code that gcc could use for the
prologue/epilogue when vectorizing loops over unaligned buffers.  I haven't
looked at gcc's code, just the output, so IDK how one might get gcc to
implement these.

---------

Consider the following code:

#include <immintrin.h>

typedef float float_align32 __attribute__ ((aligned (32)));
void floatmul_aligned(float_align32 *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

void floatmul(float *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

g++ 5.3.0 -O3 -march=sandybridge emits what you'd expect for the aligned
version:

floatmul_aligned(float*):
        leaq    4096(%rdi), %rax
.L2:
        vmovaps (%rdi), %ymm0
        addq    $32, %rdi
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret

*** off-topic ***

It unfortunately uses 5 uops in the loop, meaning it can only issue one
iteration per 2 clocks.  Other than unrolling, it would probably be more
efficient to broadcast 2.0f into %ymm1 and use vmulps (%rdi), %ymm1, %ymm0,
avoiding the separate load.  Doing the loop in reverse order, with an indexed
addressing mode counting an index down to zero, would also keep the loop
overhead down to one decrement-and-branch uop.  I know compilers are allowed
to re-order memory accesses, so I assume this would be allowed.  However,
this wouldn't actually help on Sandybridge, since it seems that two-register
addressing modes might not micro-fuse on SnB-family CPUs
(http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes;
Agner Fog says he tested and found that 2-reg addressing modes do micro-fuse,
and he's probably right, but IDK what's wrong with my experiment using perf
counters).  That would make the store 2 uops.

*** back on topic ***

Anyway, that wasn't even what I meant to report.  The unaligned case peels
off the potentially-unaligned start/end iterations and unrolls them into a
giant amount of code.  This is unlikely to be optimal outside of
microbenchmarks, since CPUs with a uop cache suffer from excessive unrolling.

floatmul(float*):
        movq    %rdi, %rax
        andl    $31, %eax
        shrq    $2, %rax
        negq    %rax
        andl    $7, %eax
        je      .L12
        vmovss  (%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, (%rdi)
        cmpl    $1, %eax
        je      .L13
        vmovss  4(%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, 4(%rdi)
        cmpl    $2, %eax
        je      .L14
        vmovss  8(%rdi), %xmm0
        ... repeated up to cmpl $6, %eax ...
        ... some loop setup ...
.L9:
        vmovaps (%rcx,%rax), %ymm0
        addl    $1, %edx
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, (%rcx,%rax)
        addq    $32, %rax
        cmpl    %esi, %edx
        jb      .L9
        ... another fully-unrolled up-to-7-iteration cleanup loop ...

Notice that the vectorized part of the loop now has 6 uops (or 7, if the
store can't micro-fuse), so gcc is even farther from getting this loop to
run at one cycle per iteration.  (That should be possible on Haswell; on
SnB/IvB (and AMD Bulldozer-family), a 256b store takes two cycles anyway.)
Is there any experimental evidence that fully unrolling to make this much
code is beneficial?

The most obvious way to improve on this would be to use a 128b xmm vector
for the first 4 iterations of the prologue/epilogue loops.  Even simply not
unrolling the 7-iteration alignment loops might be a win.
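To make that suggestion concrete, here is a rough sketch in source form (my
own illustrative code, not anything gcc emits): the peel count is computed
the same way as the andl/shrq/negq/andl sequence above, one unaligned 128b
operation covers 4 of the peeled elements when there are at least 4, and at
most 3 scalar iterations remain.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: a prologue that peels with one unaligned 128b op plus at
       most 3 scalar iterations, instead of a fully-unrolled 7-iteration
       chain.  The peel-count math mirrors gcc's prologue above; everything
       else (name, structure) is illustrative. */
    static float *peel_prologue(float *a)
    {
        size_t peel = (-((uintptr_t)a >> 2)) & 7;  /* floats until a is 32B-aligned */
        size_t i = 0;
        if (peel >= 4) {                 /* one unaligned 128b op handles 4 of them */
            __m128 v = _mm_loadu_ps(a);
            _mm_storeu_ps(a, _mm_add_ps(v, v));
            i = 4;
        }
        for (; i < peel; i++)            /* at most 3 scalar iterations remain */
            a[i] *= 2;
        return a + peel;                 /* now 32B-aligned; main loop starts here */
    }

Whether this actually beats the fully-unrolled version would need measuring;
the point is just that the long branchy unrolled chain isn't the only option.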
Every unrolled iteration still has a compare-and-branch, so a rolled loop
counting down to zero would have the same per-iteration overhead.  All that
changes is branch prediction (one taken branch and many not-taken, vs. a
single loop branch taken n times).

AVX introduces a completely different way to handle this, though: VMASKMOVPS
is usable now, since it doesn't have the non-temporal hint that makes the
SSE version of it nearly useless.  According to Agner Fog's insn tables,
vpmaskmov %ymm, %ymm, m256 is only 4 uops, and has a throughput of one per 2
cycles (SnB/IvB/Haswell).  It's quite slow (as a store) on AMD
Bulldozer-family CPUs, though, so this might only be appropriate with
-mtune= targets other than AMD.

The trouble is turning a misalignment count into a mask.  Most of the useful
instructions (like PSRLDQ to use on a vector of all-ones) are only available
with immediate counts.  Keeping an array of 7 256b masks seems like a big
waste, and having a switch() to run the byte-shift instruction with one of 7
different immediate operands also sucks.  (And it doesn't work anyway,
because the 256b version shifts within each 128b lane rather than across the
whole register.)  PSLLQ can take a shift count in the low qword of an xmm
register, but I'm not sure it helps.

My best idea for generating a mask for VMASKMOVPS requires AVX2: broadcast
the misalignment count to all bytes of a ymm register (VPBROADCASTB), then
use VPCMPGTB with another 256b constant (LSB first):

  { 7,7,7,7, 6,6,6,6, 5,5,5,5, 4,4,4,4, 3,3,3,3, 2,2,2,2, 1,1,1,1, 0,0,0,0 }

The least significant 4B will always be 0, since the count is in the 0-7
range and 7>7 is false.  (Or put the count into the 1-8 range so we do 32B
of useful work in the aligned case, instead of zero, without branching.)

The constant could shrink to 64b {7,6,5,4,3,2,1,0}, with an extra
instruction:

    vmovd        %eax, %xmm0      # %eax = misalignment count
    vpbroadcastb %xmm0, %xmm0     # broadcast the low byte to 128b
    vpcmpgtb     .constant, %xmm0, %xmm1
        # make sure the 64b .constant is at least 128b from the end of a cache line
        # Our mask is in the low 8 bytes of %xmm1.  The upper 64b is garbage.
    vpmovsxbd    %xmm1, %ymm1     # all-ones sign-extends to all-ones

Or if reading "garbage" is too hacky, expand to 256b sooner:

    vmovd        %eax, %xmm0      # %eax = misalignment count
    vpbroadcastd %xmm0, %ymm0     # broadcast the low 32b
    vpmovzxbd    .constant, %ymm1 # 2 uops, unlike the other forms
    vpcmpgtd     %ymm1, %ymm0, %ymm2

Haswell, first version (latency / reciprocal throughput in cycles):
    vmovd:             1 uop (p5),  1/1
    vpbroadcastb x,x:  1 uop (p5),  1/1
    vpcmpgtb x,mem:    1 uop (p15), 1/0.5
    vpmovsxbd y,x:     1 uop (p5),  3/1
  total: 4 uops, 6c latency, 1 per 2c throughput (bottleneck on p5).

Second version:
    vmovd:               1 uop,            1/1
    vpbroadcastd y,x:    1 uop,            3/1
    vpcmpgtd y,y:        1 uop,            1/0.5
    vpmovzxbd ymm, m64:  2 uops (p5+p23),  ?/1
  total: 5 uops, 5c latency, 1 per 2c throughput (saturating p5).  The movzx
  is off the critical path, but can't micro-fuse.

Without AVX2:

    #### untested / possibly buggy
    imul        $0x1010101, %edx, %eax   # broadcast the misalignment count to the 4 bytes of %eax
    vmovd       %eax, %xmm0
    vpshufd     $0, %xmm0, %xmm0         # broadcast to all bytes of xmm0
    vpcmpgtb    .constant2, %xmm0, %xmm1 # constant2 = .byte 0,0, 1,1, ...
    vpmovsxwd   %xmm1, %xmm2             # [32b:count>0, 32b:count>1, ...]
    vpunpckhwd  %xmm1, %xmm1, %xmm3      # [32b:count>4, 32b:count>5, ...]
    vinsertf128 $1, %xmm3, %ymm2, %ymm2  # combine the masks: low half from xmm2, high half from xmm3

    # do a masked load, too, to avoid possible NaN slowdowns if there's garbage before the array
    vmaskmovps  (%rdi), %ymm2, %ymm3     # rdi = p & ~0x1F: rounded down to previous 32B boundary
    vaddps      %ymm3, %ymm3, %ymm3
    vmaskmovps  %ymm3, %ymm2, (%rdi)
mainloop:
    ...
    # epilogue: invert the mask
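To summarize the masked-prologue idea in one place, here is a rough
intrinsics rendering at dword granularity (my own sketch, untested; it
assumes AVX2 for the broadcast/compare and plain AVX for vmaskmovps, and
uses the 1-8 count range suggested above so the aligned case does a full
32B of useful work):

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch of the masked prologue (untested, names illustrative).  The
       top `count` lanes of the 32B chunk containing a[0] are the ones that
       belong to the array, so those are the lanes the compare enables. */
    static float *masked_prologue(float *a)
    {
        float *chunk = (float *)((uintptr_t)a & ~(uintptr_t)31); /* round down to 32B */
        int count = 8 - (int)(a - chunk);            /* 1..8 useful lanes (8 if aligned) */
        const __m256i thresh = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(count), thresh);
                                                     /* lane i enabled iff count > 7-i */
        __m256 v = _mm256_maskload_ps(chunk, mask);  /* masked load: don't touch lanes before the array */
        _mm256_maskstore_ps(chunk, mask, _mm256_add_ps(v, v));
        return chunk + 8;                            /* aligned start for the main loop */
    }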
Going back to the non-AVX2 sequence: an alternative is to skip the
vinsertf128 and do two separate vaddps xmm / vmaskmovps pairs, but that's
probably worse.  The addresses used with vmaskmovps will be aligned: it's
the masking that takes care of the unaligned accesses.

Also, we could broadcast the count just to words and use vpcmpgtw.  We're
using words rather than bytes because there's no single instruction to
pmovsxbd from the 2nd 32b chunk of a source register (to unpack the high
half of the mask); punpckh same,same can read from the high half of a
register and double the element size, though.  An extra pmovsxbw or
punpcklbw would let us use a 64b constant.

Anyway, IDK if this idea is generally useful for gcc to handle arrays that
aren't guaranteed to be aligned.  Things should probably be arranged so that
in the aligned case, either the mask generation and vmaskmovps are skipped
altogether, or the vmaskmovps does a full 256b load/store rather than a
fully-masked 0-byte load/store.  Besides the obvious reason of avoiding
wasted work, AMD Jaguar's VMASKMOVPS takes ~300 clocks for a load with
mask=0, vs. 2 clocks (15 cycle latency) in the normal case.  A VMASKMOVPS
256b store on Jaguar has one per 22c throughput and takes 36 m-ops, so it's
not worth using at all when targeting Jaguar, but avoiding Jaguar's
catastrophic mask=0 case is probably worth doing even when tuning for
something else.
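For completeness, a matching epilogue sketch (again my own, untested) that
enables only the low `left` lanes of the final partial 32B chunk, and
branches around the masked ops entirely when there is nothing left, so the
mask=0 case never reaches the hardware:

    #include <immintrin.h>

    /* Sketch of the corresponding epilogue (untested, names illustrative):
       the last partial 32B chunk keeps only its low `left` lanes.  Skipping
       left == 0 avoids both the wasted work and Jaguar's ~300-clock
       all-zero-mask load. */
    static void masked_epilogue(float *chunk, int left)  /* left = leftover floats, 0..7 */
    {
        if (left == 0)
            return;
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(left), lane); /* lane i on iff i < left */
        __m256 v = _mm256_maskload_ps(chunk, mask);
        _mm256_maskstore_ps(chunk, mask, _mm256_add_ps(v, v));
    }

The early-out also covers the case where the trip count is a multiple of 8
and the epilogue would otherwise do zero useful work.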