https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82356
Bug ID: 82356
Summary: auto-vectorizing pack of 16->8 has a redundant AND after a shift
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <stdint.h>

void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

https://godbolt.org/g/yoJZ3C

gcc -O3 auto-vectorizes to this inner loop:

.L5:
        movdqa    (%rdx,%rax,2), %xmm0
        movdqa    16(%rdx,%rax,2), %xmm1
        psrlw     $8, %xmm0
        pand      %xmm2, %xmm0        # redundant with the shift
        psrlw     $8, %xmm1
        pand      %xmm2, %xmm1        # redundant with the shift
        packuswb  %xmm1, %xmm0
        movups    %xmm0, (%rcx,%rax)
        addq      $16, %rax
        cmpq      %rsi, %rax
        jne       .L5

This is mostly good, but the PAND instructions are redundant: psrlw by 8 already leaves the high byte of each 16-bit element zeroed, so every word is already in the 0..255 range and packuswb can't saturate.

The same extra AND is present when auto-vectorizing for AVX2 (but not for AVX512, which uses a different strategy).  Other than that, the AVX2 vectorization strategy looks very good (packus on ymm registers, then vpermq to fix up the result).

If the input is 32B-aligned (or 64B-aligned for AVX2), one of the load+shift pairs can be *replaced* with an unaligned load offset by 1 byte plus an AND.  This avoids a bottleneck on shift throughput (at least it does with unrolling; without unrolling we bottleneck on the front-end everywhere except Ryzen).  It's even better with AVX, because the load can fold into the AND as a memory operand.  See https://stackoverflow.com/a/46477080/224132 for more details.

This C source produces an inner loop that I think should be very good across K10, Bulldozer, Ryzen, Nehalem, Sandybridge, HSW/SKL, Jaguar, Atom, and Silvermont, with SSE2 or AVX, i.e. it should be great for tune=generic after reaching a 32B boundary.  It's not great on Core2 or K8, where movdqu costs more even when it doesn't cross a cache line.

#include <stdint.h>
#include <immintrin.h>

// Take both args as uint8_t* so we can offset a load by 1 byte to replace a shift with an AND.
// If src is 32B-aligned, neither load ever crosses a cache-line boundary.
void pack_high8_alignhack(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     // bytes 1..16: the high byte of elements 0..7 is already in the low byte of each word
     __m128i v0_offset = _mm_loadu_si128((const __m128i*)(src+1));
     // bytes 16..31: this load should be aligned
     __m128i v1 = _mm_loadu_si128((const __m128i*)(src+16));
     __m128i v0 = _mm_and_si128(v0_offset, _mm_set1_epi16(0x00FF)); // high bytes of elements 0..7
     v1 = _mm_srli_epi16(v1, 8);                                    // high bytes of elements 8..15
     __m128i pack = _mm_packus_epi16(v0, v1);
     _mm_storeu_si128((__m128i*)dst, pack);
     dst += 16;
     src += 32;   // 32 source bytes -> 16 dst bytes per iteration
  } while(dst < end_dst);
}
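
For reference, here is a minimal sketch (not from the original report) of how the same align hack could look with AVX2, assuming the strategy described above: vpackuswb on 256-bit vectors, with a vpermq using _MM_SHUFFLE(3,1,2,0) to undo the in-lane packing.  The function name, the 64B-alignment requirement on src, and the assumption that bytes is a multiple of 32 are illustrative choices, not something taken from the report or from GCC's output.

#include <stdint.h>
#include <immintrin.h>

// Hypothetical AVX2 version of the align hack (a sketch, not from the report).
// Assumes bytes is a multiple of 32 and src is 64B-aligned, so neither load
// crosses a cache-line boundary.
void pack_high8_alignhack_avx2(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     // bytes 1..32: high bytes of elements 0..15 are already in the low byte of each word
     __m256i v0 = _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(src+1)),
                                   _mm256_set1_epi16(0x00FF));
     // bytes 32..63: shift the high bytes of elements 16..31 down to the low byte of each word
     __m256i v1 = _mm256_srli_epi16(_mm256_loadu_si256((const __m256i*)(src+32)), 8);
     // vpackuswb packs within each 128-bit lane, so the 8-byte chunks come out in 0,2,1,3 order
     __m256i pack = _mm256_packus_epi16(v0, v1);
     pack = _mm256_permute4x64_epi64(pack, _MM_SHUFFLE(3,1,2,0)); // restore 0,1,2,3 chunk order
     _mm256_storeu_si256((__m256i*)dst, pack);
     dst += 32;
     src += 64;   // 64 source bytes -> 32 dst bytes per iteration
  } while(dst < end_dst);
}

Built with -mavx2, the unaligned load can fold into vpand as a memory operand, matching the "load+AND can fold into one instruction" point above.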
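
And a throwaway correctness check (again not part of the report) that compares the align-hack loop against the scalar baseline.  It assumes both functions above are compiled into the same program; the buffer size of 1024 output bytes is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

void pack_high8_baseline(uint8_t *dst, const uint16_t *src, size_t bytes);
void pack_high8_alignhack(uint8_t *dst, const uint8_t *src, size_t bytes);

int main(void)
{
    enum { N = 1024 };                                   // number of output bytes
    uint16_t *src = aligned_alloc(64, N * sizeof(uint16_t));  // 64B-aligned source
    uint8_t ref[N], out[N];

    srand(1);
    for (int i = 0; i < N; i++)
        src[i] = (uint16_t)rand();

    pack_high8_baseline(ref, src, N);
    pack_high8_alignhack(out, (const uint8_t*)src, N);

    if (memcmp(ref, out, N) != 0) {
        puts("MISMATCH");
        return 1;
    }
    puts("ok");
    free(src);
    return 0;
}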