https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119103
Bug ID: 119103
Summary: Very suboptimal AVX2 code generation of simple shift loop
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: gcc at haasn dot dev
Target Milestone: ---

== Summary ==

On x86_64 with -mavx2, GCC has a very hard time optimizing a shift by a
small unknown unsigned amount, even when I add knowledge that the shift
amount is sufficiently small. In particular, GCC always chooses vpslld
instead of vpsllw, and there seems to be no way to convince it otherwise
short of hand-written asm or intrinsics.

See demonstration here: https://godbolt.org/z/4YobqhsG4

== Code ==

#include <stdint.h>

void lshift(uint16_t *x, uint8_t amount)
{
    if (amount > 15)
        __builtin_unreachable();

    for (int i = 0; i < 16; i++)
        x[i] <<= amount;
}

== Output of `gcc -O3 -mavx2 -ftree-vectorize` ==

lshift:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        movzx   eax, sil
        vmovq   xmm2, rax
        vpmovzxwd       ymm0, xmm1
        vextracti128    xmm1, ymm1, 0x1
        vpmovzxwd       ymm1, xmm1
        vpslld  ymm0, ymm0, xmm2
        vpslld  ymm1, ymm1, xmm2
        vpxor   xmm2, xmm2, xmm2
        vpblendw        ymm0, ymm2, ymm0, 85
        vpblendw        ymm2, ymm2, ymm1, 85
        vpackusdw       ymm0, ymm0, ymm2
        vpermq  ymm0, ymm0, 216
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret

== Expected result ==

lshift:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        movzx   esi, sil
        vmovd   xmm0, esi
        vpsllw  ymm0, ymm1, xmm0
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret

Compiled from:

#include <immintrin.h>
#include <stdint.h>

void lshift(uint16_t *x, uint8_t amount)
{
    __m256i data = _mm256_loadu_si256((__m256i *) x);
    __m128i shift_amount = _mm_cvtsi32_si128(amount);
    __m256i shifted = _mm256_sll_epi16(data, shift_amount);
    _mm256_storeu_si256((__m256i *) x, shifted);
}
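For completeness, here is a minimal sketch (not part of the original report)
that checks the scalar loop against the intrinsics version for every legal
shift amount 0..15, confirming that the vpsllw lowering is semantically
equivalent under the __builtin_unreachable precondition. The helper names
lshift_scalar/lshift_avx2 are hypothetical renames of the two lshift
definitions above; compile with -O3 -mavx2.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void lshift_scalar(uint16_t *x, uint8_t amount)
{
    if (amount > 15)
        __builtin_unreachable();
    for (int i = 0; i < 16; i++)
        x[i] <<= amount;
}

static void lshift_avx2(uint16_t *x, uint8_t amount)
{
    __m256i data = _mm256_loadu_si256((__m256i *) x);
    __m128i shift_amount = _mm_cvtsi32_si128(amount);
    __m256i shifted = _mm256_sll_epi16(data, shift_amount);
    _mm256_storeu_si256((__m256i *) x, shifted);
}

int main(void)
{
    for (uint8_t amount = 0; amount <= 15; amount++) {
        uint16_t a[16], b[16];
        for (int i = 0; i < 16; i++)
            a[i] = b[i] = (uint16_t) (0x8001u * (i + 1)); /* arbitrary pattern */
        lshift_scalar(a, amount);
        lshift_avx2(b, amount);
        if (memcmp(a, b, sizeof a) != 0) {
            printf("mismatch at amount %u\n", amount);
            return 1;
        }
    }
    printf("all shift amounts agree\n");
    return 0;
}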