https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119103
Bug ID: 119103
Summary: Very suboptimal AVX2 code generation of simple shift loop
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: gcc at haasn dot dev
Target Milestone: ---
== Summary ==
On x86_64 with -mavx2, GCC has a very hard time optimizing a left shift of
16-bit elements by a small, unknown unsigned amount, even if I add knowledge
that the shift amount is sufficiently small.
In particular, GCC always widens the lanes and shifts with vpslld instead of
using vpsllw directly, and there seems to be no way to convince it otherwise
short of hand-written asm or intrinsics (see the intrinsics version at the
end of this report).
See demonstration here: https://godbolt.org/z/4YobqhsG4
== Code ==
#include <stdint.h>

void lshift(uint16_t *x, uint8_t amount)
{
    if (amount > 15)
        __builtin_unreachable();

    for (int i = 0; i < 16; i++)
        x[i] <<= amount;
}
== Output of `gcc -O3 -mavx2 -ftree-vectorize` ==
lshift:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        movzx   eax, sil
        vmovq   xmm2, rax
        vpmovzxwd       ymm0, xmm1
        vextracti128    xmm1, ymm1, 0x1
        vpmovzxwd       ymm1, xmm1
        vpslld  ymm0, ymm0, xmm2
        vpslld  ymm1, ymm1, xmm2
        vpxor   xmm2, xmm2, xmm2
        vpblendw        ymm0, ymm2, ymm0, 85
        vpblendw        ymm2, ymm2, ymm1, 85
        vpackusdw       ymm0, ymm0, ymm2
        vpermq  ymm0, ymm0, 216
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
Note that the generated code widens each 16-bit lane to 32 bits (vpmovzxwd),
shifts with vpslld, and then blends and packs the results back down, rather
than shifting the 16-bit lanes in place with a single vpsllw.
== Expected result ==
lshift:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        movzx   esi, sil
        vmovd   xmm0, esi
        vpsllw  ymm0, ymm1, xmm0
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
Compiled from:

#include <stdint.h>
#include <immintrin.h>

void lshift(uint16_t *x, uint8_t amount)
{
    __m256i data = _mm256_loadu_si256((__m256i *) x);
    __m128i shift_amount = _mm_cvtsi32_si128(amount);
    __m256i shifted = _mm256_sll_epi16(data, shift_amount);
    _mm256_storeu_si256((__m256i *) x, shifted);
}
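For completeness, a minimal test harness (my own addition; it assumes the
scalar loop above is built as lshift and the intrinsics version is renamed
lshift_intrin so the two can coexist) that checks both agree for every legal
shift amount:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names: lshift() is the scalar loop from above,
   lshift_intrin() the intrinsics version, renamed. */
void lshift(uint16_t *x, uint8_t amount);
void lshift_intrin(uint16_t *x, uint8_t amount);

int main(void)
{
    for (uint8_t amount = 0; amount <= 15; amount++) {
        uint16_t a[16], b[16];
        for (int i = 0; i < 16; i++)
            a[i] = b[i] = (uint16_t)(i * 0x1111 + 7);
        lshift(a, amount);
        lshift_intrin(b, amount);
        if (memcmp(a, b, sizeof a) != 0) {
            printf("mismatch at amount = %u\n", (unsigned)amount);
            return 1;
        }
    }
    puts("ok");
    return 0;
}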