https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119103
Bug ID: 119103
Summary: Very suboptimal AVX2 code generation of simple shift loop
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: gcc at haasn dot dev
Target Milestone: ---
== Summary ==
On x86_64 with -mavx2, GCC has a very hard time optimizing a left shift of
16-bit elements by a small, unknown unsigned amount, even if I add knowledge
that the shift amount is sufficiently small.
In particular, GCC always widens the lanes and shifts with vpslld instead of
using vpsllw directly, and there seems to be no way to convince it otherwise
short of hand-written asm or intrinsics (see the intrinsics version at the
end of this report).
See demonstration here: https://godbolt.org/z/4YobqhsG4
== Code ==
#include <stdint.h>

void lshift(uint16_t *x, uint8_t amount)
{
    if (amount > 15)
        __builtin_unreachable();

    for (int i = 0; i < 16; i++)
        x[i] <<= amount;
}
== Output of `gcc -O3 -mavx2 -ftree-vectorize` ==
lshift:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        movzx   eax, sil
        vmovq   xmm2, rax
        vpmovzxwd       ymm0, xmm1
        vextracti128    xmm1, ymm1, 0x1
        vpmovzxwd       ymm1, xmm1
        vpslld  ymm0, ymm0, xmm2
        vpslld  ymm1, ymm1, xmm2
        vpxor   xmm2, xmm2, xmm2
        vpblendw        ymm0, ymm2, ymm0, 85
        vpblendw        ymm2, ymm2, ymm1, 85
        vpackusdw       ymm0, ymm0, ymm2
        vpermq  ymm0, ymm0, 216
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
Note that the generated code widens each 16-bit lane to 32 bits (vpmovzxwd),
shifts with vpslld, and then blends and packs the results back down, rather
than shifting the 16-bit lanes in place with a single vpsllw.
== Expected result ==
lshift:
        vmovdqu ymm1, YMMWORD PTR [rdi]
        movzx   esi, sil
        vmovd   xmm0, esi
        vpsllw  ymm0, ymm1, xmm0
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret
Compiled from:

#include <stdint.h>
#include <immintrin.h>

void lshift(uint16_t *x, uint8_t amount)
{
    __m256i data = _mm256_loadu_si256((__m256i *) x);
    __m128i shift_amount = _mm_cvtsi32_si128(amount);
    __m256i shifted = _mm256_sll_epi16(data, shift_amount);
    _mm256_storeu_si256((__m256i *) x, shifted);
}
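For completeness, a minimal test harness (my own addition; it assumes the
scalar loop above is built as lshift and the intrinsics version is renamed
lshift_intrin so the two can coexist) that checks both agree for every legal
shift amount:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names: lshift() is the scalar loop from above,
   lshift_intrin() the intrinsics version, renamed. */
void lshift(uint16_t *x, uint8_t amount);
void lshift_intrin(uint16_t *x, uint8_t amount);

int main(void)
{
    for (uint8_t amount = 0; amount <= 15; amount++) {
        uint16_t a[16], b[16];
        for (int i = 0; i < 16; i++)
            a[i] = b[i] = (uint16_t)(i * 0x1111 + 7);
        lshift(a, amount);
        lshift_intrin(b, amount);
        if (memcmp(a, b, sizeof a) != 0) {
            printf("mismatch at amount = %u\n", (unsigned)amount);
            return 1;
        }
    }
    puts("ok");
    return 0;
}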