https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119368

            Bug ID: 119368
           Summary: immintrin code running slower with gcc than clang
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

As mentioned in
https://www.root.cz/clanky/instrukcni-sady-simd-a-automaticke-vektorizace-provadene-prekladacem-gcc/nazory/#newIndex1

the following code runs faster when compiled by clang than by GCC:

#include <immintrin.h>
#include <cstddef>

int product(const char *a, const char *b)
{
    __m512i sum = _mm512_setzero_si512();

    for (size_t i = 0; i < 256; i += 64)
    {
        __m512i la = _mm512_loadu_si512(reinterpret_cast<const __m512i *>(&a[i]));
        __m512i lb = _mm512_loadu_si512(reinterpret_cast<const __m512i *>(&b[i]));

        __m512i a_low = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(la));
        __m512i b_low = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(lb));
        __m512i mul_low = _mm512_madd_epi16(a_low, b_low);

        __m512i a_high = _mm512_cvtepi8_epi16(_mm512_extracti32x8_epi32(la, 1));
        __m512i b_high = _mm512_cvtepi8_epi16(_mm512_extracti32x8_epi32(lb, 1));
        __m512i mul_high = _mm512_madd_epi16(a_high, b_high);

        sum = _mm512_add_epi32(sum, mul_low);
        sum = _mm512_add_epi32(sum, mul_high);
    }

    return _mm512_reduce_add_epi32(sum);
}

https://godbolt.org/z/d4oE11red

The difference comes from clang splitting each 512-bit load into two 256-bit loads that are folded into the vpmovsxbw sign extension, whereas GCC keeps the full 512-bit load followed by an extract.
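For reference, the intrinsic loop computes the signed dot product of two 256-byte buffers (each byte sign-extended before multiplication; _mm512_madd_epi16 sums adjacent 16-bit products into 32-bit lanes). A scalar equivalent, sketched here only to document the intended semantics (product_scalar is a hypothetical name, not part of the reproducer):

```cpp
#include <cstddef>

// Scalar equivalent of the reproducer: signed dot product of two
// 256-byte buffers. char is cast through signed char to match the
// sign extension performed by _mm512_cvtepi8_epi16 regardless of
// the platform's default signedness of char.
int product_scalar(const char *a, const char *b)
{
    int sum = 0;
    for (size_t i = 0; i < 256; ++i)
        sum += static_cast<int>(static_cast<signed char>(a[i]))
             * static_cast<int>(static_cast<signed char>(b[i]));
    return sum;
}
```

This form also makes it easy to check any transformed vector version against the expected result.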
