https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119368
Bug ID: 119368
Summary: immintrin code running slower with gcc than clang
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: hubicka at gcc dot gnu.org
Target Milestone: ---

As mentioned in
https://www.root.cz/clanky/instrukcni-sady-simd-a-automaticke-vektorizace-provadene-prekladacem-gcc/nazory/#newIndex1
the following code runs faster when compiled by clang:

#include <immintrin.h>
#include <cstddef>

int product(const char *a, const char *b) {
  __m512i sum = _mm512_setzero_si512();
  for (size_t i = 0; i < 256; i += 64) {
    __m512i la = _mm512_loadu_si512(reinterpret_cast<const __m512i *>(&a[i]));
    __m512i lb = _mm512_loadu_si512(reinterpret_cast<const __m512i *>(&b[i]));
    __m512i a_low = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(la));
    __m512i b_low = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(lb));
    __m512i mul_low = _mm512_madd_epi16(a_low, b_low);
    __m512i a_high = _mm512_cvtepi8_epi16(_mm512_extracti32x8_epi32(la, 1));
    __m512i b_high = _mm512_cvtepi8_epi16(_mm512_extracti32x8_epi32(lb, 1));
    __m512i mul_high = _mm512_madd_epi16(a_high, b_high);
    sum = _mm512_add_epi32(sum, mul_low);
    sum = _mm512_add_epi32(sum, mul_high);
  }
  return _mm512_reduce_add_epi32(sum);
}

https://godbolt.org/z/d4oE11red

The difference is due to splitting the 512-bit loads into 256-bit loads that fold into vpmovsxbw.
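For reference, the intrinsic loop above computes a dot product of 256 signed bytes. A minimal scalar equivalent is sketched below (the name product_scalar is ours, not part of the report); the explicit signed char cast mirrors the sign extension performed by _mm512_cvtepi8_epi16, since plain char may be unsigned on some targets.

```cpp
#include <cstddef>

// Portable scalar reference for the intrinsic version: sign-extend each
// byte to int, multiply pairwise, and accumulate, exactly what the
// vpmovsxbw + vpmaddwd + vpaddd sequence computes in 64-byte chunks.
int product_scalar(const char *a, const char *b) {
  int sum = 0;
  for (size_t i = 0; i < 256; ++i)
    sum += static_cast<int>(static_cast<signed char>(a[i]))
         * static_cast<int>(static_cast<signed char>(b[i]));
  return sum;
}
```

A scalar reference like this is handy for checking that any codegen variant of the intrinsic loop still returns the same result.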