https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351

            Bug ID: 63351
           Summary: Optimization: contract broadcast intrinsics when
                    AVX512 is enabled
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: agner at agner dot org

The AVX512 instruction set allows many instructions to broadcast a scalar
memory operand to all lanes (embedded broadcast), but there are no
corresponding intrinsic functions. The programmer has to write a broadcast
intrinsic followed by some other intrinsic and rely on the compiler to contract
the pair into a single instruction.

I would expect the optimizer to contract a broadcast intrinsic with any
subsequent intrinsic into a single instruction. For example:

// gcc -Ofast -mavx512f

#include <x86intrin.h>

void dummyz(__m512i a, __m512i b);

void broadcastz(__m512i a, int b) {
    // expect reduction to an instruction with embedded broadcast,
    // something like: vpaddd b{1to16}, %zmm0, %zmm3
    __m512i bb = _mm512_set1_epi32(b);
    __m512i ab = _mm512_add_epi32(a,bb);
    __m512i cc = _mm512_set1_epi32(5);
    __m512i ac = _mm512_add_epi32(a,cc);
    dummyz(ab, ac);
}
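
To make the requested contraction concrete, here is a minimal scalar sketch of
the two forms. It is a plain C model of the lane semantics only, not what GCC
emits and not real intrinsics. All names in it (LANES, add_two_step,
add_fused, forms_agree, self_check) are hypothetical and chosen for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* a 512-bit vector holds 16 lanes of 32 bits */

/* Two-step form: materialize the broadcast vector, then add lane by lane.
   Models _mm512_set1_epi32 followed by _mm512_add_epi32. */
void add_two_step(const int32_t a[LANES], int32_t b, int32_t out[LANES]) {
    int32_t bb[LANES];
    for (int i = 0; i < LANES; i++)
        bb[i] = b; /* broadcast scalar to every lane */
    for (int i = 0; i < LANES; i++)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)bb[i]); /* wrapping add */
}

/* Fused form: the scalar is applied to every lane directly, with no
   temporary broadcast vector. Models an add with a {1to16} operand. */
void add_fused(const int32_t a[LANES], int32_t b, int32_t out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b);
}

/* Returns 1 if both forms produce identical results in every lane. */
int forms_agree(const int32_t a[LANES], int32_t b) {
    int32_t x[LANES], y[LANES];
    add_two_step(a, b, x);
    add_fused(a, b, y);
    for (int i = 0; i < LANES; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}

/* Small self-check over an arbitrary input vector. */
void self_check(void) {
    int32_t a[LANES];
    for (int i = 0; i < LANES; i++)
        a[i] = i * 3 - 7;
    assert(forms_agree(a, 5));
    assert(forms_agree(a, -1));
}
```

Since both forms agree in every lane, replacing the set1/add pair with a
single broadcast-operand instruction would not change the result, assuming
the broadcast value has no other use that requires it in a register.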


This should actually be possible for smaller vector sizes as well when AVX512
is enabled (128-bit and 256-bit forms additionally require AVX512VL):

void dummyx(__m128 a, __m128 b);

void broadcastx(__m128 a, float b) {
    // broadcasting should even be possible with smaller vectors
    __m128 bb = _mm_set1_ps(b);
    __m128 ab = _mm_add_ps(a,bb);
    __m128 cc = _mm_set1_ps(5.0f);
    __m128 ac = _mm_add_ps(a,cc);
    dummyx(ab, ac);
}
