https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63351
            Bug ID: 63351
           Summary: Optimization: contract broadcast intrinsics when AVX512 is enabled
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: agner at agner dot org

The AVX512 instruction set allows instructions with an embedded broadcast of a memory operand, but there are no corresponding intrinsic functions. The programmer has to write a broadcast intrinsic followed by another intrinsic and rely on the compiler to contract the pair into a single instruction. I would expect the optimizer to contract a broadcast intrinsic with any subsequent intrinsic whose instruction supports a broadcast operand into a single instruction. For example:

// gcc -Ofast -mavx512f
#include <x86intrin.h>

void dummyz(__m512i a, __m512i b);

void broadcastz(__m512i a, int b) {
    // expect reduction to an instruction with broadcast,
    // something like: vpaddd b{1to16}, %zmm0, %zmm3
    __m512i bb = _mm512_set1_epi32(b);
    __m512i ab = _mm512_add_epi32(a, bb);
    __m512i cc = _mm512_set1_epi32(5);
    __m512i ac = _mm512_add_epi32(a, cc);
    dummyz(ab, ac);
}

This should also be possible for smaller vector sizes when AVX512 is enabled (the EVEX encodings of 128- and 256-bit instructions, including embedded broadcast, require AVX512VL):

void dummyx(__m128 a, __m128 b);

void broadcastx(__m128 a, float b) {
    // broadcasting should even be possible with smaller vectors
    __m128 bb = _mm_set1_ps(b);
    __m128 ab = _mm_add_ps(a, bb);
    __m128 cc = _mm_set1_ps(5.0f);
    __m128 ac = _mm_add_ps(a, cc);
    dummyx(ab, ac);
}
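The contraction requested here is always value-preserving, which a portable scalar model (plain C, no intrinsics, so it runs on any target) can illustrate. This is a sketch for illustration only, not GCC code: the names set1_then_add, add_embedded_bcst and forms_agree are hypothetical, and the model assumes the usual {1to16} semantics, where one 32-bit element is read from memory and applied to all 16 lanes, with wrap-around addition as vpaddd performs it.

```c
#include <stdint.h>
#include <string.h>

#define LANES 16 /* a __m512i holds 16 x 32-bit elements */

/* Two-step form, as written with intrinsics in broadcastz():
   _mm512_set1_epi32(b) materialises a full temporary vector,
   then _mm512_add_epi32 adds it lane by lane. */
void set1_then_add(uint32_t dst[LANES], const uint32_t a[LANES], uint32_t b)
{
    uint32_t bb[LANES];
    for (int i = 0; i < LANES; i++)
        bb[i] = b;             /* models _mm512_set1_epi32(b) */
    for (int i = 0; i < LANES; i++)
        dst[i] = a[i] + bb[i]; /* models _mm512_add_epi32; unsigned add wraps mod 2^32 */
}

/* One-instruction form: vpaddd with a {1to16} memory operand.
   A single 32-bit element is loaded from memory and reused for every
   lane, so no temporary broadcast vector is needed. */
void add_embedded_bcst(uint32_t dst[LANES], const uint32_t a[LANES],
                       const uint32_t *mem)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = a[i] + *mem;
}

/* Returns 1 if both forms produce identical lanes for the given inputs. */
int forms_agree(const uint32_t a[LANES], uint32_t b)
{
    uint32_t r1[LANES], r2[LANES];
    set1_then_add(r1, a, b);
    add_embedded_bcst(r2, a, &b);
    return memcmp(r1, r2, sizeof r1) == 0;
}
```

Because every lane reads the same scalar, the temporary vector in the two-step form carries no extra information, which is why folding it into the instruction's memory operand is safe for any value of b.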