https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

Bug ID: 63791
Summary: use 32-byte version of vpbroadcastb on AVX2 platform
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: marcus.kool at urlfilterdb dot com

Created attachment 33926
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33926&action=edit
code with _mm256_set1_epi8, _mm256_loadu_si256, _mm256_cmpeq_epi8, _mm256_movemask_epi8

With gcc 4.9.2 and compile options
    -std=c99 -mavx2 -mbmi -mbmi2 -O3 -fno-tree-vectorize
on an Intel Haswell CPU, the intrinsic function _mm256_set1_epi8() generates 3 instructions while it could do better with only 2 instructions.

Generated code is either
    vmovd        reg, xmmreg
    vpbroadcastb xmmreg, xmmreg
    vinserti128  $1, xmmreg, ymmreg, ymmreg
or
    vmovd        reg, xmmreg
    vpbroadcastb xmmreg, xmmreg
    vperm2i128   $0, ymmreg, ymmreg, ymmreg

But it could generate faster code instead:
    vmovd        reg, xmmreg
    vpbroadcastb xmmreg, ymmreg

Example C source is in the attachment.
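
The attachment is not reproduced here. As a rough illustration only, a function combining the four intrinsics named in the attachment description might look like the sketch below. The function names (match_mask32 and its helpers) are hypothetical and not taken from the attachment. The target("avx2") attribute, the runtime check, and the scalar fallback are assumptions added so the sketch builds and runs without -mavx2; the actual test case may be structured quite differently.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (not from the attachment): compares 32 bytes at buf
 * against c and returns a mask with bit i set when buf[i] == c.
 * The broadcast of c via _mm256_set1_epi8() is the operation the report
 * is about. */
__attribute__((target("avx2")))
static unsigned match_mask32_avx2(const unsigned char *buf, char c)
{
    __m256i needle = _mm256_set1_epi8(c);                      /* broadcast byte */
    __m256i chunk  = _mm256_loadu_si256((const __m256i *)buf); /* unaligned load */
    __m256i eq     = _mm256_cmpeq_epi8(chunk, needle);         /* 0xFF where equal */
    return (unsigned)_mm256_movemask_epi8(eq);                 /* one bit per byte */
}

/* Scalar reference with the same result, used when AVX2 is unavailable. */
static unsigned match_mask32_scalar(const unsigned char *buf, char c)
{
    unsigned mask = 0;
    for (int i = 0; i < 32; i++) {
        if (buf[i] == (unsigned char)c)
            mask |= 1u << i;
    }
    return mask;
}

/* Dispatches at run time; buf must point to at least 32 readable bytes. */
unsigned match_mask32(const unsigned char *buf, char c)
{
    assert(buf != NULL);
    if (__builtin_cpu_supports("avx2"))
        return match_mask32_avx2(buf, c);
    return match_mask32_scalar(buf, c);
}
```

Compiling such a file with the options listed above and -S, then looking at the instructions emitted for match_mask32_avx2, should show how the broadcast is lowered by a given compiler version.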