https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

            Bug ID: 63791
           Summary: use 32-byte version of vpbroadcastb on AVX2 platform
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcus.kool at urlfilterdb dot com

Created attachment 33926
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33926&action=edit
code with _mm256_set1_epi8, _mm256_loadu_si256, _mm256_cmpeq_epi8,
_mm256_movemask_epi8

With gcc 4.9.2 and the compile options
-std=c99 -mavx2 -mbmi -mbmi2 -O3 -fno-tree-vectorize
on an Intel Haswell CPU, the intrinsic function _mm256_set1_epi8() compiles to
3 instructions where 2 would suffice.

Generated code is either
   vmovd         reg, xmmreg
   vpbroadcastb  xmmreg, xmmreg
   vinserti128   $1, xmmreg, ymmreg, ymmreg
or
   vmovd         reg, xmmreg
   vpbroadcastb  xmmreg, xmmreg
   vperm2i128    $0, ymmreg, ymmreg, ymmreg

But it could instead generate this shorter, faster sequence, using the
32-byte (ymm destination) form of vpbroadcastb:
   vmovd         reg, xmmreg
   vpbroadcastb  xmmreg, ymmreg
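
For illustration only, the two-instruction sequence above can also be requested
explicitly through intrinsics. The sketch below is not taken from the attachment;
the function names broadcast_byte32 and broadcast_byte32_avx2 are hypothetical,
and the target attribute plus the __builtin_cpu_supports() fallback are
assumptions added so the sketch builds and runs without -mavx2. Whether a given
gcc version actually emits exactly vmovd + vpbroadcastb for it should be checked
with -S.

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper: fill 32 bytes at out with the byte c using AVX2. */
__attribute__((target("avx2")))
static void broadcast_byte32_avx2(unsigned char *out, unsigned char c)
{
    __m128i lo = _mm_cvtsi32_si128(c);        /* intended: vmovd reg, xmmreg */
    __m256i v  = _mm256_broadcastb_epi8(lo);  /* intended: vpbroadcastb xmmreg, ymmreg */
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Dispatch wrapper (an assumption of this sketch): falls back to memset
   when the CPU running the code does not support AVX2. */
void broadcast_byte32(unsigned char *out, unsigned char c)
{
    if (__builtin_cpu_supports("avx2"))
        broadcast_byte32_avx2(out, c);
    else
        memset(out, c, 32);
}
```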

Example C source is in the attachment.
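
The attachment itself is not reproduced here. As a rough idea of the kind of
code involved, the following is a minimal sketch that combines the four
intrinsics named in the attachment description to search one 32-byte block for
a byte. The function names (find_byte32 and its two helpers) are hypothetical,
and the target attribute and runtime dispatch are assumptions added so the
sketch also builds without -mavx2.

```c
#include <immintrin.h>

/* Hypothetical helper: index of the first byte equal to c in the 32 bytes
   at p, or -1 if there is none. AVX2 path. */
__attribute__((target("avx2")))
static int find_byte32_avx2(const unsigned char *p, unsigned char c)
{
    __m256i needle = _mm256_set1_epi8((char)c);  /* the broadcast discussed in this report */
    __m256i block  = _mm256_loadu_si256((const __m256i *)p);
    __m256i eq     = _mm256_cmpeq_epi8(block, needle);
    unsigned mask  = (unsigned)_mm256_movemask_epi8(eq);  /* one bit per byte */
    return mask ? __builtin_ctz(mask) : -1;
}

/* Scalar fallback with the same contract. */
static int find_byte32_scalar(const unsigned char *p, unsigned char c)
{
    int i;
    for (i = 0; i < 32; i++)
        if (p[i] == c)
            return i;
    return -1;
}

/* Dispatch wrapper (an assumption of this sketch, not part of the report). */
int find_byte32(const unsigned char *p, unsigned char c)
{
    if (__builtin_cpu_supports("avx2"))
        return find_byte32_avx2(p, c);
    return find_byte32_scalar(p, c);
}
```

Compiling such a file with the options listed above plus -S and looking at the
code generated for _mm256_set1_epi8() should show which of the instruction
sequences a particular compiler version produces.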
