https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66369
Bug ID: 66369
Summary: gcc 4.8.3/5.1.0 miss optimisation with vpmovmskb
Product: gcc
Version: 4.8.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: marcus.kool at urlfilterdb dot com
Target Milestone: ---

Created attachment 35672
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35672&action=edit
example C code to demonstrate the missed optimisation in gcc 4.8.3 and 5.1.0

When using _mm256_movemask_epi8() I cannot find a way to get gcc to produce
vpmovmskb YMM,R64 instead of vpmovmskb YMM,R32.

When the result of the vpmovmskb is not stored in R64, unnecessary
sign-extension instructions (cltq, movl or movslq) are generated later.

With the result in R32 and indexing an array of structs, gcc generates for

    node = node->children[ __builtin_ctzl(result-of-vpmovmskb) ];

the following:

    vpmovmskb YMM,R32
    movslq    R32,R64
    tzcntq    R64,R64
    movq      offset(%rdi,R64,8),%rdi

instead of the more efficient:

    vpmovmskb YMM,R64
    tzcntq    R64,R64
    movq      offset(%rdi,R64,8),%rdi

The attached avx2.c contains C source code that demonstrates the above.
avx2.c was compiled with gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9) and the
flags -std=c99 -march=core-avx2 -mtune=core-avx2 -O3.

gcc 5.1.0 has the same behaviour.
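
For reference, a minimal sketch of the kind of code that triggers the extra
sign-extension (this is not the attached avx2.c; the struct layout and the
key-matching logic here are assumptions for illustration only), compiled with
-std=c99 -march=core-avx2 -O3:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    struct node {
        struct node *children[32];
        uint8_t      keys[32];
    };

    struct node *child_for_key(const struct node *n, uint8_t key)
    {
        __m256i needle = _mm256_set1_epi8((char) key);
        __m256i keys   = _mm256_loadu_si256((const __m256i *) n->keys);
        __m256i eq     = _mm256_cmpeq_epi8(keys, needle);

        /* vpmovmskb writes the 32-bit mask into an R32 register. */
        int mask = _mm256_movemask_epi8(eq);

        if (mask == 0)
            return NULL;

        /* __builtin_ctzl() takes an unsigned long, so the 32-bit mask is
           widened (movslq/cltq) before tzcntq and the indexed movq. */
        return n->children[__builtin_ctzl((unsigned long) mask)];
    }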