https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66369
Bug ID: 66369
Summary: gcc 4.8.3/5.1.0 miss optimisation with vpmovmskb
Product: gcc
Version: 4.8.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: marcus.kool at urlfilterdb dot com
Target Milestone: ---

Created attachment 35672
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35672&action=edit
example C code to demonstrate the missed optimisation in gcc 4.8.3 and 5.1.0

When using _mm256_movemask_epi8() I cannot find a way to get gcc to produce
vpmovmskb YMM,R64 instead of vpmovmskb YMM,R32.

When the result of the vpmovmskb is not stored in R64, unnecessary
sign-extension instructions (cltq, movl or movslq) are generated later.

With the result in R32 and indexing an array of structs, gcc generates for

    node = node->children[ __builtin_ctzl(result-of-vpmovmskb) ];

the following:

    vpmovmskb YMM,R32
    movslq    R32,R64
    tzcntq    R64,R64
    movq      offset(%rdi,R64,8),%rdi

instead of the more efficient:

    vpmovmskb YMM,R64
    tzcntq    R64,R64
    movq      offset(%rdi,R64,8),%rdi

The attached avx2.c contains C source code that demonstrates the above.
avx2.c was compiled with gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9) and the
flags -std=c99 -march=core-avx2 -mtune=core-avx2 -O3.

gcc 5.1.0 has the same behaviour.
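
For reference, a minimal sketch of the kind of code that triggers the extra
sign-extension (this is not the attached avx2.c; the struct layout and the
key-matching logic here are assumptions for illustration only), compiled with
-std=c99 -march=core-avx2 -O3:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    struct node {
        struct node *children[32];
        uint8_t      keys[32];
    };

    struct node *child_for_key(const struct node *n, uint8_t key)
    {
        __m256i needle = _mm256_set1_epi8((char) key);
        __m256i keys   = _mm256_loadu_si256((const __m256i *) n->keys);
        __m256i eq     = _mm256_cmpeq_epi8(keys, needle);

        /* vpmovmskb writes the 32-bit mask into an R32 register. */
        int mask = _mm256_movemask_epi8(eq);

        if (mask == 0)
            return NULL;

        /* __builtin_ctzl() takes an unsigned long, so the 32-bit mask is
           widened (movslq/cltq) before tzcntq and the indexed movq. */
        return n->children[__builtin_ctzl((unsigned long) mask)];
    }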