http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52568
           Bug #: 52568
         Summary: suboptimal __builtin_shuffle on cycles with AVX
  Classification: Unclassified
         Product: gcc
         Version: 4.7.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: target
      AssignedTo: unassig...@gcc.gnu.org
      ReportedBy: marc.gli...@normalesup.org


Hello,

I compiled the following with -O3 (or -Os) and -mavx:

#include <x86intrin.h>
__m256d left(__m256d x){
  __m256i mask={1,2,3,0};
  return __builtin_shuffle(x,mask);
}

(by the way, for some reason, gcc insists that 'mask' is set but not used
with -Wall)

and got:

        vunpckhpd       %xmm0, %xmm0, %xmm3
        vmovapd         %xmm0, %xmm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovaps         %xmm0, %xmm2
        vunpckhpd       %xmm0, %xmm0, %xmm0
        vunpcklpd       %xmm1, %xmm0, %xmm1
        vunpcklpd       %xmm2, %xmm3, %xmm0
        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        ret

That doesn't really match the code I currently use to do this:

#ifdef __AVX2__
  __m256d d=_mm256_permute4x64_pd(x,1+2*4+3*16+0*64);
#else
  __m256d b=_mm256_shuffle_pd(x,x,5);
  __m256d c=_mm256_permute2f128_pd(b,b,1);
  __m256d d=_mm256_blend_pd(b,c,10);
#endif

Could something recognizing this permutation pattern (and the right cyclic
shift) be added? I know there are too many shuffles to hand-code them all,
but cycles seem like they shouldn't be too uncommon. With -mavx2, I get a
single vpermq, which is close enough to the expected vpermpd.
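
For completeness, the right cyclic shift mentioned above would use the mask
{3,0,1,2}, and it can presumably be expanded the same way. A sketch, written
by analogy with the left-rotation sequence above (the function name 'right'
and the permute/blend immediates are my own choices, not taken from the
report):

/* right rotation: result = {x3, x0, x1, x2} */
__m256d right(__m256d x){
#ifdef __AVX2__
  /* imm = 3 + 0*4 + 1*16 + 2*64 selects lanes 3,0,1,2 */
  return _mm256_permute4x64_pd(x,3+0*4+1*16+2*64);
#else
  __m256d b=_mm256_shuffle_pd(x,x,5);       /* {x1,x0,x3,x2} */
  __m256d c=_mm256_permute2f128_pd(b,b,1);  /* {x3,x2,x1,x0} */
  return _mm256_blend_pd(b,c,5);            /* {x3,x0,x1,x2} */
#endif
}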