https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93395
            Bug ID: 93395
           Summary: AVX2 missed optimization: _mm256_permute_pd() is
                    unfortunately translated into the more expensive VPERMPD
                    instead of the cheap VPERMILPD
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

According to Agner Fog's instruction timing tables and to my own measurements,
VPERMPD has a 3-cycle latency, while VPERMILPD has a 1-cycle latency on most
CPUs. Yet the intrinsic _mm256_permute_pd(), which maps directly to the
VPERMILPD instruction, is always translated into VPERMPD. This makes the code
SLOWER. It should be the opposite: the _mm256_permute4x64_pd() intrinsic,
which maps to the VPERMPD instruction, should, when possible, be translated
into VPERMILPD. Note that clang does the right thing here.

The same problem arises for AVX-512.

See the assembly generated here: https://godbolt.org/z/VZe8qk

I replicate the code here for completeness:

#include <immintrin.h>

// translated into "vpermpd ymm0, ymm0, 177",
// which is OK, but "vpermilpd ymm0, ymm0, 5" does the same thing faster.
__m256d perm_missed_optimization(__m256d a) {
    return _mm256_permute4x64_pd(a, 0xB1);
}

// translated into "vpermpd ymm0, ymm0, 177",
// which is 3 times slower than the original intent of "vpermilpd ymm0, ymm0, 5".
__m256d perm_pessimization(__m256d a) {
    return _mm256_permute_pd(a, 0x5);
}

// adequately translated into "vshufpd ymm0, ymm0, ymm0, 5",
// which does the same as "vpermilpd ymm0, ymm0, 5" at the same speed.
__m256d perm_workaround(__m256d a) {
    return _mm256_shuffle_pd(a, a, 5);
}
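
For reference, a minimal AVX-512 analogue of perm_pessimization() (my own
sketch, not taken from the report; the function name and the -mavx512f build
flag are assumptions) would be:

#include <immintrin.h>

// _mm512_permute_pd() maps to vpermilpd zmm; the concern in the report is
// that it may likewise be rewritten into a higher-latency lane-crossing
// permute.
__m512d perm512_pessimization(__m512d a) {
    // imm8 0x55 swaps the two doubles within each 128-bit lane
    return _mm512_permute_pd(a, 0x55);
}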
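
A rough latency micro-benchmark sketch along the lines of the measurements
mentioned above (not part of the original report; the iteration count, the use
of __rdtsc, and the build flags are my assumptions). Each iteration feeds the
previous result, so the time per iteration approximates the instruction
latency. __rdtsc() counts reference cycles, which can differ from core clock
cycles under frequency scaling, so the numbers are only indicative. Build with
e.g. gcc -O2 -mavx2.

#include <immintrin.h>
#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    // runtime-dependent trip count so the chain cannot be folded away
    long iters = (argc > 1) ? atol(argv[1]) : 100000000L;
    __m256d v = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; ++i)
        v = _mm256_permute4x64_pd(v, 0xB1);   // compiled to vpermpd
    unsigned long long t1 = __rdtsc();

    // print the result so the dependency chain stays live
    printf("vpermpd chain: ~%.2f reference cycles per permute (result %g)\n",
           (double)(t1 - t0) / iters, _mm256_cvtsd_f64(v));
    return 0;
}

Swapping the loop body for _mm256_shuffle_pd(v, v, 5) (the workaround above)
should show the latency difference being reported.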