https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93395
Bug ID: 93395
Summary: AVX2 missed optimization : _mm256_permute_pd() is
unfortunately translated into the more expensive
VPERMPD instead of the cheap VPERMILPD
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: nathanael.schaeffer at gmail dot com
Target Milestone: ---
According to Agner Fog's instruction timing tables
and to my own measurements, VPERMPD has a 3-cycle latency, while VPERMILPD has
a 1-cycle latency on most CPUs.
Yet the intrinsic _mm256_permute_pd() is always translated into VPERMPD, even
though this intrinsic maps directly to the VPERMILPD instruction.
This makes the code SLOWER.
It should be the opposite: the _mm256_permute4x64_pd() intrinsic, which maps to
the VPERMPD instruction, should be translated into VPERMILPD whenever the
requested permutation stays within 128-bit lanes.
Note that clang does the right thing here.
The same problem arises for AVX-512; a sketch of an analogous AVX-512 test case
is given after the AVX2 functions below.
See assembly generated here: https://godbolt.org/z/VZe8qk
I replicate the code here for completeness:
#include <immintrin.h>
// translated into "vpermpd ymm0, ymm0, 177"
// which is OK, but "vpermilpd ymm0, ymm0, 5" does the same thing faster.
__m256d perm_missed_optimization(__m256d a) {
return _mm256_permute4x64_pd(a,0xB1);
}
// translated into "vpermpd ymm0, ymm0, 177"
// which is 3 times slower than the original intent of "vpermilpd ymm0, ymm0, 5"
__m256d perm_pessimization(__m256d a) {
return _mm256_permute_pd(a,0x5);
}
// adequately translated into "vshufpd ymm0, ymm0, ymm0, 5"
// which does the same as "vpermilpd ymm0, ymm0, 5" at the same speed.
__m256d perm_workaround(__m256d a) {
return _mm256_shuffle_pd(a, a, 5);
}
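For the AVX-512 case mentioned above, a minimal analogous test case might look
like the following (the function name is illustrative and this is only a
sketch; the assembly GCC actually emits should be checked, e.g. on godbolt):
// uses the same #include <immintrin.h> as above; requires -mavx512f
// _mm512_permute_pd() maps directly to the in-lane "vpermilpd zmm, zmm, imm",
// so it should ideally stay a VPERMILPD rather than be replaced by a more
// expensive cross-lane permute.
__m512d perm512_pessimization(__m512d a) {
return _mm512_permute_pd(a, 0x55);  // swap the two doubles in each 128-bit lane
}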