https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93395

            Bug ID: 93395
           Summary: AVX2 missed optimization : _mm256_permute_pd() is
                    unfortunately translated into the more expensive
                    VPERMPD instead of the cheap VPERMILPD
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

According to Agner Fog's instruction timing tables and to my own measurements,
VPERMPD has a 3-cycle latency, while VPERMILPD has a 1-cycle latency on most CPUs.

Yet the intrinsic _mm256_permute_pd() is always translated into VPERMPD, even
though this intrinsic maps directly to the VPERMILPD instruction.
This makes the code SLOWER.

It should be the opposite: the _mm256_permute4x64_pd() intrinsic, which maps to
the VPERMPD instruction, should, when possible, be translated into VPERMILPD.

Note that clang does the right thing here.

The same problem arises for AVX-512.

See assembly generated here: https://godbolt.org/z/VZe8qk

I replicate the code here for completeness:

#include <immintrin.h>


// translated into   "vpermpd ymm0, ymm0, 177"
// which is OK, but  "vpermilpd ymm0, ymm0, 5"   does the same thing faster.
__m256d perm_missed_optimization(__m256d a) {
    return _mm256_permute4x64_pd(a,0xB1);
}

// translated into   "vpermpd ymm0, ymm0, 177"
// which is 3 times slower than the original intent of   "vpermilpd ymm0, ymm0, 5"
__m256d perm_pessimization(__m256d a) {
    return _mm256_permute_pd(a,0x5);
}

// adequately translated into  "vshufpd ymm0, ymm0, ymm0, 5"
// which does the same as   "vpermilpd ymm0, ymm0, 5"   at the same speed.
__m256d perm_workaround(__m256d a) {
    return _mm256_shuffle_pd(a, a, 5);
}
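
To illustrate the AVX-512 remark above, here is a minimal sketch (not part of the
Godbolt link, function names are illustrative only, compile with -mavx512f): the
analogous 512-bit intrinsics are _mm512_permute_pd(), which maps to VPERMILPD,
and _mm512_permutex_pd(), which maps to VPERMPD. For a permutation that stays
inside each 128-bit lane, the cheaper VPERMILPD form should again be preferred.

#include <immintrin.h>

// Sketch: _mm512_permute_pd() maps to the in-lane VPERMILPD instruction;
// imm 0x55 swaps the two doubles within every 128-bit lane.
__m512d perm512_inlane(__m512d a) {
    return _mm512_permute_pd(a, 0x55);
}

// Sketch: _mm512_permutex_pd() maps to VPERMPD; imm 0xB1 requests the same
// in-lane swap (per 256-bit half), so lowering it to VPERMILPD would be cheaper.
__m512d perm512_crosslane(__m512d a) {
    return _mm512_permutex_pd(a, 0xB1);
}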
