https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121296

            Bug ID: 121296
           Summary: Conversion from float vector to short/int8 vector not
                    optimized (AVX512, AVX2, and SSE2)
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mkretz at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case. Compile with '-O2' and optionally with -march flags for AVX2 or
AVX512 (https://compiler-explorer.com/z/vWGKP9dv3).

Each extractN function converts the N-th part of an int8/int16 vector to a
float vector. The cvt1/cvt2 functions are equivalent and convert 2/4 float
vectors back to one int16/int8 vector. The testcase can/should also be adapted
to uint8/uint16.

These conversions are common in signal processing code (a device
produces/consumes int8/int16 measurements, processing happens in float; e.g.
audio devices). They are also useful for implementing int8/int16 division with
divps.
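
To illustrate the division use case, here is a minimal sketch that is not part
of the test case. The names f32x4, i8x4 and div_via_float are hypothetical, and
a fixed 4-lane width is assumed for brevity. Float has a 24-bit significand, so
the truncated quotient should match integer division for int8/int16 inputs. The
caller is assumed to rule out division by zero and -128 / -1.

```cpp
// Hypothetical 4-lane types, chosen only for this illustration.
using f32x4 [[gnu::vector_size(16)]] = float;
using i8x4 [[gnu::vector_size(4)]] = signed char;

// Divide int8 lanes by widening to float, dividing, and truncating back.
// Assumption: no lane of b is 0 and no lane computes -128 / -1, since the
// float->int8 conversion is undefined when the result does not fit.
inline i8x4 div_via_float(i8x4 a, i8x4 b) {
  f32x4 q = __builtin_convertvector(a, f32x4) /
            __builtin_convertvector(b, f32x4);
  return __builtin_convertvector(q, i8x4);  // truncates toward zero
}
```

With wider vectors this is the pattern that needs both the extractN and the
cvt1/cvt2 style conversions from the test case.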


#ifdef __AVX512F__
#define SEQ_0 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
#define SEQ_1 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
#define SEQ_2 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
#define SEQ_3 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
#elif defined __AVX2__
#define SEQ_0 0, 1, 2, 3, 4, 5, 6, 7
#define SEQ_1 8, 9, 10, 11, 12, 13, 14, 15
#define SEQ_2 16, 17, 18, 19, 20, 21, 22, 23
#define SEQ_3 24, 25, 26, 27, 28, 29, 30, 31
#else
#define SEQ_0 0, 1, 2, 3
#define SEQ_1 4, 5, 6, 7
#define SEQ_2 8, 9, 10, 11
#define SEQ_3 12, 13, 14, 15
#endif

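// Comma operator: (SEQ_0) evaluates to the last index in SEQ_0, so N is the
// number of float elements per vector.
constexpr int N = (SEQ_0) + 1;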

#define SEQ_01 SEQ_0, SEQ_1
#define SEQ SEQ_0, SEQ_1, SEQ_2, SEQ_3

using floatv [[gnu::vector_size(N * 4)]] = float;
using shortv [[gnu::vector_size(sizeof(floatv))]] = short;

floatv extract0(shortv x) {
  return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_0), floatv);
}

floatv extract1(shortv x) {
  return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_1), floatv);
}

auto cvt1(floatv x, floatv y) {
  return __builtin_convertvector(__builtin_shufflevector(x, y, SEQ_01),
                                 shortv);
}

auto cvt2(floatv x, floatv y) {
  using V [[gnu::vector_size(sizeof(x) / 2)]] = short;
  return __builtin_shufflevector(__builtin_convertvector(x, V),
                                 __builtin_convertvector(y, V), SEQ_01);
}

// ------------------------

using int8v [[gnu::vector_size(sizeof(floatv))]] = signed char;

floatv extract0(int8v x) {
  return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_0), floatv);
}

floatv extract1(int8v x) {
  return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_1), floatv);
}

floatv extract2(int8v x) {
  return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_2), floatv);
}

floatv extract3(int8v x) {
  return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_3), floatv);
}

auto cvt1(floatv x, floatv y, floatv z, floatv a) {
  return __builtin_convertvector(
      __builtin_shufflevector(__builtin_shufflevector(x, y, SEQ_01),
                              __builtin_shufflevector(z, a, SEQ_01), SEQ),
      int8v);
}

auto cvt2(floatv x, floatv y, floatv z, floatv a) {
  using V [[gnu::vector_size(sizeof(x) / 4)]] = signed char;
  return __builtin_shufflevector(
      __builtin_shufflevector(__builtin_convertvector(x, V),
                              __builtin_convertvector(y, V), SEQ_01),
      __builtin_shufflevector(__builtin_convertvector(z, V),
                              __builtin_convertvector(a, V), SEQ_01),
      SEQ);
}


The extractN functions are fine for AVX2 and AVX512. For SSE2 the sequence for
int16 is okay, but Clang 20 produces a slightly nicer instruction sequence.
For int8 the extractN functions are not optimized.

The cvt1/cvt2 functions are not optimized in any case. The instruction
sequences Clang 20 emits look optimal to me. Note that
https://eel.is/c++draft/conv#fpint-1 grants the freedom to use saturating
conversion from int32 to int16/int8 here.
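
As a rough scalar model of that freedom (a sketch, not part of the test case;
cvt_lane_saturating is a hypothetical name), one lane of cvttps2dq followed by
packssdw can be thought of as below. For inputs that fit in int16 it agrees
with a plain static_cast<short>. For inputs that do not fit, the plain cast is
undefined, so the saturated result is an acceptable outcome.

```cpp
#include <cstdint>

// Scalar model of one lane: truncate to int32 (cvttps2dq), then saturate to
// int16 (packssdw). Assumption: x is finite and fits in int32; the hardware
// behavior outside that range is not modeled here.
inline std::int16_t cvt_lane_saturating(float x) {
  std::int32_t wide = static_cast<std::int32_t>(x);  // truncates toward zero
  if (wide > INT16_MAX) wide = INT16_MAX;            // signed saturation
  if (wide < INT16_MIN) wide = INT16_MIN;
  return static_cast<std::int16_t>(wide);
}
```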

Expected instruction sequences:

float->int16 with SSE2:
        cvttps2dq       xmm1, xmm1
        cvttps2dq       xmm0, xmm0
        packssdw        xmm0, xmm1
        ret

float->int8 with SSE2:
        cvttps2dq       xmm3, xmm3
        cvttps2dq       xmm2, xmm2
        packssdw        xmm2, xmm3
        cvttps2dq       xmm1, xmm1
        cvttps2dq       xmm0, xmm0
        packssdw        xmm0, xmm1
        packsswb        xmm0, xmm2
        ret

float->int16 with AVX2:
        vcvttps2dq      ymm1, ymm1
        vcvttps2dq      ymm0, ymm0
        vpackssdw       ymm0, ymm0, ymm1
        vpermq  ymm0, ymm0, 216
        ret
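
A note on the vpermq immediate in the AVX2 listing above: vpackssdw packs
within each 128-bit lane, so the four 64-bit quarters of the result come out as
x[0..3], y[0..3], x[4..7], y[4..7], and the permute restores element order. The
sketch below (decode_permq is a hypothetical helper, not part of the test case)
decodes such an immediate into its source indices.

```cpp
#include <array>

// Decode a vpermq-style 8-bit immediate: destination qword i is taken from
// source qword (imm >> (2 * i)) & 3.
constexpr std::array<int, 4> decode_permq(unsigned imm) {
  return {{static_cast<int>(imm & 3), static_cast<int>((imm >> 2) & 3),
           static_cast<int>((imm >> 4) & 3), static_cast<int>((imm >> 6) & 3)}};
}
```

For 216 (binary 11011000) this gives the order 0, 2, 1, 3, i.e. the two middle
quarters are swapped.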

float->int8 with AVX2:
        vcvttps2dq      ymm3, ymm3
        vextracti128    xmm4, ymm3, 1
        vpackssdw       xmm3, xmm3, xmm4
        vcvttps2dq      ymm1, ymm1
        vextracti128    xmm4, ymm1, 1
        vpackssdw       xmm1, xmm1, xmm4
        vinserti128     ymm1, ymm1, xmm3, 1
        vcvttps2dq      ymm2, ymm2
        vextracti128    xmm3, ymm2, 1
        vpackssdw       xmm2, xmm2, xmm3
        vcvttps2dq      ymm0, ymm0
        vextracti128    xmm3, ymm0, 1
        vpackssdw       xmm0, xmm0, xmm3
        vinserti128     ymm0, ymm0, xmm2, 1
        vpacksswb       ymm0, ymm0, ymm1
        ret

float->int16 with AVX512:
        vcvttps2dq      zmm0, zmm0
        vcvttps2dq      zmm1, zmm1
        vpmovdw ymm0, zmm0
        vpmovdw ymm1, zmm1
        vinserti64x4    zmm0, zmm0, ymm1, 1
        ret

float->int8 with AVX512:
        vcvttps2dq      zmm2, zmm2
        vcvttps2dq      zmm3, zmm3
        vcvttps2dq      zmm0, zmm0
        vcvttps2dq      zmm1, zmm1
        vpmovdb xmm2, zmm2
        vpmovdb xmm3, zmm3
        vpmovdb xmm0, zmm0
        vpmovdb xmm1, zmm1
        vinserti128     ymm2, ymm2, xmm3, 1
        vinserti128     ymm0, ymm0, xmm1, 1
        vinserti64x4    zmm0, zmm0, ymm2, 1
        ret
