https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121296
            Bug ID: 121296
           Summary: Conversion from float vector to short/int8 vector not
                    optimized (AVX512, AVX2, and SSE2)
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mkretz at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case. Compile with '-O2' and optionally with -march flags for AVX2 or
AVX512 (https://compiler-explorer.com/z/vWGKP9dv3). The extractN functions
convert an int8/int16 vector to float vectors. The cvt1/cvt2 functions are
equivalent and convert 2/4 float vectors back to one int8/int16 vector. The
testcase can/should also be adapted to uint8/uint16. These conversions are
common in signal processing code (a device produces/consumes int8/int16
measurements, processing happens in float; e.g. audio devices). They are also
useful for implementing int8/int16 division via divps.

#ifdef __AVX512F__
#define SEQ_0 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
#define SEQ_1 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
#define SEQ_2 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
#define SEQ_3 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
#elif defined __AVX2__
#define SEQ_0 0, 1, 2, 3, 4, 5, 6, 7
#define SEQ_1 8, 9, 10, 11, 12, 13, 14, 15
#define SEQ_2 16, 17, 18, 19, 20, 21, 22, 23
#define SEQ_3 24, 25, 26, 27, 28, 29, 30, 31
#else
#define SEQ_0 0, 1, 2, 3
#define SEQ_1 4, 5, 6, 7
#define SEQ_2 8, 9, 10, 11
#define SEQ_3 12, 13, 14, 15
#endif

// comma operator: N = last index in SEQ_0 + 1 = number of float lanes
constexpr int N = (SEQ_0) + 1;

#define SEQ_01 SEQ_0, SEQ_1
#define SEQ SEQ_0, SEQ_1, SEQ_2, SEQ_3

using floatv [[gnu::vector_size(N * 4)]] = float;
using shortv [[gnu::vector_size(sizeof(floatv))]] = short;

floatv extract0(shortv x)
{ return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_0), floatv); }

floatv extract1(shortv x)
{ return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_1), floatv); }

auto cvt1(floatv x, floatv y)
{ return __builtin_convertvector(__builtin_shufflevector(x, y, SEQ_01), shortv); }

auto cvt2(floatv x, floatv y)
{
  using V [[gnu::vector_size(sizeof(x) / 2)]] = short;
  return __builtin_shufflevector(__builtin_convertvector(x, V),
                                 __builtin_convertvector(y, V), SEQ_01);
}

// ------------------------

using int8v [[gnu::vector_size(sizeof(floatv))]] = signed char;

floatv extract0(int8v x)
{ return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_0), floatv); }

floatv extract1(int8v x)
{ return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_1), floatv); }

floatv extract2(int8v x)
{ return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_2), floatv); }

floatv extract3(int8v x)
{ return __builtin_convertvector(__builtin_shufflevector(x, x, SEQ_3), floatv); }

auto cvt1(floatv x, floatv y, floatv z, floatv a)
{
  return __builtin_convertvector(
           __builtin_shufflevector(__builtin_shufflevector(x, y, SEQ_01),
                                   __builtin_shufflevector(z, a, SEQ_01), SEQ),
           int8v);
}

auto cvt2(floatv x, floatv y, floatv z, floatv a)
{
  using V [[gnu::vector_size(sizeof(x) / 4)]] = signed char;
  return __builtin_shufflevector(
           __builtin_shufflevector(__builtin_convertvector(x, V),
                                   __builtin_convertvector(y, V), SEQ_01),
           __builtin_shufflevector(__builtin_convertvector(z, V),
                                   __builtin_convertvector(a, V), SEQ_01), SEQ);
}

The extractN functions are fine for AVX2 and AVX512. For SSE2 the sequence for
int16 is okay, but Clang 20 produces a slightly nicer instruction sequence.
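For reference, one possible hand-written SSE2 lowering of the int16 extract0
case can be sketched with intrinsics (the function name is made up and this is
not necessarily what either compiler emits): sign-extend the low four int16
lanes with the punpcklwd + psrad idiom, then convert with cvtdq2ps.

#include <emmintrin.h>

// Sketch of extract0(shortv) for the SSE2 case (N == 4); hypothetical name.
// Sign-extend the low four int16 lanes to int32, then convert to float.
__m128 extract0_sse2_sketch(__m128i x)
{
  __m128i lo32 = _mm_srai_epi32(_mm_unpacklo_epi16(x, x), 16); // punpcklwd + psrad
  return _mm_cvtepi32_ps(lo32);                                // cvtdq2ps
}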
For int8 the extractN functions are not optimized. The cvt1/cvt2 functions are
not optimized in any case. The instruction sequences Clang 20 emits look
optimal to me. Note that https://eel.is/c++draft/conv#fpint-1 grants the
freedom to use a saturating conversion from int32 to int16/int8 here.

Expected instruction sequences:

float->int16 with SSE2:
        cvttps2dq xmm1, xmm1
        cvttps2dq xmm0, xmm0
        packssdw xmm0, xmm1
        ret

float->int8 with SSE2:
        cvttps2dq xmm3, xmm3
        cvttps2dq xmm2, xmm2
        packssdw xmm2, xmm3
        cvttps2dq xmm1, xmm1
        cvttps2dq xmm0, xmm0
        packssdw xmm0, xmm1
        packsswb xmm0, xmm2
        ret

float->int16 with AVX2:
        vcvttps2dq ymm1, ymm1
        vcvttps2dq ymm0, ymm0
        vpackssdw ymm0, ymm0, ymm1
        vpermq ymm0, ymm0, 216
        ret

float->int8 with AVX2:
        vcvttps2dq ymm3, ymm3
        vextracti128 xmm4, ymm3, 1
        vpackssdw xmm3, xmm3, xmm4
        vcvttps2dq ymm1, ymm1
        vextracti128 xmm4, ymm1, 1
        vpackssdw xmm1, xmm1, xmm4
        vinserti128 ymm1, ymm1, xmm3, 1
        vcvttps2dq ymm2, ymm2
        vextracti128 xmm3, ymm2, 1
        vpackssdw xmm2, xmm2, xmm3
        vcvttps2dq ymm0, ymm0
        vextracti128 xmm3, ymm0, 1
        vpackssdw xmm0, xmm0, xmm3
        vinserti128 ymm0, ymm0, xmm2, 1
        vpacksswb ymm0, ymm0, ymm1
        ret

float->int16 with AVX512:
        vcvttps2dq zmm0, zmm0
        vcvttps2dq zmm1, zmm1
        vpmovdw ymm0, zmm0
        vpmovdw ymm1, zmm1
        vinserti64x4 zmm0, zmm0, ymm1, 1
        ret

float->int8 with AVX512:
        vcvttps2dq zmm2, zmm2
        vcvttps2dq zmm3, zmm3
        vcvttps2dq zmm0, zmm0
        vcvttps2dq zmm1, zmm1
        vpmovdb xmm2, zmm2
        vpmovdb xmm3, zmm3
        vpmovdb xmm0, zmm0
        vpmovdb xmm1, zmm1
        vinserti128 ymm2, ymm2, xmm3, 1
        vinserti128 ymm0, ymm0, xmm1, 1
        vinserti64x4 zmm0, zmm0, ymm2, 1
        ret
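For illustration, the SSE2 and AVX2 float->int16/int8 sequences above
correspond roughly to the following intrinsics sketch (function names are made
up; the SSE2 parts need -msse2, which is implied on x86-64, and the AVX2
function uses a target attribute so the file builds with plain -O2). This is
only a sketch of the saturating cvttps2dq + pack approach, not a claim about
how GCC should structure its code generation.

#include <immintrin.h>

// SSE2 (N == 4): float->int16 via cvttps2dq + packssdw (saturating
// int32 -> int16 pack); matches cvt1(floatv, floatv).
__m128i cvt_f32_to_i16_sse2(__m128 x, __m128 y)
{
  return _mm_packs_epi32(_mm_cvttps_epi32(x), _mm_cvttps_epi32(y));
}

// SSE2 (N == 4): float->int8 via two packssdw followed by one packsswb;
// matches cvt1(floatv, floatv, floatv, floatv).
__m128i cvt_f32_to_i8_sse2(__m128 x, __m128 y, __m128 z, __m128 a)
{
  __m128i lo = _mm_packs_epi32(_mm_cvttps_epi32(x), _mm_cvttps_epi32(y));
  __m128i hi = _mm_packs_epi32(_mm_cvttps_epi32(z), _mm_cvttps_epi32(a));
  return _mm_packs_epi16(lo, hi);
}

// AVX2 (N == 8): vpackssdw packs within 128-bit lanes, so a vpermq with
// immediate 0xD8 (== 216) restores the element order afterwards.
__attribute__((target("avx2")))
__m256i cvt_f32_to_i16_avx2(__m256 x, __m256 y)
{
  __m256i p = _mm256_packs_epi32(_mm256_cvttps_epi32(x), _mm256_cvttps_epi32(y));
  return _mm256_permute4x64_epi64(p, 0xD8);
}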