https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605
            Bug ID: 88605
           Summary: vector extensions: Widening or conversion generates
                    inefficient or scalar code.
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

If you want to, say, convert a u32x2 vector to a u64x2 while avoiding
intrinsics, good luck. GCC doesn't have a builtin like
__builtin_convertvector, and doing the conversion manually generates scalar
code. This makes clean generic vector code difficult. SSE and NEON both have
plenty of conversion instructions, such as pmovzxdq or vmovl.u32, but GCC will
not emit them.

typedef unsigned long long U64;
typedef U64 U64x2 __attribute__((vector_size(16)));
typedef unsigned int U32;
typedef U32 U32x2 __attribute__((vector_size(8)));

U64x2 vconvert_u64_u32(U32x2 v)
{
    return (U64x2) { v[0], v[1] };
}

x86_32:
Flags: -O3 -m32 -msse4.1

Clang Trunk (revision 350063):
vconvert_u64_u32:
        pmovzxdq xmm0, qword ptr [esp + 4]   # xmm0 = mem[0],zero,mem[1],zero
        ret

GCC (GCC-Explorer-Build) 9.0.0 20181225 (experimental):
convert_u64_u32:
        push    ebx
        sub     esp, 40
        movq    QWORD PTR [esp+8], mm0
        mov     ecx, DWORD PTR [esp+8]
        mov     ebx, DWORD PTR [esp+12]
        mov     DWORD PTR [esp+8], ecx
        movd    xmm0, DWORD PTR [esp+8]
        mov     DWORD PTR [esp+20], ebx
        movd    xmm1, DWORD PTR [esp+20]
        mov     DWORD PTR [esp+16], ecx
        add     esp, 40
        punpcklqdq xmm0, xmm1
        pop     ebx
        ret

I can't even understand what is going on here, except that it wastes 44 bytes
of stack for no good reason.

x86_64:
Flags: -O3 -m64 -msse4.1

Clang:
vconvert_u64_u32:
        pmovzxdq xmm0, xmm0                  # xmm0 = xmm0[0],zero,xmm0[1],zero
        ret

GCC:
vconvert_u64_u32:
        movq    rax, xmm0
        movd    DWORD PTR [rsp-28], xmm0
        movd    xmm0, DWORD PTR [rsp-28]
        shr     rax, 32
        pinsrq  xmm0, rax, 1
        ret

ARMv7 NEON:
Flags: -march=armv7-a -mfloat-abi=hard -mfpu=neon -O3

Clang (with --target=arm-none-eabi):
vconvert_u64_u32:
        vmovl.u32 q0, d0
        bx      lr

arm-unknown-linux-gnueabi-gcc (GCC) 8.2.0:
vconvert_u64_u32:
        mov     r3, #0
        sub     sp, sp, #16
        add     r2, sp, #8
        vst1.32 {d0[0]}, [sp]
        vst1.32 {d0[1]}, [r2]
        str     r3, [sp, #4]
        str     r3, [sp, #12]
        vld1.64 {d0-d1}, [sp:64]
        add     sp, sp, #16
        bx      lr

aarch64 NEON:
Flags: -O3

Clang (with --target=aarch64-none-eabi):
vconvert_u64_u32:
        ushll   v0.2d, v0.2s, #0
        ret

aarch64-unknown-linux-gnu-gcc 8.2.0:
vconvert_u64_u32:
        umov    w1, v0.s[0]
        umov    w0, v0.s[1]
        uxtw    x1, w1
        uxtw    x0, w0
        dup     v0.2d, x1
        ins     v0.d[1], x0
        ret

Another example is getting a standalone pmuludq. In Clang, this always
generates pmuludq:

U64x2 pmuludq(U64x2 v1, U64x2 v2)
{
    return (v1 & 0xFFFFFFFF) * (v2 & 0xFFFFFFFF);
}

But GCC generates this:

pmuludq:
        movdqa  xmm2, XMMWORD PTR .LC0[rip]
        pand    xmm0, xmm2
        pand    xmm2, xmm1
        movdqa  xmm4, xmm2
        movdqa  xmm1, xmm0
        movdqa  xmm3, xmm0
        psrlq   xmm4, 32
        psrlq   xmm1, 32
        pmuludq xmm0, xmm4
        pmuludq xmm1, xmm2
        pmuludq xmm3, xmm2
        paddq   xmm1, xmm0
        psllq   xmm1, 32
        paddq   xmm3, xmm1
        movdqa  xmm0, xmm3
        ret
.LC0:
        .quad   4294967295
        .quad   4294967295

And that is the best code it generates; much worse code comes out depending on
how you write it. Meanwhile, although Clang has some struggles with SSE2 and
x86_64, there is a reliable way to get it to generate pmuludq, and the NEON
equivalent, vmull.u32: https://godbolt.org/z/H_tOi1
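
For reference, here is a minimal sketch of what I mean, written with Clang's
__builtin_convertvector (which, as noted above, GCC does not provide). The
helper names are mine, added only for illustration; this is not necessarily
the exact code behind the Godbolt link.

/* Illustrative sketch only: relies on Clang's __builtin_convertvector,
   which GCC lacks. Helper names are hypothetical. */
typedef unsigned long long U64;
typedef U64 U64x2 __attribute__((vector_size(16)));
typedef unsigned int U32;
typedef U32 U32x2 __attribute__((vector_size(8)));

/* Widening conversion: on Clang this lowers to the same zero-extension
   instructions shown above (pmovzxdq / vmovl.u32 / ushll). */
U64x2 vconvert_u64_u32_builtin(U32x2 v)
{
    return __builtin_convertvector(v, U64x2);
}

/* Widening multiply: zero-extend both operands, then multiply. Clang can
   recognize the zero-extend + multiply pattern and emit a single widening
   multiply (pmuludq on SSE, vmull.u32 on NEON). */
U64x2 vwiden_mul_u32(U32x2 a, U32x2 b)
{
    return __builtin_convertvector(a, U64x2)
         * __builtin_convertvector(b, U64x2);
}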