https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048
--- Comment #12 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Hongtao.liu from comment #9)
> With the patch, we can generate optimized code except for those 16 {u,}qq
> cases, since the ABI doesn't support 1024-bit vectors.

Can't these be vectorized using partial vectors?

GCC generates:

_Z9vcvtqq2psDv16_l:
	vmovq	56(%rsp), %xmm0
	vmovq	40(%rsp), %xmm1
	vmovq	88(%rsp), %xmm2
	vmovq	120(%rsp), %xmm3
	vpinsrq	$1, 64(%rsp), %xmm0, %xmm0
	vpinsrq	$1, 48(%rsp), %xmm1, %xmm1
	vpinsrq	$1, 96(%rsp), %xmm2, %xmm2
	vpinsrq	$1, 128(%rsp), %xmm3, %xmm3
	vinserti128	$0x1, %xmm0, %ymm1, %ymm1
	vcvtqq2psy	8(%rsp), %xmm0
	vcvtqq2psy	%ymm1, %xmm1
	vinsertf128	$0x1, %xmm1, %ymm0, %ymm0
	vmovq	72(%rsp), %xmm1
	vpinsrq	$1, 80(%rsp), %xmm1, %xmm1
	vinserti128	$0x1, %xmm2, %ymm1, %ymm1
	vmovq	104(%rsp), %xmm2
	vcvtqq2psy	%ymm1, %xmm1
	vpinsrq	$1, 112(%rsp), %xmm2, %xmm2
	vinserti128	$0x1, %xmm3, %ymm2, %ymm2
	vcvtqq2psy	%ymm2, %xmm2
	vinsertf128	$0x1, %xmm2, %ymm1, %ymm1
	vinsertf64x4	$0x1, %ymm1, %zmm0, %zmm0

whereas clang manages to vectorize the function to:

	vcvtqq2ps	16(%rbp), %ymm0
	vcvtqq2ps	80(%rbp), %ymm1
	vinsertf64x4	$1, %ymm1, %zmm0, %zmm0
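
For reference, a minimal sketch of a testcase consistent with the mangled
name _Z9vcvtqq2psDv16_l (vcvtqq2ps(long __vector(16))); this assumes the
conversion is expressed via __builtin_convertvector and AVX-512DQ flags
such as -O2 -mavx512dq, the exact source is not quoted in this comment:

/* Hypothetical reproducer (not the PR's testcase): a 1024-bit source
   vector of 16 longs is converted to a 512-bit vector of 16 floats.
   The 1024-bit argument is passed in memory, which is why both
   compilers load it from the stack frame above.  */
typedef long  v16di __attribute__ ((vector_size (16 * sizeof (long))));
typedef float v16sf __attribute__ ((vector_size (16 * sizeof (float))));

v16sf
vcvtqq2ps (v16di x)
{
  /* Element-wise int64 -> float conversion.  */
  return __builtin_convertvector (x, v16sf);
}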