https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #12 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Hongtao.liu from comment #9)
> With the patch, we can generate optimized code except for those 16 {u,}qq
> cases, since the ABI doesn't support 1024-bit vectors.

Can't these be vectorized using partial vectors? GCC generates:

_Z9vcvtqq2psDv16_l:
        vmovq   56(%rsp), %xmm0
        vmovq   40(%rsp), %xmm1
        vmovq   88(%rsp), %xmm2
        vmovq   120(%rsp), %xmm3
        vpinsrq $1, 64(%rsp), %xmm0, %xmm0
        vpinsrq $1, 48(%rsp), %xmm1, %xmm1
        vpinsrq $1, 96(%rsp), %xmm2, %xmm2
        vpinsrq $1, 128(%rsp), %xmm3, %xmm3
        vinserti128     $0x1, %xmm0, %ymm1, %ymm1
        vcvtqq2psy      8(%rsp), %xmm0
        vcvtqq2psy      %ymm1, %xmm1
        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        vmovq   72(%rsp), %xmm1
        vpinsrq $1, 80(%rsp), %xmm1, %xmm1
        vinserti128     $0x1, %xmm2, %ymm1, %ymm1
        vmovq   104(%rsp), %xmm2
        vcvtqq2psy      %ymm1, %xmm1
        vpinsrq $1, 112(%rsp), %xmm2, %xmm2
        vinserti128     $0x1, %xmm3, %ymm2, %ymm2
        vcvtqq2psy      %ymm2, %xmm2
        vinsertf128     $0x1, %xmm2, %ymm1, %ymm1
        vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0

whereas clang manages to vectorize the function to:

  vcvtqq2ps 16(%rbp), %ymm0
  vcvtqq2ps 80(%rbp), %ymm1
  vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
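
For reference, a minimal sketch of a testcase that produces the mangled name
_Z9vcvtqq2psDv16_l above (a C++ function taking a vector of 16 long). The type
names and the use of __builtin_convertvector are assumptions, not taken from
the original report; compiled with something like -O2 -mavx512dq -mavx512vl:

/* 16 x 64-bit signed integers (1024 bits) and 16 x float (512 bits),
   using GNU vector extensions.  */
typedef long  v16qq __attribute__ ((vector_size (16 * sizeof (long))));
typedef float v16sf __attribute__ ((vector_size (16 * sizeof (float))));

v16sf
vcvtqq2ps (v16qq x)
{
  /* Element-wise int64 -> float conversion; ideally this maps to
     vcvtqq2ps instructions rather than scalar element shuffling.  */
  return __builtin_convertvector (x, v16sf);
}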
