https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102510

--- Comment #2 from Dalon Work <dwwork at gmail dot com> ---
Thanks for the information. Based on your comments, I've created 2 new
subroutines that call the "bad" function. The first places the result in a
contiguous array, while the second places the result in a strided array.
(https://godbolt.org/z/bTnWr3bMn)

The first:

subroutine add2vecs3(a,b,c)
    real(r32), dimension(8), intent(in) :: a,b
    real(r32), dimension(8), intent(out) :: c
    c = add2vecs2(a,b)
end subroutine

With "-O3 -mavx", this subroutine becomes fully vectorized:

__blah_MOD_add2vecs3:
        vmovups ymm0, YMMWORD PTR [rdi]
        vaddps  ymm0, ymm0, YMMWORD PTR [rsi]
        vmovups YMMWORD PTR [rdx], ymm0
        vzeroupper
        ret

The second:

subroutine add2vecs4(a,b,c)
    real(r32), dimension(8), intent(in) :: a,b
    real(r32), dimension(16), intent(out) :: c
    c(1:16:2) = add2vecs2(a,b)
end subroutine

In this case the addition is still a single vaddps, but the result is stored
one element at a time:

__blah_MOD_add2vecs4:
        vmovups ymm0, YMMWORD PTR [rsi]
        vaddps  ymm0, ymm0, YMMWORD PTR [rdi]
        vmovss  DWORD PTR [rdx], xmm0
        vextractps      DWORD PTR [rdx+8], xmm0, 1
        vextractps      DWORD PTR [rdx+16], xmm0, 2
        vextractps      DWORD PTR [rdx+24], xmm0, 3
        vextractf128    xmm0, ymm0, 0x1
        vmovss  DWORD PTR [rdx+32], xmm0
        vextractps      DWORD PTR [rdx+40], xmm0, 1
        vextractps      DWORD PTR [rdx+48], xmm0, 2
        vextractps      DWORD PTR [rdx+56], xmm0, 3
        vzeroupper
        ret

From this, it seems you are correct. The result gets passed in as a descriptor
to a block of memory and from that the function figures out the best way to
fill in the data. Perhaps other compilers handle this differently, but there we
have it.

Changing this behavior might be difficult or impossible, as this would be an
ABI change, would it not? It's arguable whether it's even worth changing.
Perhaps other compilers do it differently. I guess what I assumed is that the
compiler would have a contiguous block of memory available for the return
result. Any necessary striding would happen external to the function.
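
To make that assumption concrete, here is a minimal sketch of what "striding
external to the function" could look like when written by hand. The names
blah_sketch, add2vecs5 and tmp are hypothetical, and the body of add2vecs2 is
an assumed stand-in for the "bad" function discussed earlier in this report. I
have not checked what code gfortran generates for it.

```fortran
module blah_sketch
    use iso_fortran_env, only: r32 => real32
    implicit none
contains
    ! Assumed stand-in for the "bad" function from earlier in this report.
    function add2vecs2(a,b) result(c)
        real(r32), dimension(8), intent(in) :: a,b
        real(r32), dimension(8) :: c
        c = a + b
    end function

    ! Hypothetical variant of add2vecs4: the function result goes into a
    ! contiguous temporary first, and the strided copy happens in the caller.
    subroutine add2vecs5(a,b,c)
        real(r32), dimension(8), intent(in) :: a,b
        real(r32), dimension(16), intent(out) :: c
        real(r32), dimension(8) :: tmp
        tmp = add2vecs2(a,b)
        c(1:16:2) = tmp
    end subroutine
end module

program demo
    use blah_sketch
    implicit none
    real(r32) :: a(8), b(8), c(16)
    integer :: i
    a = [(real(i, r32), i = 1, 8)]
    b = 10.0_r32
    call add2vecs5(a, b, c)
    ! Only the odd-indexed elements of c are defined by add2vecs5.
    print *, c(1:16:2)
end program
```

As in add2vecs4, the even-indexed elements of c are left undefined. Whether
the explicit temporary changes the generated stores is something to check on
godbolt rather than assume.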
