https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518

--- Comment #47 from Wilco <wilco.dijkstra at arm dot com> ---

(In reply to Jakub Jelinek from comment #46)
> Wonder if that:
>   vect_array.11[0] = vect_vec_iv_.7_45;
>   vect_array.11[1] = vect__4.8_48;
> on armeb shouldn't have been [1] and [0] instead, otherwise we end up with:
> (insn 35 37 38 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 0)
>         (reg:V4SI 110 [ vect_vec_iv_.7 ])) "pr82518.c":8 939 {*neon_movv4si}
>      (nil))
> (insn 38 35 41 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 16)
>         (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
>             (reg:V4SI 171))) "pr82518.c":8 998 {*addv4si3_neon}
>      (nil))
> (insn 41 38 39 5 (set (reg:V4SI 110 [ vect_vec_iv_.7 ])
>         (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
>             (reg:V4SI 169))) 998 {*addv4si3_neon}
>      (nil))
> (insn 39 41 43 5 (set (mem:OI (post_inc:SI (reg:SI 152 [ ivtmp.31 ])) [2
> MEM[(int *)vectp_p.9_49]+0 S32 A32])
>         (unspec:OI [
>                 (reg:OI 155 [ vect_array.11 ])
>                 (unspec:V4SI [
>                         (const_int 0 [0])
>                     ] UNSPEC_VSTRUCTDUMMY)
>             ] UNSPEC_VST2)) "pr82518.c":8 2396 {neon_vst2v4si}
>      (expr_list:REG_INC (reg:SI 152 [ ivtmp.31 ])
>         (nil)))
> where pseudo 110 is the vect_vec_iv_.7_45 ({i, i + 1, i + 2, i + 3}) and
> insn 38 adds {1, 1, 1, 1} to that.  It really depends on what exactly the
> neon_vst2v4si instruction does on armeb.
>         vmov.i32        q10, #4  @ v4si
>         vmov.i32        q9, #1  @ v4si
> ...
>         vldr    d16, .L19
>         vldr    d17, .L19+8
> .L4:
>         vadd.i32        q11, q8, q9
>         vst1.64 {d16-d17}, [sp:64]
>         vadd.i32        q8, q8, q10
>         vstr    d22, [sp, #16]
>         vstr    d23, [sp, #24]
>         vld1.64 {d22-d25}, [sp:64]
>         vst2.32 {d22-d25}, [r3]!
> If it works like on armel, except the elements of the vectors are
> byte-swapped, then it should be [1] and [0].

The vst2 works as expected on little-endian, but on big-endian the lane
numbering is more complex, since all data is still treated as 64-bit quantities.

The vst1.64/vstr stores and the vld1.64 have no effect on data layout, so
everything is still 64-bit data in 64-bit registers. The vst2.32 can only be
used in big-endian if the data is lane-swapped first. AArch64 in big-endian
does this:

.L26:
        mov     v2.16b, v0.16b
        add     v3.4s, v0.4s, v6.4s
        add     v0.4s, v0.4s, v7.4s
        tbl     v4.16b, {v2.16b}, v1.16b
        tbl     v5.16b, {v3.16b}, v1.16b
        st2     {v4.4s - v5.4s}, [x2], 32
        cmp     x2, x3
        bne     .L26
