https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518
--- Comment #47 from Wilco <wilco.dijkstra at arm dot com> ---
(In reply to Jakub Jelinek from comment #46)
> Wonder if that:
>   vect_array.11[0] = vect_vec_iv_.7_45;
>   vect_array.11[1] = vect__4.8_48;
> on armeb shouldn't have been [1] and [0] instead, otherwise we end up with:
> (insn 35 37 38 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 0)
>         (reg:V4SI 110 [ vect_vec_iv_.7 ])) "pr82518.c":8 939 {*neon_movv4si}
>      (nil))
> (insn 38 35 41 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 16)
>         (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
>             (reg:V4SI 171))) "pr82518.c":8 998 {*addv4si3_neon}
>      (nil))
> (insn 41 38 39 5 (set (reg:V4SI 110 [ vect_vec_iv_.7 ])
>         (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
>             (reg:V4SI 169))) 998 {*addv4si3_neon}
>      (nil))
> (insn 39 41 43 5 (set (mem:OI (post_inc:SI (reg:SI 152 [ ivtmp.31 ])) [2
> MEM[(int *)vectp_p.9_49]+0 S32 A32])
>         (unspec:OI [
>                 (reg:OI 155 [ vect_array.11 ])
>                 (unspec:V4SI [
>                         (const_int 0 [0])
>                     ] UNSPEC_VSTRUCTDUMMY)
>             ] UNSPEC_VST2)) "pr82518.c":8 2396 {neon_vst2v4si}
>      (expr_list:REG_INC (reg:SI 152 [ ivtmp.31 ])
>         (nil)))
> where pseudo 110 is the vect_vec_iv_.7_45 ({i, i + 1, i + 2, i + 3}) and
> insn 38 adds {1, 1, 1, 1} to that.  It really depends on what exactly the
> neon_vst2v4si instruction does on armeb.
>         vmov.i32        q10, #4  @ v4si
>         vmov.i32        q9, #1  @ v4si
> ...
>         vldr    d16, .L19
>         vldr    d17, .L19+8
> .L4:
>         vadd.i32        q11, q8, q9
>         vst1.64 {d16-d17}, [sp:64]
>         vadd.i32        q8, q8, q10
>         vstr    d22, [sp, #16]
>         vstr    d23, [sp, #24]
>         vld1.64 {d22-d25}, [sp:64]
>         vst2.32 {d22-d25}, [r3]!
> If it works like on armel, except the elements of the vectors are
> byte-swapped, then it should be [1] and [0].

The vst2 works on little endian, but in big-endian the lane numbering is
complex since all data is still treated as 64-bit quantities. The stores and
vld1.64 have no effect on data layout, so everything is still 64-bit data in
64-bit registers.
The vst2.32 can only be used in big-endian if the data is lane-swapped first.
AArch64 in big-endian does this:

.L26:
        mov     v2.16b, v0.16b
        add     v3.4s, v0.4s, v6.4s
        add     v0.4s, v0.4s, v7.4s
        tbl     v4.16b, {v2.16b}, v1.16b
        tbl     v5.16b, {v3.16b}, v1.16b
        st2     {v4.4s - v5.4s}, [x2], 32
        cmp     x2, x3
        bne     .L26
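
To make the lane-swap point concrete, here is a rough scalar model in C. It is only a sketch under stated assumptions, not what GCC or the hardware literally does: it assumes that a big-endian 64-bit load puts the lower-addressed 32-bit element into the upper lane of each doubleword, and that vst2.32 interleaves strictly by architectural lane number. All names (q_reg, load64_be, st2_model, rev64_32, st2_matches_memory_order) are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* A Q register modelled as four 32-bit architectural lanes
   (lane 0 = bits 31:0 of the low doubleword). Hypothetical type. */
typedef struct { int32_t lane[4]; } q_reg;

/* Assumed model of a big-endian 64-bit load (vldr / vld1.64) of four ints:
   within each doubleword the lower-addressed element lands in the upper lane. */
q_reg load64_be(const int32_t *mem)
{
    q_reg q = { { mem[1], mem[0], mem[3], mem[2] } };
    return q;
}

/* vst2.32-style store: interleave a and b by architectural lane number. */
void st2_model(int32_t *out, q_reg a, q_reg b)
{
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a.lane[i];
        out[2 * i + 1] = b.lane[i];
    }
}

/* Swap the two 32-bit lanes inside each 64-bit half (the effect of vrev64.32). */
q_reg rev64_32(q_reg q)
{
    q_reg r = { { q.lane[1], q.lane[0], q.lane[3], q.lane[2] } };
    return r;
}

/* Returns 1 if storing {a, b} produces ma[0] mb[0] ma[1] mb[1] ... in memory,
   where ma and mb are the source arrays in memory order. */
int st2_matches_memory_order(q_reg a, q_reg b,
                             const int32_t *ma, const int32_t *mb)
{
    int32_t out[8], want[8];
    st2_model(out, a, b);
    for (int i = 0; i < 4; i++) {
        want[2 * i]     = ma[i];
        want[2 * i + 1] = mb[i];
    }
    return memcmp(out, want, sizeof out) == 0;
}
```

With {0, 1, 2, 3} and {1, 2, 3, 4} as the two inputs, this model stores 1 2 0 1 3 4 2 3 when the registers are used as loaded, instead of the wanted 0 1 1 2 2 3 3 4. Exchanging the two vectors ([1] and [0]) does not fix it in this model either; applying rev64_32 to both before the store does.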