[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518 Wilco changed: What|Removed |Added CC||wilco.dijkstra at arm dot com --- Comment #41 from Wilco --- I'm guessing it's this (64-bit loads reading 32-bit data): vldrd16, .L39 vldrd17, .L39+8 .L5: vsub.i32q10, q11, q8 add r3, r3, #1 vsub.i32q9, q8, q11 .L39: .word 1 .word 2 .word 3 .word 4
[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518 --- Comment #42 from Wilco --- Cut down example: typedef struct { int x, y; } X; void f (X *p, int n) { for (int i = 0; i < n; i++) { p[i].x = i; p[i].y = i + 1; } }
[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518 --- Comment #47 from Wilco --- (In reply to Jakub Jelinek from comment #46) > Wonder if that: > vect_array.11[0] = vect_vec_iv_.7_45; > vect_array.11[1] = vect__4.8_48; > on armeb shouldn't have been [1] and [0] instead, otherwise we end up with: > (insn 35 37 38 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 0) > (reg:V4SI 110 [ vect_vec_iv_.7 ])) "pr82518.c":8 939 {*neon_movv4si} > (nil)) > (insn 38 35 41 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 16) > (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ]) > (reg:V4SI 171))) "pr82518.c":8 998 {*addv4si3_neon} > (nil)) > (insn 41 38 39 5 (set (reg:V4SI 110 [ vect_vec_iv_.7 ]) > (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ]) > (reg:V4SI 169))) 998 {*addv4si3_neon} > (nil)) > (insn 39 41 43 5 (set (mem:OI (post_inc:SI (reg:SI 152 [ ivtmp.31 ])) [2 > MEM[(int *)vectp_p.9_49]+0 S32 A32]) > (unspec:OI [ > (reg:OI 155 [ vect_array.11 ]) > (unspec:V4SI [ > (const_int 0 [0]) > ] UNSPEC_VSTRUCTDUMMY) > ] UNSPEC_VST2)) "pr82518.c":8 2396 {neon_vst2v4si} > (expr_list:REG_INC (reg:SI 152 [ ivtmp.31 ]) > (nil))) > where pseudo 110 is the vect_vec_iv_.7_45 ({i, i + 1, i + 2, i + 3}) and > insn 38 adds {1, 1, 1, 1} to that. It really depends on what exactly the > neon_vst2v4si instruction does on armeb. > vmov.i32q10, #4 @ v4si > vmov.i32q9, #1 @ v4si > ... > vldrd16, .L19 > vldrd17, .L19+8 > .L4: > vadd.i32q11, q8, q9 > vst1.64 {d16-d17}, [sp:64] > vadd.i32q8, q8, q10 > vstrd22, [sp, #16] > vstrd23, [sp, #24] > vld1.64 {d22-d25}, [sp:64] > vst2.32 {d22-d25}, [r3]! > If it works like on armel, except the elements of the vectors are > byte-swapped, then it should be [1] and [0]. The vst2 works on little endian, but in big-endian the lane numbering is complex since all data is still treated as 64-bit quantities. The stores and vld1.64 have no effect on data layout, so everything is still 64-bit data in 64-bit registers. The vst2.32 can only be used in big-endian if the data is lane-swapped first. AArch64 in big-endian does this: .L26: mov v2.16b, v0.16b add v3.4s, v0.4s, v6.4s add v0.4s, v0.4s, v7.4s tbl v4.16b, {v2.16b}, v1.16b tbl v5.16b, {v3.16b}, v1.16b st2 {v4.4s - v5.4s}, [x2], 32 cmp x2, x3 bne .L26
[Bug tree-optimization/84114] global reassociation pass prevents fma usage, generates slower code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114 Wilco changed: What|Removed |Added CC||wilco.dijkstra at arm dot com --- Comment #5 from Wilco --- (In reply to Steve Ellcey from comment #4) > While teaching the reassociation pass about fma's seems like the right > answer would it be reasonable (and simpler) to do the fma pass > (pass_optimize_widening_mul) before > the reassociation pass (pass_reassoc) to get the most fma's? > > That fixes my small test case but I haven't done a bigger performance check > to see what the overall impact would be. I don't know what else that would affect since the reassociation phase runs very early - and it's late at this stage. My patch seems much safer. Even easier might be to return 1 for FLOAT_MODE PLUS_EXPR in aarch64_reassociation_width. Then we can fix the reassociation phase in GCC9.
[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518 --- Comment #49 from Wilco --- AArch64 does this: (define_expand "vec_store_lanesoi" [(set (match_operand:OI 0 "aarch64_simd_struct_operand" "=Utv") (unspec:OI [(match_operand:OI 1 "register_operand" "w") (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] UNSPEC_ST2))] "TARGET_SIMD" { if (BYTES_BIG_ENDIAN) { rtx tmp = gen_reg_rtx (OImode); rtx mask = aarch64_reverse_mask (mode, ); emit_insn (gen_aarch64_rev_reglistoi (tmp, operands[1], mask)); emit_insn (gen_aarch64_simd_st2 (operands[0], tmp)); } else emit_insn (gen_aarch64_simd_st2 (operands[0], operands[1])); DONE; }) ARM seems to be missing the swap: (define_expand "vec_store_lanesoi" [(set (match_operand:OI 0 "neon_struct_operand") (unspec:OI [(match_operand:OI 1 "s_register_operand") (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] UNSPEC_VST2))] "TARGET_NEON") So clearly looks like a backend issue.