[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917

2018-02-15 Thread wilco.dijkstra at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518

Wilco  changed:

   What|Removed |Added

 CC||wilco.dijkstra at arm dot com

--- Comment #41 from Wilco  ---
I'm guessing it's this (64-bit loads reading 32-bit data):

vldrd16, .L39
vldrd17, .L39+8
.L5:
vsub.i32q10, q11, q8
add r3, r3, #1
vsub.i32q9, q8, q11


.L39:
.word   1
.word   2
.word   3
.word   4

[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917

2018-02-15 Thread wilco.dijkstra at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518

--- Comment #42 from Wilco  ---
Cut down example:

typedef struct { int x, y; } X;

void f (X *p, int n)
{
  for (int i = 0; i < n; i++)
  { p[i].x = i;
p[i].y = i + 1;
  }
}

[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917

2018-02-16 Thread wilco.dijkstra at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518

--- Comment #47 from Wilco  ---

(In reply to Jakub Jelinek from comment #46)
> Wonder if that:
>   vect_array.11[0] = vect_vec_iv_.7_45;
>   vect_array.11[1] = vect__4.8_48;
> on armeb shouldn't have been [1] and [0] instead, otherwise we end up with:
> (insn 35 37 38 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 0)
> (reg:V4SI 110 [ vect_vec_iv_.7 ])) "pr82518.c":8 939 {*neon_movv4si}
>  (nil))
> (insn 38 35 41 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 16)
> (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
> (reg:V4SI 171))) "pr82518.c":8 998 {*addv4si3_neon}
>  (nil))
> (insn 41 38 39 5 (set (reg:V4SI 110 [ vect_vec_iv_.7 ])
> (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
> (reg:V4SI 169))) 998 {*addv4si3_neon}
>  (nil))
> (insn 39 41 43 5 (set (mem:OI (post_inc:SI (reg:SI 152 [ ivtmp.31 ])) [2
> MEM[(int *)vectp_p.9_49]+0 S32 A32])
> (unspec:OI [
> (reg:OI 155 [ vect_array.11 ])
> (unspec:V4SI [
> (const_int 0 [0])
> ] UNSPEC_VSTRUCTDUMMY)
> ] UNSPEC_VST2)) "pr82518.c":8 2396 {neon_vst2v4si}
>  (expr_list:REG_INC (reg:SI 152 [ ivtmp.31 ])
> (nil)))
> where pseudo 110 is the vect_vec_iv_.7_45 ({i, i + 1, i + 2, i + 3}) and
> insn 38 adds {1, 1, 1, 1} to that.  It really depends on what exactly the
> neon_vst2v4si instruction does on armeb.
> vmov.i32q10, #4  @ v4si
> vmov.i32q9, #1  @ v4si
> ...
> vldrd16, .L19
> vldrd17, .L19+8
> .L4:
> vadd.i32q11, q8, q9
> vst1.64 {d16-d17}, [sp:64]
> vadd.i32q8, q8, q10
> vstrd22, [sp, #16]
> vstrd23, [sp, #24]
> vld1.64 {d22-d25}, [sp:64]
> vst2.32 {d22-d25}, [r3]!
> If it works like on armel, except the elements of the vectors are
> byte-swapped, then it should be [1] and [0].

The vst2 works on little endian, but in big-endian the lane numbering is
complex since all data is still treated as 64-bit quantities. 

The stores and vld1.64 have no effect on data layout, so everything is still
64-bit data in 64-bit registers. The vst2.32 can only be used in big-endian if
the data is lane-swapped first. AArch64 in big-endian does this:

.L26:
mov v2.16b, v0.16b
add v3.4s, v0.4s, v6.4s
add v0.4s, v0.4s, v7.4s
tbl v4.16b, {v2.16b}, v1.16b
tbl v5.16b, {v3.16b}, v1.16b
st2 {v4.4s - v5.4s}, [x2], 32
cmp x2, x3
bne .L26

[Bug tree-optimization/84114] global reassociation pass prevents fma usage, generates slower code

2018-02-16 Thread wilco.dijkstra at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84114

Wilco  changed:

   What|Removed |Added

 CC||wilco.dijkstra at arm dot com

--- Comment #5 from Wilco  ---
(In reply to Steve Ellcey from comment #4)
> While teaching the reassociation pass about fma's seems like the right
> answer would it be reasonable (and simpler) to do the fma pass
> (pass_optimize_widening_mul) before
> the reassociation pass (pass_reassoc) to get the most fma's?
> 
> That fixes my small test case but I haven't done a bigger performance check
> to see what the overall impact would be.

I don't know what else that would affect since the reassociation phase runs
very early - and it's late at this stage. My patch seems much safer. Even
easier might be to return 1 for FLOAT_MODE PLUS_EXPR in
aarch64_reassociation_width. Then we can fix the reassociation phase in GCC9.

[Bug tree-optimization/82518] [8 regression] gfortran.fortran-torture/execute/in-pack.f90 fails on armeb since r252917

2018-02-16 Thread wilco.dijkstra at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518

--- Comment #49 from Wilco  ---
AArch64 does this:

(define_expand "vec_store_lanesoi"
  [(set (match_operand:OI 0 "aarch64_simd_struct_operand" "=Utv")
(unspec:OI [(match_operand:OI 1 "register_operand" "w")
(unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
   UNSPEC_ST2))]
  "TARGET_SIMD"
{
  if (BYTES_BIG_ENDIAN)
{
  rtx tmp = gen_reg_rtx (OImode);
  rtx mask = aarch64_reverse_mask (mode, );
  emit_insn (gen_aarch64_rev_reglistoi (tmp, operands[1], mask));
  emit_insn (gen_aarch64_simd_st2 (operands[0], tmp));
}
  else
emit_insn (gen_aarch64_simd_st2 (operands[0], operands[1]));
  DONE;
})

ARM seems to be missing the swap:

(define_expand "vec_store_lanesoi"
  [(set (match_operand:OI 0 "neon_struct_operand")
(unspec:OI [(match_operand:OI 1 "s_register_operand")
(unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
   UNSPEC_VST2))]
  "TARGET_NEON")

So clearly looks like a backend issue.