[PATCH] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-09-29 Thread Ajit Agarwal
Hello All:

This patch adds a new pass that replaces vector loads (lxv) from contiguous
addresses with the MMA instruction lxvp.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

rs6000: Add new pass for replacement of contiguous lxv with lxvp

New pass to replace vector loads (lxv) from contiguous addresses with the MMA
instruction lxvp.  This pass is registered before the cse RTL pass.
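
As a rough illustration of what "contiguous addresses" means here, the following is a minimal sketch, not code from the patch. It assumes a simplified model in which each 16-byte vector load is described by a base register and a byte offset; the names `VecLoad`, `vec_bytes` and `contiguous_pair_p` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical model of a 16-byte vector load: a base register plus a
// byte offset.  These names are illustrative and not taken from the patch.
struct VecLoad
{
  int base_reg;
  std::int64_t offset;
};

constexpr std::int64_t vec_bytes = 16;

// Return true if the two loads read adjacent 16-byte chunks off the same
// base register, i.e. together they cover one contiguous 32-byte region.
bool
contiguous_pair_p (const VecLoad &a, const VecLoad &b)
{
  if (a.base_reg != b.base_reg)
    return false;
  std::int64_t diff = b.offset - a.offset;
  return diff == vec_bytes || diff == -vec_bytes;
}
```

A real pass would also have to check that nothing between the two loads clobbers the base register or the loaded memory; that part is left out of this sketch.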

2023-09-29  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: Register vecload pass.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add rs6000-vecload-opt.o to extra_objs.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
---
 gcc/config.gcc |   4 +-
 gcc/config/rs6000/rs6000-passes.def|   1 +
 gcc/config/rs6000/rs6000-protos.h  |   2 +
 gcc/config/rs6000/rs6000-vecload-opt.cc| 207 +
 gcc/config/rs6000/rs6000.cc|   3 +-
 gcc/config/rs6000/t-rs6000 |   4 +
 gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
 7 files changed, 233 insertions(+), 3 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..482ab094b89 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -515,7 +515,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -552,7 +552,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt 
rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-logue.cc 
\$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def 
b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..58a74058c6a 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
  The power8 does not have instructions that automaticaly do the byte swaps
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h 
b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..9c44bae33d3 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
 extern bool gpr_or_gpr_p (rtx, rtx);
 extern bool direct_move_p (rtx, rtx);
 extern bool quad_address_p (rtx, machine_mode, bool);
+extern bool mode_supports_dq_form (machine_mode);
 extern bool quad_load_store_p (rtx, rtx);
 extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
 extern void expand_fusion_gpr_load (rtx *);
@@ -344,6 +345,7 @@ class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc 
b/gcc/config/rs6000/rs6000-vecload-opt.cc
new file mode 100644
index 000..955e5d6361b
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
@@ -0,0 +1,207 @@
+/* Subroutines used to replace lxv with lxvp
+   for p10 little-endian VSX code.
+   Copyright (C) 1991-2023 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; withou

[PATCH v1] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-10-06 Thread Ajit Agarwal
Hello All:

This patch adds a new pass that replaces vector loads (lxv) from contiguous
addresses with the MMA instruction lxvp.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Add new pass for replacement of contiguous lxv with lxvp.

New pass to replace lxv loads from contiguous addresses with lxvp.  This pass
is registered after the ree RTL pass.
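
For orientation, here is a small source-level sketch of the kind of code that reads two adjacent 16-byte chunks. It is not the patch's vecload.C test (which is not shown in this message), and it is only an assumption that a compiler would turn these copies into lxv loads that the new pass could then pair. The names `Vec16` and `load_adjacent` are hypothetical.

```cpp
#include <cstring>

// Hypothetical 16-byte container standing in for a vector register.
struct Vec16
{
  unsigned char bytes[16];
};

// Copy two adjacent 16-byte chunks starting at P.  The second read begins
// exactly 16 bytes after the first, so the two reads are contiguous.
void
load_adjacent (const unsigned char *p, Vec16 &lo, Vec16 &hi)
{
  std::memcpy (&lo, p, sizeof lo);
  std::memcpy (&hi, p + sizeof lo, sizeof hi);
}
```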

2023-10-07  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: Register vecload pass.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add rs6000-vecload-opt.o to extra_objs.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.
* df-scan.cc: Modified gcc assert.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
---
 gcc/config.gcc |   4 +-
 gcc/config/rs6000/rs6000-passes.def|   1 +
 gcc/config/rs6000/rs6000-protos.h  |   2 +
 gcc/config/rs6000/rs6000-vecload-opt.cc| 233 +
 gcc/config/rs6000/rs6000.cc|   3 +-
 gcc/config/rs6000/t-rs6000 |   4 +
 gcc/df-scan.cc |  10 +-
 gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
 8 files changed, 267 insertions(+), 5 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..482ab094b89 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -515,7 +515,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -552,7 +552,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt 
rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-logue.cc 
\$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def 
b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..9ecf8ce6a9c 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
  The power8 does not have instructions that automaticaly do the byte swaps
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h 
b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..9c44bae33d3 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
 extern bool gpr_or_gpr_p (rtx, rtx);
 extern bool direct_move_p (rtx, rtx);
 extern bool quad_address_p (rtx, machine_mode, bool);
+extern bool mode_supports_dq_form (machine_mode);
 extern bool quad_load_store_p (rtx, rtx);
 extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
 extern void expand_fusion_gpr_load (rtx *);
@@ -344,6 +345,7 @@ class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc 
b/gcc/config/rs6000/rs6000-vecload-opt.cc
new file mode 100644
index 000..1fa3628d634
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
@@ -0,0 +1,233 @@
+/* Subroutines used to replace lxv with lxvp
+   for p10 little-endian VSX code.
+   Copyright (C) 2020-2023 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal .
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later v

[PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-10-07 Thread Ajit Agarwal
Hello All:

This patch adds a new pass that replaces vector loads (lxv) from contiguous
addresses with the MMA instruction lxvp.  This version fixes one regression
failure on the ARM architecture.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Add new pass for replacement of contiguous lxv with lxvp.

New pass to replace lxv loads from contiguous addresses with lxvp.  This pass
is registered after the ree RTL pass.

2023-10-07  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: Register vecload pass.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add rs6000-vecload-opt.o to extra_objs.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
---
 gcc/config.gcc |   4 +-
 gcc/config/rs6000/rs6000-passes.def|   1 +
 gcc/config/rs6000/rs6000-protos.h  |   2 +
 gcc/config/rs6000/rs6000-vecload-opt.cc| 234 +
 gcc/config/rs6000/rs6000.cc|   3 +-
 gcc/config/rs6000/t-rs6000 |   4 +
 gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
 7 files changed, 260 insertions(+), 3 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..482ab094b89 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -515,7 +515,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -552,7 +552,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt 
rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-logue.cc 
\$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def 
b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..9ecf8ce6a9c 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
  The power8 does not have instructions that automaticaly do the byte swaps
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h 
b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..9c44bae33d3 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
 extern bool gpr_or_gpr_p (rtx, rtx);
 extern bool direct_move_p (rtx, rtx);
 extern bool quad_address_p (rtx, machine_mode, bool);
+extern bool mode_supports_dq_form (machine_mode);
 extern bool quad_load_store_p (rtx, rtx);
 extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
 extern void expand_fusion_gpr_load (rtx *);
@@ -344,6 +345,7 @@ class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc 
b/gcc/config/rs6000/rs6000-vecload-opt.cc
new file mode 100644
index 000..63ee733af89
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
@@ -0,0 +1,234 @@
+/* Subroutines used to replace lxv with lxvp
+   for p10 little-endian VSX code.
+   Copyright (C) 2020-2023 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal .
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distribut

[PATCH v3] rs6000: fmr gets used instead of faster xxlor [PR93571]

2023-10-10 Thread Ajit Agarwal
Hello Segher:

Here is the patch that uses xxlor instead of fmr where possible.
Performance results show that fmr is better on the power9 and
power10 architectures, whereas xxlor is better on the power7 and
power8 architectures.  fmr is the only option before p7.

Incorporated review comments.

Bootstrapped and regtested on powerpc64-linux-gnu

Thanks & Regards
Ajit

rs6000: Use xxlor instead of fmr where possible

Replace fmr with the xxlor instruction on the power7 and power8
architectures; keep the fmr instruction on power9 and power10.

Perf measurement results:

Power9 fmr:  201,847,661 cycles.
Power9 xxlor: 201,877,78 cycles.
Power8 fmr: 200,901,043 cycles.
Power8 xxlor: 201,020,518 cycles.
Power7 fmr: 201,059,524 cycles.
Power7 xxlor: 201,042,851 cycles.
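
The enablement condition that the rs6000.md hunk adds for the new "p7p8v" isa value can be modelled as a plain predicate. This is only a sketch under assumptions: `Tune`, `Target` and `p7p8v_enabled_p` are hypothetical names standing in for rs6000_tune, TARGET_VSX and TARGET_P9_VECTOR, and the actual choice between alternatives is made by the compiler's own machinery, not by a function like this.

```cpp
// Hypothetical stand-ins for the tuning target and ISA flags.
enum class Tune { power7, power8, power9, power10 };

struct Target
{
  Tune tune;        // models rs6000_tune
  bool vsx;         // models TARGET_VSX
  bool p9_vector;   // models TARGET_P9_VECTOR
};

// Mirrors the match_test added for "p7p8v":
//   rs6000_tune != PROCESSOR_POWER9 && TARGET_VSX && !TARGET_P9_VECTOR
bool
p7p8v_enabled_p (const Target &t)
{
  return t.tune != Tune::power9 && t.vsx && !t.p9_vector;
}
```

In this model the xxlor alternative is only available when VSX is present but the power9 vector support is not, which matches the intent stated above of using xxlor on power7 and power8.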

2023-10-10  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000.md (*movdf_hardfloat64): Use xxlor for power7
and power8 and fmr for power9 and power10.
---
 gcc/config/rs6000/rs6000.md | 45 -
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index d4337ce42a9..46982637d79 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -355,7 +355,7 @@
   (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
 
 ;; The ISA we implement.
-(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10"
+(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p7p8v,p9,p9v,p9kf,p9tf,p10"
   (const_string "any"))
 
 ;; Is this alternative enabled for the current CPU/ISA/etc.?
@@ -403,6 +403,12 @@
  (and (eq_attr "isa" "p10")
  (match_test "TARGET_POWER10"))
  (const_int 1)
+
+ (and (eq_attr "isa" "p7p8v")
+ (match_test "rs6000_tune != PROCESSOR_POWER9
+  && TARGET_VSX && !TARGET_P9_VECTOR"))
+ (const_int 1)
+
 ] (const_int 0)))
 
 ;; If this instruction is microcoded on the CELL processor
@@ -8551,27 +8557,29 @@
 
 (define_insn "*mov<mode>_hardfloat64"
   [(set (match_operand:FMOVE64 0 "nonimmediate_operand"
-   "=m,   d,  d,  ,   wY,
- ,Z,  ,  ,  !r,
+   "=m,   d,  ,  ,   wY,
+ ,Z,  wa, ,  !r,
  YZ,  r,  !r, *c*l,   !r,
-*h,   r,  ,   wa")
+*h,   r,  ,   d,  wn,
+wa")
(match_operand:FMOVE64 1 "input_operand"
-"d,   m,  d,  wY, ,
- Z,   ,   ,  ,  ,
+"d,   m,  ,  wY, ,
+ Z,   ,   wa, ,  ,
  r,   YZ, r,  r,  *h,
- 0,   ,   r,  eP"))]
+ 0,   ,   r,  d,  wn,
+ eP"))]
   "TARGET_POWERPC64 && TARGET_HARD_FLOAT
&& (gpc_reg_operand (operands[0], <MODE>mode)
|| gpc_reg_operand (operands[1], <MODE>mode))"
   "@
stfd%U0%X0 %1,%0
lfd%U1%X1 %0,%1
-   fmr %0,%1
+   xxlor %x0,%x1,%x1
lxsd %0,%1
stxsd %1,%0
lxsdx %x0,%y1
stxsdx %x1,%y0
-   xxlor %x0,%x1,%x1
+   fmr %0,%1
xxlxor %x0,%x0,%x0
li %0,0
std%U0%X0 %1,%0
@@ -8582,23 +8590,28 @@
nop
mfvsrd %0,%x1
mtvsrd %x0,%1
+   fmr %0,%1
+   fmr %0,%1
#"
   [(set_attr "type"
-"fpstore, fpload, fpsimple,   fpload, fpstore,
+"fpstore, fpload, veclogical, fpload, fpstore,
  fpload,  fpstore,veclogical, veclogical, integer,
  store,   load,   *,  mtjmpr, mfjmpr,
- *,   mfvsr,  mtvsr,  vecperm")
+ *,   mfvsr,  mtvsr,  fpsimple,   fpsimple,
+ vecperm")
(set_attr "size" "64")
(set_attr "isa"
-"*,   *,  *,  p9v,p9v,
- p7v, p7v,*,  *,  *,
- *,   *,  *,  *,  *,
- *,   p8v,p8v,p10")
+"*,  *,  p7p8v,   p9v,p9v,
+ p7v, p7v,*,   *,  *,
+ *,   *,  *,   *,  *,
+ *,   p8v,p8v, *,  *,
+ p10")
(set_attr "prefixed"
 "*,   *,  *,  *,  *,
  *,   *,  *,  *,  *,
  *,   *,  *,  *,  *,
- *,   *,  *,  *")])
+ *,   *,  *,  *,  *,
+ *")])
 
 ;;   STD  LD   MR  MT MF G-const
 ;;   H-const  F-const  Special
-- 
2.39.3



[PATCH v8] tree-ssa-sink: Improve code sinking pass

2023-10-12 Thread Ajit Agarwal
This patch improves the code sinking pass to sink statements before calls,
which reduces register pressure.
Review comments are incorporated.  Synced with the latest trunk sources and
updated accordingly.

For example:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d + e + f;
  if (a != 5)
    {
      bar();
      j = l;
    }
}

Code sinking does the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;

  if (a != 5)
    {
      l = a + b + c + d + e + f;
      bar();
      j = l;
    }
}

Bootstrapped regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

tree-ssa-sink: Improve code sinking pass

Currently, code sinking will sink code after function calls.  This increases
register pressure for callee-saved registers.  The following patch improves
code sinking by placing the sunk code before calls in the use block or in
the immediate dominator of the use blocks.
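
The threshold logic that the tree-ssa-sink.cc hunk moves to the top of select_best_block can be sketched in isolation. This is a simplified model, not the patch itself: plain integers replace GCC's profile counts, and `sink_threshold` and `below_threshold_p` are hypothetical helper names.

```cpp
// Compute the sinking threshold (a percentage).  Statements with memory
// operands get 7 extra points, clamped at 100, as in the hunk.
int
sink_threshold (int base_threshold, bool has_memory_operands)
{
  int threshold = base_threshold;
  if (has_memory_operands)
    {
      threshold += 7;
      if (threshold > 100)
        threshold = 100;
    }
  return threshold;
}

// Models the test  !(temp_bb->count * 100 >= early_bb->count * threshold):
// true when the candidate block executes less often than THRESHOLD percent
// of the early block.
bool
below_threshold_p (long long temp_count, long long early_count, int threshold)
{
  return !(temp_count * 100 >= early_count * threshold);
}
```

For instance, with a base threshold of 75 (a sample value chosen for illustration), a statement with memory operands is checked against 82, and a block that runs 50 times for every 100 runs of the early block counts as below the threshold.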

2023-10-12  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR tree-optimization/81953
* tree-ssa-sink.cc (statement_sink_location): Move statements before
calls.
(select_best_block): Add heuristics to select the best blocks in the
immediate post dominator.

gcc/testsuite/ChangeLog:

PR tree-optimization/81953
* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
* gcc.dg/tree-ssa/ssa-sink-22.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 ++
 gcc/tree-ssa-sink.cc| 39 -
 3 files changed, 56 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..84e7938c54f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index a360c5cdd6e..95298bc8402 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -174,7 +174,8 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool 
*debug_stmts)
 
 /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
tree, return the best basic block between them (inclusive) to place
-   statements.
+   statements. The best basic block should be an immediate dominator of
+   best basic block if the use stmt is after the call.
 
We want the most control dependent block in the shallowest loop nest.
 
@@ -196,6 +197,16 @@ select_best_block (basic_block early_bb,
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
   int threshold;
+  /* Get the sinking threshold.  If the statement to be moved has memory
+ operands, then increase the threshold by 7% as those are even more
+ profitable to avoid, clamping at 100%.  */
+  threshold = param_sink_frequency_threshold;
+  if (gimple_vuse (stmt) || gimple_vdef (stmt))
+{
+  threshold += 7;
+  if (threshold > 100)
+   threshold = 100;
+}
 
   while (temp_bb != early_bb)
 {
@@ -204,6 +215,14 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
best_bb = temp_bb;
 
+  /* If temp_bb is post-dominated by the use block, then the immediate
+     dominator would be our best block.  */
+  if (!gimple_vuse (stmt)
+ && bb_loop_depth (temp_bb) == bb_loop_depth (early_bb)
+ && !(temp_bb->count * 100 >= early_bb->count * threshold)
+ && dominated_by_p (CDI_DOMINATORS, late_bb, temp_bb))
+   best_bb = temp_bb;
+
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
@@ -233,17 +252,6 @@ select_best_block (basic_block early_bb,
   && !dominated_by_p (CDI_DOMINATORS, b

[PATCH v9] Improve code sinking pass

2023-10-12 Thread Ajit Agarwal
This patch improves the code sinking pass to sink statements before calls,
which reduces register pressure.
Review comments are incorporated.  Synced with the latest sources and updated
the code changes accordingly.

For example:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d + e + f;
  if (a != 5)
    {
      bar();
      j = l;
    }
}

Code sinking does the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;

  if (a != 5)
    {
      l = a + b + c + d + e + f;
      bar();
      j = l;
    }
}

Bootstrapped regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

tree-ssa-sink: Improve code sinking pass

Currently, code sinking will sink code after function calls.  This increases
register pressure for callee-saved registers.  The following patch improves
code sinking by placing the sunk code before calls in the use block or in
the immediate dominator of the use blocks.
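
The dominator-tree walk that select_best_block performs between the late and early blocks can be sketched on a toy CFG. This models only the unchanged loop visible in the diff context (prefer a shallower loop nest while walking up immediate dominators); `Block` and `shallowest_block` are hypothetical names, and the new heuristic added by the patch is not included.

```cpp
#include <vector>

// Toy basic block: index of its immediate dominator (-1 for the entry)
// and its loop nesting depth.
struct Block
{
  int idom;
  int loop_depth;
};

// Walk from LATE up the dominator tree until EARLY is reached and return
// the block with the shallowest loop depth seen on the way.  EARLY itself
// is not examined, matching the loop condition temp_bb != early_bb.
int
shallowest_block (const std::vector<Block> &bbs, int early, int late)
{
  int best = late;
  int temp = late;
  while (temp != early)
    {
      if (bbs[temp].loop_depth < bbs[best].loop_depth)
        best = temp;
      temp = bbs[temp].idom;
    }
  return best;
}
```

Because the comparison is strict, ties keep the block closest to the use, so a statement is not hoisted further up than needed when loop depths are equal.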

2023-10-12  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR tree-optimization/81953
* tree-ssa-sink.cc (statement_sink_location): Move statements before
calls.
(select_best_block): Add heuristics to select the best blocks in the
immediate post dominator.

gcc/testsuite/ChangeLog:

PR tree-optimization/81953
* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
* gcc.dg/tree-ssa/ssa-sink-22.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 ++
 gcc/tree-ssa-sink.cc| 39 -
 3 files changed, 56 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..84e7938c54f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index a360c5cdd6e..95298bc8402 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -174,7 +174,8 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool 
*debug_stmts)
 
 /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
tree, return the best basic block between them (inclusive) to place
-   statements.
+   statements. The best basic block should be an immediate dominator of
+   best basic block if the use stmt is after the call.
 
We want the most control dependent block in the shallowest loop nest.
 
@@ -196,6 +197,16 @@ select_best_block (basic_block early_bb,
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
   int threshold;
+  /* Get the sinking threshold.  If the statement to be moved has memory
+ operands, then increase the threshold by 7% as those are even more
+ profitable to avoid, clamping at 100%.  */
+  threshold = param_sink_frequency_threshold;
+  if (gimple_vuse (stmt) || gimple_vdef (stmt))
+{
+  threshold += 7;
+  if (threshold > 100)
+   threshold = 100;
+}
 
   while (temp_bb != early_bb)
 {
@@ -204,6 +215,14 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
best_bb = temp_bb;
 
+  /* If temp_bb is post-dominated by the use block, then the immediate
+     dominator would be our best block.  */
+  if (!gimple_vuse (stmt)
+ && bb_loop_depth (temp_bb) == bb_loop_depth (early_bb)
+ && !(temp_bb->count * 100 >= early_bb->count * threshold)
+ && dominated_by_p (CDI_DOMINATORS, late_bb, temp_bb))
+   best_bb = temp_bb;
+
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
@@ -233,17 +252,6 @@ select_best_block (basic_block early_bb,
   && !dominated_b

[PATCH v9] tree-ssa-sink: Improve code sinking pass

2023-10-12 Thread Ajit Agarwal
This patch improves the code sinking pass to sink statements before calls,
which reduces register pressure.
Review comments are incorporated.
Synced with the latest trunk sources and modified the sinking pass accordingly.

For example:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d + e + f;
  if (a != 5)
    {
      bar();
      j = l;
    }
}

Code sinking does the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;

  if (a != 5)
    {
      l = a + b + c + d + e + f;
      bar();
      j = l;
    }
}

Bootstrapped regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

tree-ssa-sink: Improve code sinking pass

Currently, code sinking will sink code after function calls.  This increases
register pressure for callee-saved registers.  The following patch improves
code sinking by placing the sunk code before calls in the use block or in
the immediate dominator of the use blocks.

2023-10-12  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR tree-optimization/81953
* tree-ssa-sink.cc (statement_sink_location): Move statements before
calls.
(select_best_block): Add heuristics to select the best blocks in the
immediate post dominator.

gcc/testsuite/ChangeLog:

PR tree-optimization/81953
* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
* gcc.dg/tree-ssa/ssa-sink-22.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 ++
 gcc/tree-ssa-sink.cc| 39 -
 3 files changed, 56 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..84e7938c54f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index a360c5cdd6e..95298bc8402 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -174,7 +174,8 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool 
*debug_stmts)
 
 /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
tree, return the best basic block between them (inclusive) to place
-   statements.
+   statements. The best basic block should be an immediate dominator of
+   best basic block if the use stmt is after the call.
 
We want the most control dependent block in the shallowest loop nest.
 
@@ -196,6 +197,16 @@ select_best_block (basic_block early_bb,
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
   int threshold;
+  /* Get the sinking threshold.  If the statement to be moved has memory
+ operands, then increase the threshold by 7% as those are even more
+ profitable to avoid, clamping at 100%.  */
+  threshold = param_sink_frequency_threshold;
+  if (gimple_vuse (stmt) || gimple_vdef (stmt))
+{
+  threshold += 7;
+  if (threshold > 100)
+   threshold = 100;
+}
 
   while (temp_bb != early_bb)
 {
@@ -204,6 +215,14 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
best_bb = temp_bb;
 
+  /* If temp_bb is post-dominated by the use block, then the immediate
+     dominator would be our best block.  */
+  if (!gimple_vuse (stmt)
+ && bb_loop_depth (temp_bb) == bb_loop_depth (early_bb)
+ && !(temp_bb->count * 100 >= early_bb->count * threshold)
+ && dominated_by_p (CDI_DOMINATORS, late_bb, temp_bb))
+   best_bb = temp_bb;
+
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
@@ -233,17 +252,6 @@ select_best_block (basic_block early_bb,
   && !domina

[PING ^0][PATCH v3] rs6000: fmr gets used instead of faster xxlor [PR93571]

2023-10-15 Thread Ajit Agarwal


Hello Segher:


Please review.

Thanks & Regards
Ajit

 Forwarded Message 
Subject: [PATCH v3] rs6000: fmr gets used instead of faster xxlor [PR93571]
Date: Tue, 10 Oct 2023 18:14:00 +0530
From: Ajit Agarwal 
To: gcc-patches 
CC: Segher Boessenkool , Peter Bergner 
, Kewen.Lin 

Hello Segher:

Here is the patch that uses xxlor instead of fmr where possible.
Performance results show that fmr is better on power9 and
power10, whereas xxlor is better on power7 and power8.
fmr is the only option before p7.

Incorporated review comments.

Bootstrapped and regtested on powerpc64-linux-gnu

Thanks & Regards
Ajit

rs6000: Use xxlor instead of fmr where possible

Replace fmr with the xxlor instruction on power7 and power8;
keep the fmr instruction on power9 and power10.
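
As a rough illustration of the selection rule described in this commit message, here is a minimal sketch. The enum and function names are hypothetical and are not GCC's; the real patch expresses the choice through the "isa" attribute in rs6000.md, keyed off rs6000_tune, TARGET_VSX and TARGET_P9_VECTOR.

```c
/* Hypothetical CPU and instruction tags for illustration; not GCC's enums.  */
enum toy_cpu
{
  TOY_PRE_POWER7,
  TOY_POWER7,
  TOY_POWER8,
  TOY_POWER9,
  TOY_POWER10
};

enum toy_insn { TOY_FMR, TOY_XXLOR };

/* Pick the register-to-register FP move as the commit message describes:
   xxlor on power7 and power8, fmr everywhere else (before power7 fmr is
   the only option; on power9 and power10 fmr is kept).  */
static enum toy_insn
toy_fp_move_insn (enum toy_cpu cpu)
{
  if (cpu == TOY_POWER7 || cpu == TOY_POWER8)
    return TOY_XXLOR;
  return TOY_FMR;
}
```

Calling toy_fp_move_insn (TOY_POWER8) yields TOY_XXLOR, while TOY_POWER9 and TOY_PRE_POWER7 both yield TOY_FMR.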

Perf measurement results:

Power9 fmr:  201,847,661 cycles.
Power9 xxlor: 201,877,78 cycles.
Power8 fmr: 200,901,043 cycles.
Power8 xxlor: 201,020,518 cycles.
Power7 fmr: 201,059,524 cycles.
Power7 xxlor: 201,042,851 cycles.

2023-10-10  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000.md (*movdf_hardfloat64): Use xxlor for power7
and power8 and fmr for power9 and power10.
---
 gcc/config/rs6000/rs6000.md | 45 -
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index d4337ce42a9..46982637d79 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -355,7 +355,7 @@
   (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
 
 ;; The ISA we implement.
-(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10"
+(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p7p8v,p9,p9v,p9kf,p9tf,p10"
   (const_string "any"))
 
 ;; Is this alternative enabled for the current CPU/ISA/etc.?
@@ -403,6 +403,12 @@
  (and (eq_attr "isa" "p10")
  (match_test "TARGET_POWER10"))
  (const_int 1)
+
+ (and (eq_attr "isa" "p7p8v")
+ (match_test "rs6000_tune != PROCESSOR_POWER9
+  && TARGET_VSX && !TARGET_P9_VECTOR"))
+ (const_int 1)
+
 ] (const_int 0)))
 
 ;; If this instruction is microcoded on the CELL processor
@@ -8551,27 +8557,29 @@
 
 (define_insn "*mov_hardfloat64"
   [(set (match_operand:FMOVE64 0 "nonimmediate_operand"
-   "=m,   d,  d,  ,   wY,
- ,Z,  ,  ,  !r,
+   "=m,   d,  ,  ,   wY,
+ ,Z,  wa, ,  !r,
  YZ,  r,  !r, *c*l,   !r,
-*h,   r,  ,   wa")
+*h,   r,  ,   d,  wn,
+wa")
(match_operand:FMOVE64 1 "input_operand"
-"d,   m,  d,  wY, ,
- Z,   ,   ,  ,  ,
+"d,   m,  ,  wY, ,
+ Z,   ,   wa, ,  ,
  r,   YZ, r,  r,  *h,
- 0,   ,   r,  eP"))]
+ 0,   ,   r,  d,  wn,
+ eP"))]
   "TARGET_POWERPC64 && TARGET_HARD_FLOAT
&& (gpc_reg_operand (operands[0], mode)
|| gpc_reg_operand (operands[1], mode))"
   "@
stfd%U0%X0 %1,%0
lfd%U1%X1 %0,%1
-   fmr %0,%1
+   xxlor %x0,%x1,%x1
lxsd %0,%1
stxsd %1,%0
lxsdx %x0,%y1
stxsdx %x1,%y0
-   xxlor %x0,%x1,%x1
+   fmr %0,%1
xxlxor %x0,%x0,%x0
li %0,0
std%U0%X0 %1,%0
@@ -8582,23 +8590,28 @@
nop
mfvsrd %0,%x1
mtvsrd %x0,%1
+   fmr %0,%1
+   fmr %0,%1
#"
   [(set_attr "type"
-"fpstore, fpload, fpsimple,   fpload, fpstore,
+"fpstore, fpload, veclogical, fpload, fpstore,
  fpload,  fpstore,veclogical, veclogical, integer,
  store,   load,   *,  mtjmpr, mfjmpr,
- *,   mfvsr,  mtvsr,  vecperm")
+ *,   mfvsr,  mtvsr,  fpsimple,   fpsimple,
+ vecperm")
(set_attr "size" "64")
(set_attr "isa"
-"*,   *,  *,  p9v,p9v,
- p7v, p7v,*,  *,  *,
- *,   *,  *,  *,  *,
- *,   p8v,p8v,p10

[PING ^0][PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-10-15 Thread Ajit Agarwal
Hello All:

Please review.

Thanks & Regards
Ajit


 Forwarded Message 
Subject: [PATCH v2] rs6000: Add new pass for replacement of contiguous 
addresses vector load lxv with lxvp
Date: Sun, 8 Oct 2023 00:34:27 +0530
From: Ajit Agarwal 
To: gcc-patches 
CC: Segher Boessenkool , Peter Bergner 
, Kewen.Lin 

Hello All:

This patch adds a new pass to replace vector loads (lxv) from contiguous
addresses with the MMA instruction lxvp. This version also fixes one
regression failure on the ARM architecture.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Add new pass for replacement of contiguous lxv with lxvp.

New pass to replace lxv loads from contiguous addresses with lxvp. This pass
is registered after the ree RTL pass.

2023-10-07  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: Registered vecload pass.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add new object file.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
---
 gcc/config.gcc |   4 +-
 gcc/config/rs6000/rs6000-passes.def|   1 +
 gcc/config/rs6000/rs6000-protos.h  |   2 +
 gcc/config/rs6000/rs6000-vecload-opt.cc| 234 +
 gcc/config/rs6000/rs6000.cc|   3 +-
 gcc/config/rs6000/t-rs6000 |   4 +
 gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
 7 files changed, 260 insertions(+), 3 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..482ab094b89 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -515,7 +515,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -552,7 +552,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..9ecf8ce6a9c 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
  The power8 does not have instructions that automaticaly do the byte swaps
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..9c44bae33d3 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
 extern bool gpr_or_gpr_p (rtx, rtx);
 extern bool direct_move_p (rtx, rtx);
 extern bool quad_address_p (rtx, machine_mode, bool);
+extern bool mode_supports_dq_form (machine_mode);
 extern bool quad_load_store_p (rtx, rtx);
 extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
 extern void expand_fusion_gpr_load (rtx *);
@@ -344,6 +345,7 @@ class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
new file mode 100644
index 000..63ee733af89
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
@@ -0,0 +1,234 @@
+/* Subroutines 

[PING ^0] [PATCH v8 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-15 Thread Ajit Agarwal


Hello All:

Please review and share your comments so that the code changes can be committed
to trunk.

Thanks & Regards
Ajit

 Forwarded Message 
Subject: [PATCH v8 4/4] ree: Improve ree pass for rs6000 target using defined 
ABI interfaces
Date: Wed, 20 Sep 2023 02:40:49 +0530
From: Ajit Agarwal 
To: gcc-patches 
CC: Jeff Law , Vineet Gupta , 
Richard Biener , Segher Boessenkool 
, Peter Bergner 

Hello All:

Version 8 of the patch uses ABI interfaces to eliminate redundant zero and sign
extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

Incorporated all the review comments on version 6. Added sign extension
elimination using ABI interfaces.

Thanks & Regards
Ajit

ree: Improve ree pass for rs6000 target using defined abi interfaces

For the rs6000 target we see redundant zero and sign extensions. Improve
the ree pass to eliminate such redundant extensions using the defined
ABI interfaces.
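
To illustrate the kind of redundancy targeted here, the following is a minimal sketch, assuming an ABI in which the caller already promotes a narrow unsigned argument to a full 64-bit register. The function names are hypothetical; this models the idea in plain C and is not taken from ree.cc or from the zext-elim-3.C testcase.

```c
#include <stdint.h>

/* Hypothetical model of the caller side: the ABI promotes an 8-bit
   unsigned argument to a full 64-bit register value.  */
static uint64_t
promote_u8 (uint8_t arg)
{
  return (uint64_t) arg;
}

/* Model of the zero extension from 8 to 64 bits that the callee would
   otherwise apply to its incoming argument register.  */
static uint64_t
zero_extend_qi_to_di (uint64_t reg)
{
  return reg & UINT64_C (0xFF);
}

/* Return 1 when extending an already promoted value changes nothing,
   which is the situation in which the extension can be dropped.  */
static int
extension_is_redundant (uint8_t arg)
{
  uint64_t reg = promote_u8 (arg);
  return zero_extend_qi_to_di (reg) == reg;
}
```

Under that assumption, extension_is_redundant returns 1 for every 8-bit input, because the upper bits are already zero when the callee sees the register.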

2023-09-20  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Use of zero_extend and sign_extend
defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C: New test.
---
 gcc/ree.cc| 161 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 171 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..e833db2432d 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,134 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode of zero_extend
+   or sign_extend otherwise false.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+= targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+  NULL_TREE, 1);
+
+  if (tgt_mode == mode)
+return true;
+  else
+return false;
+}
+
+/* Return TRUE if the candidate insn is zero extend and regno is
+   a return registers.  */
+
+static bool
+abi_extension_candidate_return_reg_p (rtx_insn *insn, int regno)
+{
+  rtx set = single_set (insn);
+  rtx src = SET_SRC (set);
+
+  if (GET_CODE (src) != ZERO_EXTEND && GET_CODE (src) != SIGN_EXTEND)
+return false;
+
+  if (targetm.calls.function_value_regno_p (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if reg source operand of zero_extend is argument registers
+   and not return registers and source and destination operand are same
+   and mode of source and destination operand are not same.  */
+
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  rtx src = SET_SRC (set);
+
+  if (GET_CODE (src) != ZERO_EXTEND && GET_CODE (src) != SIGN_EXTEND)
+return false;
+
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+  || abi_extension_candidate_return_reg_p (insn, REGNO (orig_src)))
+return false;
+
+  /* Mode of destination and source of zero_extend should be different.  */
+  if (dst_mode == GET_MODE (orig_src))
+return false;
+
+  /* REGNO of source and destination of zero_extend should be same.  */
+  if (REGNO (SET_DEST (set)) != REGNO (orig_src))
+return false;
+
+  return true;
+}
+
+/* Return TRUE if the candidate insn is zero extend and regno is
+   an argument registers.  */
+
+static bool
+abi_extension_candidate_argno_p (rtx_code code, int regno)
+{
+  if (code != ZERO_EXTEND && code != SIGN_EXTEND)
+return false;
+
+  if (FUNCTION_ARG_REGNO_P (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+return false;
+
+  for (df_link *use = uses; use; use = use->next)
+{
+  if (!use->ref)
+   return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (DF_REF_INSN (use->ref)))
+   return false;
+

[PING ^0] [PATCH v2 3/4] Improve functionality of ree pass with various constants with AND operation.

2023-10-15 Thread Ajit Agarwal



Hello All:

Please review. This patch extends the modes and constants that the ree pass
supports for sign and zero extension elimination.

Please share your comments so that it can be committed to trunk.
Thanks & Regards
Ajit
 Forwarded Message 
Subject: [PATCH v2 3/4] Improve functionality of ree pass with various 
constants with AND operation.
Date: Tue, 19 Sep 2023 14:51:16 +0530
From: Ajit Agarwal 
To: gcc-patches 
CC: Jeff Law , Vineet Gupta , 
Richard Biener , Peter Bergner 
, Segher Boessenkool 


Hello Jeff:

This patch improves the ree pass to eliminate redundant zero and sign extensions
for the rs6000 target.

Bootstrapped and regtested for powerpc64-linux-gnu.

Thanks & Regards
Ajit


ree: Improve ree pass

For the rs6000 target we see redundant zero and sign extensions, and the ree
pass is improved to eliminate them. Adds support for
zero_extend/sign_extend/AND.
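
The idea that an AND with a constant can act as a zero extension can be sketched in plain C as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the sketch uses the textbook low-bit masks 0xFF, 0xFFFF and 0xFFFFFFFF rather than the specific constants tested by rtx_is_zext_p in the patch.

```c
#include <stdint.h>

/* Hypothetical helper: if MASK keeps exactly the low 8, 16 or 32 bits,
   return that width; otherwise return 0.  */
static unsigned
zext_width_of_mask (uint64_t mask)
{
  if (mask == UINT64_C (0xFF))
    return 8;
  if (mask == UINT64_C (0xFFFF))
    return 16;
  if (mask == UINT64_C (0xFFFFFFFF))
    return 32;
  return 0;
}

/* Zero-extend the low WIDTH bits of X to 64 bits by truncating and
   widening again.  Unsupported widths return X unchanged.  */
static uint64_t
zero_extend (uint64_t x, unsigned width)
{
  switch (width)
    {
    case 8:
      return (uint64_t) (uint8_t) x;
    case 16:
      return (uint64_t) (uint16_t) x;
    case 32:
      return (uint64_t) (uint32_t) x;
    default:
      return x;
    }
}

/* Return 1 when (X & MASK) computes the same value as a zero extension
   of X from the width that MASK selects.  */
static int
and_acts_as_zext (uint64_t x, uint64_t mask)
{
  unsigned width = zext_width_of_mask (mask);
  return width != 0 && (x & mask) == zero_extend (x, width);
}
```

For the three recognised masks the check holds for any x; masks that are not of this form are simply rejected by the sketch.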

2023-09-04  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (eliminate_across_bbs_p): Add checks to enable extension
elimination across and within basic blocks.
(def_arith_p): New function to check definition has arithmetic
operation.
(combine_set_extension): Modification to incorporate AND
and current zero_extend and sign_extend instruction.
(merge_def_and_ext): Add calls to eliminate_across_bbs_p and
zero_extend sign_extend and AND instruction.
(rtx_is_zext_p): New function.
(feasible_cfg): New function.
* rtl.h (reg_used_set_between_p): Add prototype.
* rtlanal.cc (reg_used_set_between_p): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim.C: New testcase.
* g++.target/powerpc/zext-elim-1.C: New testcase.
* g++.target/powerpc/zext-elim-2.C: New testcase.
* g++.target/powerpc/sext-elim.C: New testcase.
---
 gcc/ree.cc| 487 --
 gcc/rtl.h |   1 +
 gcc/rtlanal.cc|  15 +
 gcc/testsuite/g++.target/powerpc/sext-elim.C  |  17 +
 .../g++.target/powerpc/zext-elim-1.C  |  19 +
 .../g++.target/powerpc/zext-elim-2.C  |  11 +
 gcc/testsuite/g++.target/powerpc/zext-elim.C  |  30 ++
 7 files changed, 534 insertions(+), 46 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/sext-elim.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-2.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..931b9b08821 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -253,6 +253,77 @@ struct ext_cand
 
 static int max_insn_uid;
 
+/* Return TRUE if OP can be considered a zero extension from one or
+   more sub-word modes to larger modes up to a full word.
+
+   For example (and:DI (reg) (const_int X))
+
+   Depending on the value of X could be considered a zero extension
+   from QI, HI and SI to larger modes up to DImode.  */
+
+static bool
+rtx_is_zext_p (rtx insn)
+{
+  if (GET_CODE (insn) == AND)
+{
+  rtx set = XEXP (insn, 0);
+  if (REG_P (set))
+   {
+ rtx src = XEXP (insn, 1);
+ machine_mode m_mode = GET_MODE (set);
+
+ if (CONST_INT_P (src)
+ && (INTVAL (src) == 1
+ || (m_mode == QImode && INTVAL (src) == 0x7)
+ || (m_mode == QImode && INTVAL (src) == 0x007F)
+ || (m_mode == HImode && INTVAL (src) == 0x7FFF)
+ || (m_mode == SImode && INTVAL (src) == 0x007F)))
+   return true;
+
+   }
+  else
+   return false;
+}
+
+  return false;
+}
+/* Return TRUE if OP can be considered a zero extension from one or
+   more sub-word modes to larger modes up to a full word.
+
+   For example (and:DI (reg) (const_int X))
+
+   Depending on the value of X could be considered a zero extension
+   from QI, HI and SI to larger modes up to DImode.  */
+
+static bool
+rtx_is_zext_p (rtx_insn *insn)
+{
+  rtx body = single_set (insn);
+
+  if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) == AND)
+   {
+ rtx set = XEXP (SET_SRC (body), 0);
+
+ if (REG_P (set) && GET_MODE (SET_DEST (body)) == GET_MODE (set))
+   {
+ rtx src = XEXP (SET_SRC (body), 1);
+ machine_mode m_mode = GET_MODE (set);
+
+ if (CONST_INT_P (src)
+ && (INTVAL (src) == 1
+ || (m_mode == QImode && INTVAL (src) == 0x7)
+ || (m_mode == QImode && INTVAL (src) == 0x007F)
+ || (m_mode == HImode && INTVAL (src) == 0x7FFF)
+ || (m_mode == SImode && INTVAL (src) == 0x007F)))
+   return true;
+   }
+ else
+  return false;

Re: [PATCH v8] tree-ssa-sink: Improve code sinking pass

2023-10-17 Thread Ajit Agarwal
Hello Richard:

On 17/10/23 2:03 pm, Richard Biener wrote:
> On Thu, Oct 12, 2023 at 10:42 AM Ajit Agarwal  wrote:
>>
>> This patch improves the code sinking pass to sink statements before calls to
>> reduce register pressure.
>> Review comments are incorporated. Synced and modified with latest trunk
>> sources.
>>
>> For example :
>>
>> void bar();
>> int j;
>> void foo(int a, int b, int c, int d, int e, int f)
>> {
>>   int l;
>>   l = a + b + c + d +e + f;
>>   if (a != 5)
>> {
>>   bar();
>>   j = l;
>> }
>> }
>>
>> Code Sinking does the following:
>>
>> void bar();
>> int j;
>> void foo(int a, int b, int c, int d, int e, int f)
>> {
>>   int l;
>>
>>   if (a != 5)
>> {
>>   l = a + b + c + d +e + f;
>>   bar();
>>   j = l;
>> }
>> }
>>
>> Bootstrapped regtested on powerpc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>> tree-ssa-sink: Improve code sinking pass
>>
>> Currently, code sinking will sink code after function calls.  This increases
>> register pressure for callee-saved registers.  The following patch improves
>> code sinking by placing the sunk code before calls in the use block or in
>> the immediate dominator of the use blocks.
> 
> The patch no longer does what the description above says.
Why do you think so? Please let me know.
> 
> More comments below.
> 
>> 2023-10-12  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>> PR tree-optimization/81953
>> * tree-ssa-sink.cc (statement_sink_location): Move statements before
>> calls.
>> (select_best_block): Add heuristics to select the best blocks in the
>> immediate post dominator.
>>
>> gcc/testsuite/ChangeLog:
>>
>> PR tree-optimization/81953
>> * gcc.dg/tree-ssa/ssa-sink-20.c: New test.
>> * gcc.dg/tree-ssa/ssa-sink-21.c: New test.
>> ---
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 ++
>>  gcc/tree-ssa-sink.cc| 39 -
>>  3 files changed, 56 insertions(+), 17 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
>>
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>> new file mode 100644
>> index 000..d3b79ca5803
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>> @@ -0,0 +1,15 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fdump-tree-sink-stats" } */
>> +void bar();
>> +int j;
>> +void foo(int a, int b, int c, int d, int e, int f)
>> +{
>> +  int l;
>> +  l = a + b + c + d +e + f;
>> +  if (a != 5)
>> +{
>> +  bar();
>> +  j = l;
>> +}
>> +}
>> +/* { dg-final { scan-tree-dump {l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
>> new file mode 100644
>> index 000..84e7938c54f
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
>> @@ -0,0 +1,19 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fdump-tree-sink-stats" } */
>> +void bar();
>> +int j, x;
>> +void foo(int a, int b, int c, int d, int e, int f)
>> +{
>> +  int l;
>> +  l = a + b + c + d +e + f;
>> +  if (a != 5)
>> +{
>> +  bar();
>> +  if (b != 3)
>> +x = 3;
>> +  else
>> +x = 5;
>> +  j = l;
>> +}
>> +}
>> +/* { dg-final { scan-tree-dump {l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
>> diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
>> index a360c5cdd6e..95298bc8402 100644
>> --- a/gcc/tree-ssa-sink.cc
>> +++ b/gcc/tree-ssa-sink.cc
>> @@ -174,7 +174,8 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool *debug_stmts)
>>
>>  /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
>> tree, return the best basic block between them (inclusive) to place
>> -   statements.
>> +   statements. The best basic block should be an immediate dominator of
>> +   best basic block if the us

[PATCH v10] tree-ssa-sink: Improve code sinking pass

2023-10-17 Thread Ajit Agarwal
Currently, code sinking sinks code to the use points when the loop nesting
depth is the same. This patch improves code sinking by placing the sunk
code in the immediate dominator with the same loop nest depth.

Review comments are incorporated.

For example:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d +e + f;
  if (a != 5)
{
  bar();
  j = l;
}
}

Code Sinking does the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;

  if (a != 5)
{
  l = a + b + c + d +e + f;
  bar();
  j = l;
}
}
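
The dominator walk described at the top of this message can be modelled roughly as follows. This is a toy sketch under stated assumptions: struct toy_bb and toy_select_best_block are hypothetical stand-ins for GCC's basic_block and select_best_block, and the sketch leaves out the memory-operand (gimple_vuse) condition and the profitability checks that the real function applies.

```c
#include <stddef.h>

/* Toy stand-in for a basic block; the field names are hypothetical.  */
struct toy_bb
{
  struct toy_bb *idom; /* Immediate dominator, NULL for the entry block.  */
  int loop_depth;      /* Loop nesting depth of this block.  */
};

/* Walk from LATE up the dominator tree towards EARLY (exclusive).
   Prefer a block in a shallower loop nest and, following the description
   above, also move to a dominator at the same loop depth.  */
static struct toy_bb *
toy_select_best_block (struct toy_bb *early, struct toy_bb *late)
{
  struct toy_bb *best = late;

  for (struct toy_bb *temp = late; temp != NULL && temp != early;
       temp = temp->idom)
    if (temp->loop_depth <= best->loop_depth)
      best = temp;

  return best;
}
```

With a dominator chain early -> a -> b -> late, where a has depth 0 and b and late have depth 1, the walk ends at a, which is the dominator closest to early with the shallowest loop depth seen.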

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


tree-ssa-sink: Improve code sinking pass

Currently, code sinking sinks code to the use points when the loop nesting
depth is the same. This patch improves code sinking by placing the sunk
code in the immediate dominator with the same loop nest depth.

2023-10-17  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR tree-optimization/81953
* tree-ssa-sink.cc (statement_sink_location): Move statements with
same loop nest depth.
(select_best_block): Add heuristics to select the best blocks in the
immediate dominator.

gcc/testsuite/ChangeLog:

PR tree-optimization/81953
* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
* gcc.dg/tree-ssa/ssa-sink-22.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 +++
 gcc/tree-ssa-sink.cc| 16 +++-
 3 files changed, 45 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump {l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..84e7938c54f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump {l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index a360c5cdd6e..d96df0d81e9 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -176,6 +176,9 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool *debug_stmts)
tree, return the best basic block between them (inclusive) to place
statements.
 
+   The best basic block should be an immediate dominator of
+   best basic block if we've moved to same loop nest.
+
We want the most control dependent block in the shallowest loop nest.
 
If the resulting block is in a shallower loop nest, then use it.  Else
@@ -204,11 +207,16 @@ select_best_block (basic_block early_bb,
   if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
best_bb = temp_bb;
 
+  /* If we've moved into same loop nest, then that becomes
+our best block.  */
+  if (!gimple_vuse (stmt)
+ && bb_loop_depth (temp_bb) == bb_loop_depth (best_bb))
+best_bb = temp_bb;
+
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 }
-
   /* Placing a statement before a setjmp-like function would be invalid
  (it cannot be reevaluated when execution follows an abnormal edge).
  If we selected a block with abnormal predecessors, just punt.  */
@@ -430,6 +438,7 @@ statement_sink_location (gimple *stmt, basic_block frombb,
continue;
  break;
}
+
   use = USE_STMT (one_use);
 
   if (gimple_code (use) != GIMPLE_PHI)
@@ -439,10 +448,7 @@ statement_sink_location (gimple *stmt, basic_block frombb,
  if (sinkbb == frombb)
return false;
 
- if (sinkbb == gimple_bb (use))
-   *togsi = gsi_for_stmt (use);
- else
-   *togsi = gsi_after_labels (sinkbb);
+ *togsi = gsi_after_labels (sinkbb);
 
  return true;
}
-- 
2.39.3



Re: [PATCH v8] tree-ssa-sink: Improve code sinking pass

2023-10-17 Thread Ajit Agarwal
Hello Richard:

The review comments below are incorporated in version 10 of the patch.
Please review and let me know if it's okay for trunk.


Thanks & Regards
Ajit

On 17/10/23 2:47 pm, Richard Biener wrote:
> On Tue, Oct 17, 2023 at 10:53 AM Ajit Agarwal  wrote:
>>
>> Hello Richard:
>>
>> On 17/10/23 2:03 pm, Richard Biener wrote:
>>> On Thu, Oct 12, 2023 at 10:42 AM Ajit Agarwal  
>>> wrote:
>>>>
>>>> This patch improves the code sinking pass to sink statements before calls to
>>>> reduce register pressure.
>>>> Review comments are incorporated. Synced and modified with latest trunk
>>>> sources.
>>>>
>>>> For example :
>>>>
>>>> void bar();
>>>> int j;
>>>> void foo(int a, int b, int c, int d, int e, int f)
>>>> {
>>>>   int l;
>>>>   l = a + b + c + d +e + f;
>>>>   if (a != 5)
>>>> {
>>>>   bar();
>>>>   j = l;
>>>> }
>>>> }
>>>>
>>>> Code Sinking does the following:
>>>>
>>>> void bar();
>>>> int j;
>>>> void foo(int a, int b, int c, int d, int e, int f)
>>>> {
>>>>   int l;
>>>>
>>>>   if (a != 5)
>>>> {
>>>>   l = a + b + c + d +e + f;
>>>>   bar();
>>>>   j = l;
>>>> }
>>>> }
>>>>
>>>> Bootstrapped regtested on powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>> tree-ssa-sink: Improve code sinking pass
>>>>
>>>> Currently, code sinking will sink code after function calls.  This increases
>>>> register pressure for callee-saved registers.  The following patch improves
>>>> code sinking by placing the sunk code before calls in the use block or in
>>>> the immediate dominator of the use blocks.
>>>
>>> The patch no longer does what the description above says.
>> Why do you think so? Please let me know.
> 
> You talk about calls above but the patch doesn't do anything about calls.  You
> also don't do anything about register pressure; rather, the effect of your
> changes is to move some stmts by a smaller "distance", whatever effect that has.
> 
>>>
>>> More comments below.
>>>
>>>> 2023-10-12  Ajit Kumar Agarwal  
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> PR tree-optimization/81953
>>>> * tree-ssa-sink.cc (statement_sink_location): Move statements 
>>>> before
>>>> calls.
>>>> (select_best_block): Add heuristics to select the best blocks in 
>>>> the
>>>> immediate post dominator.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> PR tree-optimization/81953
>>>> * gcc.dg/tree-ssa/ssa-sink-20.c: New test.
>>>> * gcc.dg/tree-ssa/ssa-sink-21.c: New test.
>>>> ---
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 ++
>>>>  gcc/tree-ssa-sink.cc| 39 -
>>>>  3 files changed, 56 insertions(+), 17 deletions(-)
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
>>>>
>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
>>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>> new file mode 100644
>>>> index 000..d3b79ca5803
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>> @@ -0,0 +1,15 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-O2 -fdump-tree-sink-stats" } */
>>>> +void bar();
>>>> +int j;
>>>> +void foo(int a, int b, int c, int d, int e, int f)
>>>> +{
>>>> +  int l;
>>>> +  l = a + b + c + d +e + f;
>>>> +  if (a != 5)
>>>> +{
>>>> +  bar();
>>>> +  j = l;
>>>> +}
>>>> +}
>>>> +/* { dg-final { scan-tree-dump 
>>>> {l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
>>>> diff --git a/gcc/testsuite/gcc.dg

[PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-19 Thread Ajit Agarwal
Hello All:

Version 9 of the patch uses ABI interfaces to eliminate redundant zero and sign
extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

The following review comments are incorporated in this version:

a) Removal of hard-coded zero_extend and sign_extend in the ABI interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using ABI interfaces.


Thanks & Regards
Ajit

ree: Improve ree pass for rs6000 target using defined abi interfaces

For the rs6000 target we see redundant zero and sign extensions. Improve
the ree pass to eliminate such redundant extensions using the defined
ABI interfaces.

2023-10-20  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Use of zero_extend and sign_extend
defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C: New test.
---
 gcc/ree.cc| 151 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 161 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..9f21f0e9907 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,124 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode of zero_extend
+   or sign_extend otherwise false.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+= targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+  NULL_TREE, 1);
+
+  if (tgt_mode == mode)
+return true;
+  else
+return false;
+}
+
+/* Return TRUE if the candidate insn is zero extend and regno is
+   a return registers.  */
+
+static bool
+abi_extension_candidate_return_reg_p (/*rtx_insn *insn, */int regno)
+{
+  if (targetm.calls.function_value_regno_p (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if reg source operand of zero_extend is argument registers
+   and not return registers and source and destination operand are same
+   and mode of source and destination operand are not same.  */
+
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+  || abi_extension_candidate_return_reg_p (/*insn,*/ REGNO (orig_src)))
+return false;
+
+  /* Mode of destination and source should be different.  */
+  if (dst_mode == GET_MODE (orig_src))
+return false;
+
+  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
+  bool promote_p = abi_target_promote_function_mode (mode);
+
+  /* REGNO of source and destination should be same if not
+  promoted.  */
+  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
+return false;
+
+  return true;
+}
+
+/* Return TRUE if the candidate insn is zero extend and regno is
+   an argument registers.  */
+
+static bool
+abi_extension_candidate_argno_p (/*rtx_code code, */int regno)
+{
+  if (FUNCTION_ARG_REGNO_P (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+return false;
+
+  for (df_link *use = uses; use; use = use->next)
+{
+  if (!use->ref)
+   return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (DF_REF_INSN (use->ref)))
+   return false;
+
+  rtx_insn *use_insn = DF_REF_INSN (use->ref);
+
+  if (GET_CODE (PATTERN (use_insn)) == SET)
+   {
+ rtx_code code = GET_CODE (SET_SRC (PATTERN (use_insn)));
+
+ if (GET_RTX_CLASS (code) == RTX_BIN_ARITH
+ || GET_RTX_CLASS (code) == RTX_COMM_ARITH
+ || GET_RTX_CLASS (code) == RTX_UNARY)
+   return false;
+   }
+ }
+
+  rtx set = single_set (insn);
+
+  if (GET_CODE (SET_SRC (set)) == SIGN_EXTEND)
+{
+  machine_mode m

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-21 Thread Ajit Agarwal
Hello Vineet:

Thanks for your time and valuable comments.

On 21/10/23 5:26 am, Vineet Gupta wrote:
> On 10/19/23 23:50, Ajit Agarwal wrote:
>> Hello All:
>>
>> This version 9 of the patch uses abi interfaces to remove zero and sign 
>> extension elimination.
>> Bootstrapped and regtested on powerpc-linux-gnu.
>>
>> In this version (version 9) of the patch following review comments are 
>> incorporated.
>>
>> a) Removal of hard code zero_extend and sign_extend  in abi interfaces.
>> b) Source and destination with different registers are considered.
>> c) Further enhancements.
>> d) Added sign extension elimination using abi interfaces.
> 
> As has been trend in the past, I don't think all the review comments have 
> been addressed.
> The standard practice is to reply to reviewer's email and say yay/nay 
> explicitly to each comment. Some of my comments in [1a] are still not 
> resolved, importantly the last two. To be fair you did reply [1b] but the 
> comments were not addressed explicitly.
> 
> [1a] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630814.html
> [1b] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630865.html
> 

I have addressed the last two comments in version 10 of the patch. Please let
me know if anything is missing. Regarding the last comment about providing
different tests, please let me know if you have any suggestions.

> Anyhow I gave this a try for RISC-V, specially after [2a][2b] as I was 
> curious to see if this uncovers REE handling extraneous extensions which 
> could potentially be eliminated in Expand itself, which is generally better 
> as it happens earlier in the pipeline.
> 
> [2a] 2023-10-16 8eb9cdd14218 expr: don't clear SUBREG_PROMOTED_VAR_P flag for 
> a promoted subreg [target/111466]
> [2b] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631818.html
> 
> Bad news is with the patch, we fail to even bootstrap risc-v, buckles over 
> when building libgcc itself.
> 
> The reproducer is embarrassingly simple, build with -O2:
> 
> float a;
> b() { return a; }
> 
> See details below
> 
>> Thanks & Regards
>> Ajit
>>
>> ree: Improve ree pass for rs6000 target using defined abi interfaces
>>
>> For rs6000 target we see redundant zero and sign extension and done
>> to improve ree pass to eliminate such redundant zero and sign extension
>> using defined ABI interfaces.
>>
>> 2023-10-20  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * ree.cc (combine_reaching_defs): Use of zero_extend and sign_extend
>>  defined abi interfaces.
>>  (add_removable_extension): Use of defined abi interfaces for no
>>  reaching defs.
>>  (abi_extension_candidate_return_reg_p): New function.
>>  (abi_extension_candidate_p): New function.
>>  (abi_extension_candidate_argno_p): New function.
>>  (abi_handle_regs): New function.
>>  (abi_target_promote_function_mode): New function.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * g++.target/powerpc/zext-elim-3.C
>> ---
>>   gcc/ree.cc    | 151 +-
>>   .../g++.target/powerpc/zext-elim-3.C  |  13 ++
>>   2 files changed, 161 insertions(+), 3 deletions(-)
>>   create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C
>>
>> diff --git a/gcc/ree.cc b/gcc/ree.cc
>> index fc04249fa84..9f21f0e9907 100644
>> --- a/gcc/ree.cc
>> +++ b/gcc/ree.cc
>> @@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
>>   if (REGNO (DF_REF_REG (def)) == REGNO (reg))
>>     break;
>>   -  gcc_assert (def != NULL);
>> +  if (def == NULL)
>> +    return NULL;
>>       ref_chain = DF_REF_CHAIN (def);
>>   @@ -750,6 +751,124 @@ get_extended_src_reg (rtx src)
>>     return src;
>>   }
>>   +/* Return TRUE if target mode is equal to source mode of zero_extend
>> +   or sign_extend otherwise false.  */
>> +
>> +static bool
>> +abi_target_promote_function_mode (machine_mode mode)
>> +{
>> +  int unsignedp;
>> +  machine_mode tgt_mode
>> +    = targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
>> +   NULL_TREE, 1);
>> +
>> +  if (tgt_mode == mode)
>> +    return true;
>> +  else
>> +    return false;
>> +}
>> +
>> +/* Return TRUE if the candidate insn is zero extend and regno is
>> +   a return registers.  */
>> +
>> +static bool
>> +abi_extension_candidate_return_re

[PATCH v10 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-21 Thread Ajit Agarwal
Hello Vineet and Jeff:

Version 10 of this patch uses ABI interfaces to eliminate redundant zero and
sign extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

Version 10 incorporates the following review comments:

a) Removed hard-coded zero_extend and sign_extend in the ABI interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using ABI interfaces.
e) Addressed remaining review comments.

Please check whether it fixes the bootstrap failure on RISC-V.
Also, please let me know if anything is missing from this patch.

Ok for trunk?

Thanks & Regards
Ajit


ree: Improve ree pass for rs6000 target using defined abi interfaces

For the rs6000 target we see redundant zero and sign extensions.
Improve the ree pass to eliminate such redundant zero and sign extensions
using defined ABI interfaces.

2023-10-21  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Use of zero_extend and sign_extend
defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C: New test.
---
 gcc/ree.cc| 145 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 154 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..51b1aab165e 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,124 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode of zero_extend
+   or sign_extend otherwise false.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+= targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+  NULL_TREE, 1);
+
+  if (tgt_mode == mode)
+return true;
+  else
+return false;
+}
+
+/* Return TRUE if the candidate insn is zero extend and regno is
+   a return registers.  */
+
+static bool
+abi_extension_candidate_return_reg_p (/*rtx_insn *insn, */int regno)
+{
+  if (targetm.calls.function_value_regno_p (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if reg source operand of zero_extend is argument registers
+   and not return registers and source and destination operand are same
+   and mode of source and destination operand are not same.  */
+
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+  || abi_extension_candidate_return_reg_p (/*insn,*/ REGNO (orig_src)))
+return false;
+
+  /* Mode of destination and source should be different.  */
+  if (dst_mode == GET_MODE (orig_src))
+return false;
+
+  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
+  bool promote_p = abi_target_promote_function_mode (mode);
+
+  /* REGNO of source and destination should be same if not
+  promoted.  */
+  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
+return false;
+
+  return true;
+}
+
+/* Return TRUE if the candidate insn is zero extend and regno is
+   an argument registers.  */
+
+static bool
+abi_extension_candidate_argno_p (/*rtx_code code, */int regno)
+{
+  if (FUNCTION_ARG_REGNO_P (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+return false;
+
+  for (df_link *use = uses; use; use = use->next)
+{
+  if (!use->ref)
+   return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (DF_REF_INSN (use->ref)))
+   return false;
+
+  rtx_insn *use_insn = DF_REF_INSN (use->ref);
+
+  if (GET_CODE (PATTERN (use_insn)) == SET)
+   {
+ rtx_code code = GET_CODE (SET_SRC (PATTERN (use_insn)));
+
+ if (GET_RTX_CLASS (code) == RTX_BIN_ARITH
+ || GET_RTX_CLASS (code) == RTX_COMM_ARITH
+  

[PATCH V11] ree: Improve ree pass using defined abi interfaces

2023-10-22 Thread Ajit Agarwal
Hello Vineet, Jeff and Bernhard:

Version 11 of this patch uses ABI interfaces to eliminate redundant zero and
sign extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

Version 11 incorporates the following review comments:

a) Removed hard-coded zero_extend and sign_extend in the ABI interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using ABI interfaces.
e) Addressed remaining review comments from Vineet.
f) Addressed review comments from Bernhard.

Please check whether it fixes the bootstrap failure on RISC-V.
Also, please let me know if anything is missing from this patch.

Ok for trunk?

Thanks & Regards
Ajit


ree: Improve ree pass using defined abi interfaces

For the rs6000 target we see redundant zero and sign extensions.
Improve the ree pass to eliminate such redundant zero and sign extensions
using defined ABI interfaces.

2023-10-23  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Use of zero_extend and sign_extend
defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C: New test.
---
 gcc/ree.cc| 144 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 151 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..88528b048c2 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,117 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode, false otherwise.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+= targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+  NULL_TREE, 1);
+
+  if (tgt_mode == mode)
+return true;
+  else
+return false;
+}
+
+/* Return TRUE if regno is a return register.  */
+
+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  return targetm.calls.function_value_regno_p (regno);
+}
+
+/* Return TRUE if the following conditions are satisified.
+
+  a) reg source operand is argument register and not return register.
+  b) mode of source and destination operand are different.
+  c) if not promoted REGNO of source and destination operand are same.  */
+
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+  || abi_extension_candidate_return_reg_p (/*insn,*/ REGNO (orig_src)))
+return false;
+
+  /* Return FALSE if mode of destination and source is same.  */
+  if (dst_mode == GET_MODE (orig_src))
+return false;
+
+  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
+  bool promote_p = abi_target_promote_function_mode (mode);
+
+  /* Return FALSE if promote is false and REGNO of source and destination
+ is different.  */
+  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
+return false;
+
+  return true;
+}
+
+/* Return TRUE if regno is an argument register.  */
+
+static inline bool
+abi_extension_candidate_argno_p (int regno)
+{
+  return FUNCTION_ARG_REGNO_P (regno);
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+return false;
+
+  for (df_link *use = uses; use; use = use->next)
+{
+  if (!use->ref)
+   return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (DF_REF_INSN (use->ref)))
+   return false;
+
+  rtx_insn *use_insn = DF_REF_INSN (use->ref);
+
+  if (GET_CODE (PATTERN (use_insn)) == SET)
+   {
+ rtx_code code = GET_CODE (SET_SRC (PATTERN (use_insn)));
+
+ if (GET_RTX_CLASS (code) == RTX_BIN_ARITH
+ || GET_RTX_CLASS (code) == RTX_COMM_ARITH
+ || GET_RTX_CLASS (code) == RTX_UNARY)
+   return false;
+   }
+   

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-22 Thread Ajit Agarwal
Hello All:

The review comments below are addressed in version 11 of the patch.
Please review and let me know if it is OK for trunk.

Thanks & Regards
Ajit

On 22/10/23 12:56 am, rep.dot@gmail.com wrote:
> On 21 October 2023 01:56:16 CEST, Vineet Gupta  wrote:
>> On 10/19/23 23:50, Ajit Agarwal wrote:
>>> Hello All:
>>>
>>> This version 9 of the patch uses abi interfaces to remove zero and sign 
>>> extension elimination.
>>> Bootstrapped and regtested on powerpc-linux-gnu.
>>>
>>> In this version (version 9) of the patch following review comments are 
>>> incorporated.
>>>
>>> a) Removal of hard code zero_extend and sign_extend  in abi interfaces.
>>> b) Source and destination with different registers are considered.
>>> c) Further enhancements.
>>> d) Added sign extension elimination using abi interfaces.
>>
>> As has been trend in the past, I don't think all the review comments have 
>> been addressed.
> 
> And apart from that, may I ask if this is just me, or does anybody else think 
> that it might be worthwhile to actually read a patch before (re-)posting?
> 
> Seeing e.g. the proposed abi_extension_candidate_p as written in a first POC 
> would deserve some manual CSE, if nothing more then for clarity and 
> conciseness?
> 
> Just curious from a meta perspective..
> 
> And:
> 
>>> ree: Improve ree pass for rs6000 target using defined abi interfaces
> 
> mentioning powerpc like this, and then changing generic code could be 
> interpreted as misleading, IMHO.
> 
>>>
>>> For rs6000 target we see redundant zero and sign extension and done
>>> to improve ree pass to eliminate such redundant zero and sign extension
>>> using defined ABI interfaces.
> 
> Mentioning powerpc in the body as one of the affected target(s) is of course 
> fine.
> 
> 
>>>   +/* Return TRUE if target mode is equal to source mode of zero_extend
>>> +   or sign_extend otherwise false.  */
> 
> , false otherwise.
> 
> But I'm not a native speaker 
> 
> 
>>> +/* Return TRUE if the candidate insn is zero extend and regno is
>>> +   a return registers.  */
>>> +
>>> +static bool
>>> +abi_extension_candidate_return_reg_p (/*rtx_insn *insn, */int regno)
> 
> Leftover debug comment.
> 
>>> +{
>>> +  if (targetm.calls.function_value_regno_p (regno))
>>> +return true;
>>> +
>>> +  return false;
>>> +}
>>> +
> 
> As said, I don't see why the below was not cleaned up before the V1 
> submission.
> Iff it breaks when manually CSEing, I'm curious why?
> 
>>> +/* Return TRUE if reg source operand of zero_extend is argument registers
>>> +   and not return registers and source and destination operand are same
>>> +   and mode of source and destination operand are not same.  */
>>> +
>>> +static bool
>>> +abi_extension_candidate_p (rtx_insn *insn)
>>> +{
>>> +  rtx set = single_set (insn);
>>> +  machine_mode dst_mode = GET_MODE (SET_DEST (set));
>>> +  rtx orig_src = XEXP (SET_SRC (set), 0);
>>> +
>>> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>>> +  || abi_extension_candidate_return_reg_p (/*insn,*/ REGNO (orig_src)))
> 
> On top, debug leftover.
> 
>>> +return false;
>>> +
>>> +  /* Mode of destination and source should be different.  */
>>> +  if (dst_mode == GET_MODE (orig_src))
>>> +return false;
>>> +
>>> +  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
>>> +  bool promote_p = abi_target_promote_function_mode (mode);
>>> +
>>> +  /* REGNO of source and destination should be same if not
>>> +  promoted.  */
>>> +  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
>>> +return false;
>>> +
>>> +  return true;
>>> +}
>>> +
> 
> As said, please also rephrase the above (and everything else if it obviously 
> looks akin the above).
> 
> The rest, mentioned below,  should largely be covered by following the coding 
> convention.
> 
>>> +/* Return TRUE if the candidate insn is zero extend and regno is
>>> +   an argument registers.  */
> 
> Singular register.
> 
>>> +
>>> +static bool
>>> +abi_extension_candidate_argno_p (/*rtx_code code, */int regno)
> 
> Debug leftover.
> I would probably have inlined this function manually, with a respective 
> comment.
> Did not look how oft

[PING ^1][PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-10-23 Thread Ajit Agarwal



Ping ^1.

 Forwarded Message 
Subject: [PING ^0][PATCH v2] rs6000: Add new pass for replacement of contiguous 
addresses vector load lxv with lxvp
Date: Sun, 15 Oct 2023 17:43:24 +0530
From: Ajit Agarwal 
To: gcc-patches 
CC: Segher Boessenkool , Kewen.Lin 
, Peter Bergner 

Hello All:

Please review.

Thanks & Regards
Ajit


 Forwarded Message 
Subject: [PATCH v2] rs6000: Add new pass for replacement of contiguous 
addresses vector load lxv with lxvp
Date: Sun, 8 Oct 2023 00:34:27 +0530
From: Ajit Agarwal 
To: gcc-patches 
CC: Segher Boessenkool , Peter Bergner 
, Kewen.Lin 

Hello All:

This patch adds a new pass to replace vector loads (lxv) from contiguous
addresses with the MMA instruction lxvp. It also addresses one regression
failure on the ARM architecture.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Add new pass for replacement of contiguous lxv with lxvp.

New pass to replace lxv loads from contiguous addresses with lxvp. This pass
is registered after the ree RTL pass.

2023-10-07  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: Registered vecload pass.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add new executable.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
---
 gcc/config.gcc |   4 +-
 gcc/config/rs6000/rs6000-passes.def|   1 +
 gcc/config/rs6000/rs6000-protos.h  |   2 +
 gcc/config/rs6000/rs6000-vecload-opt.cc| 234 +
 gcc/config/rs6000/rs6000.cc|   3 +-
 gcc/config/rs6000/t-rs6000 |   4 +
 gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
 7 files changed, 260 insertions(+), 3 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..482ab094b89 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -515,7 +515,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -552,7 +552,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt 
rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-logue.cc 
\$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def 
b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..9ecf8ce6a9c 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
  The power8 does not have instructions that automaticaly do the byte swaps
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h 
b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..9c44bae33d3 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
 extern bool gpr_or_gpr_p (rtx, rtx);
 extern bool direct_move_p (rtx, rtx);
 extern bool quad_address_p (rtx, machine_mode, bool);
+extern bool mode_supports_dq_form (machine_mode);
 extern bool quad_load_store_p (rtx, rtx);
 extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
 extern void expand_fusion_gpr_load (rtx *);
@@ -344,6 +345,7 @@ class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern 

[PATCH V12 4/4] ree: Improve ree pass using defined abi interfaces

2023-10-23 Thread Ajit Agarwal
Hello Vineet, Jeff and Bernhard:

Version 12 of this patch uses ABI interfaces to eliminate redundant zero and
sign extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

Version 12 incorporates the following review comments:

a) Removed hard-coded zero_extend and sign_extend in the ABI interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using ABI interfaces.
e) Addressed remaining review comments from Vineet.
f) Addressed review comments from Bernhard.
g) Fixed aarch64 regression failures.

Please check whether it fixes the bootstrap failure on RISC-V.
Also, please let me know if anything is missing from this patch.

Ok for trunk?

Thanks & Regards
Ajit

ree: Improve ree pass using defined abi interfaces

For the rs6000 target we see redundant zero and sign extensions.
Improve the ree pass to eliminate such redundant zero and sign extensions
using defined ABI interfaces.

2023-10-23  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Use of zero_extend and sign_extend
defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C: New test.
---
 gcc/ree.cc| 147 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 154 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..9fdc06562ad 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,120 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode, false otherwise.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+= targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+  NULL_TREE, 1);
+
+  if (tgt_mode == mode)
+return true;
+  else
+return false;
+}
+
+/* Return TRUE if regno is a return register.  */
+
+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  if (targetm.calls.function_value_regno_p (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if the following conditions are satisified.
+
+  a) reg source operand is argument register and not return register.
+  b) mode of source and destination operand are different.
+  c) if not promoted REGNO of source and destination operand are same.  */
+
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+  || abi_extension_candidate_return_reg_p (/*insn,*/ REGNO (orig_src)))
+return false;
+
+  /* Return FALSE if mode of destination and source is same.  */
+  if (dst_mode == GET_MODE (orig_src))
+return false;
+
+  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
+  bool promote_p = abi_target_promote_function_mode (mode);
+
+  /* Return FALSE if promote is false and REGNO of source and destination
+ is different.  */
+  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
+return false;
+
+  return true;
+}
+
+/* Return TRUE if regno is an argument register.  */
+
+static inline bool
+abi_extension_candidate_argno_p (int regno)
+{
+  return FUNCTION_ARG_REGNO_P (regno);
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+return false;
+
+  for (df_link *use = uses; use; use = use->next)
+{
+  if (!use->ref)
+   return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (DF_REF_INSN (use->ref)))
+   return false;
+
+  rtx_insn *use_insn = DF_REF_INSN (use->ref);
+
+  if (GET_CODE (PATTERN (use_insn)) == SET)
+   {
+ rtx_code code = GET_CODE (SET_SRC (PATTERN (use_insn)));
+
+ if (GET_RTX_CLASS (code) == RTX_BIN_ARITH
+ || GET_RTX_CLASS (code) == RTX_COMM_ARITH
+ || GET_RT

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-24 Thread Ajit Agarwal
Hello Bernhard:

On 23/10/23 7:40 pm, Bernhard Reutner-Fischer wrote:
> On Mon, 23 Oct 2023 12:16:18 +0530
> Ajit Agarwal  wrote:
> 
>> Hello All:
>>
>> Addressed below review comments in the version 11 of the patch.
>> Please review and please let me know if its ok for trunk.
> 
> s/satisified/satisfied/
> 

I will fix this.

>>> As said, I don't see why the below was not cleaned up before the V1 
>>> submission.
>>> Iff it breaks when manually CSEing, I'm curious why?
> 
> The function below looks identical in v12 of the patch.
> Why didn't you use common subexpressions?
> ba

Using CSE here breaks the aarch64 regression tests, so I have reverted the
change and do not use CSE in this function.
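
For reference, the kind of manual CSE being discussed could look roughly like
the sketch below. It is a self-contained model and not the ree.cc code: Mode,
Reg, ExtendInsn, AbiHooks and toy_abi are hypothetical stand-ins for
machine_mode, the SET operands, FUNCTION_ARG_REGNO_P,
targetm.calls.function_value_regno_p and
targetm.calls.promote_function_mode, and the register numbers in toy_abi are
arbitrary. The source operand and its mode are read once and reused. The
sketch does not reproduce or explain the aarch64 failures mentioned above.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for GCC's machine_mode.
enum class Mode { QI, HI, SI, DI };

// Hypothetical stand-in for a hard register operand.
struct Reg { int regno; Mode mode; };

// Models (set dest (zero_extend/sign_extend src)).
struct ExtendInsn { Reg dest; Reg src; };

// Hypothetical stand-ins for the ABI queries used by the patch.
struct AbiHooks {
  std::function<bool(int)> arg_regno_p;    // is regno an argument register?
  std::function<bool(int)> value_regno_p;  // is regno a return register?
  std::function<Mode(Mode)> promote_mode;  // mode after ABI promotion
};

// Toy ABI for the example only: registers 3..10 carry arguments, register 3
// also carries the return value, and every mode is promoted to DI.
AbiHooks toy_abi()
{
  return AbiHooks{
    [](int regno) { return regno >= 3 && regno <= 10; },
    [](int regno) { return regno == 3; },
    [](Mode) { return Mode::DI; }
  };
}

// Same three conditions as abi_extension_candidate_p in the patch, with the
// source operand and its mode computed once.
bool abi_extension_candidate_p(const ExtendInsn &insn, const AbiHooks &abi)
{
  const Reg &src = insn.src;
  const Reg &dest = insn.dest;

  // a) Source must be an argument register and not a return register.
  if (!abi.arg_regno_p(src.regno) || abi.value_regno_p(src.regno))
    return false;

  // b) Source and destination modes must differ.
  if (dest.mode == src.mode)
    return false;

  // c) If promotion leaves the mode unchanged, any destination is accepted;
  //    otherwise source and destination must be the same register.
  const bool promote_p = abi.promote_mode(src.mode) == src.mode;
  return promote_p || dest.regno == src.regno;
}
```

With toy_abi (), an SI-to-DI extension of register 4 into register 4 is
accepted, while the same extension into register 5, or one whose source is
register 3, is rejected.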

>>>   
>>>>> +/* Return TRUE if reg source operand of zero_extend is argument registers
>>>>> +   and not return registers and source and destination operand are same
>>>>> +   and mode of source and destination operand are not same.  */
>>>>> +
>>>>> +static bool
>>>>> +abi_extension_candidate_p (rtx_insn *insn)
>>>>> +{
>>>>> +  rtx set = single_set (insn);
>>>>> +  machine_mode dst_mode = GET_MODE (SET_DEST (set));
>>>>> +  rtx orig_src = XEXP (SET_SRC (set), 0);
>>>>> +
>>>>> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>>>>> +  || abi_extension_candidate_return_reg_p (/*insn,*/ REGNO 
>>>>> (orig_src)))  
>>>>> +return false;
>>>>> +
>>>>> +  /* Mode of destination and source should be different.  */
>>>>> +  if (dst_mode == GET_MODE (orig_src))
>>>>> +return false;
>>>>> +
>>>>> +  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
>>>>> +  bool promote_p = abi_target_promote_function_mode (mode);
>>>>> +
>>>>> +  /* REGNO of source and destination should be same if not
>>>>> +  promoted.  */
>>>>> +  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
>>>>> +return false;
>>>>> +
>>>>> +  return true;
>>>>> +}
>>>>> +  
> 
> 
>>>
>>> As said, please also rephrase the above (and everything else if it 
>>> obviously looks akin the above).
> 
> thanks
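
For readers following the thread, the three checks made by the quoted
abi_extension_candidate_p can be modelled outside GCC. The sketch below is
only an illustration under stated assumptions: mock_mode, mock_extension and
the mock_* helpers are hypothetical stand-ins for machine_mode, the insn's
SET, FUNCTION_ARG_REGNO_P, the function_value_regno_p hook and the
promote_function_mode hook. The register numbers (r3..r10 as argument
registers, r3 as the return register) and the "promote everything narrower
than DImode" rule are assumed for the example and are not taken from the patch.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for machine_mode.
enum mock_mode { MOCK_QI, MOCK_HI, MOCK_SI, MOCK_DI };

// Hypothetical summary of an extension insn: (set (reg dst) (extend (reg src))).
struct mock_extension
{
  unsigned src_regno;
  unsigned dst_regno;
  mock_mode src_mode;
  mock_mode dst_mode;
};

// Assumed ABI for this sketch only: r3..r10 carry arguments.
static bool mock_arg_regno_p (unsigned regno)
{
  return regno >= 3 && regno <= 10;
}

// Assumed ABI for this sketch only: r3 carries the return value.
static bool mock_return_regno_p (unsigned regno)
{
  return regno == 3;
}

// Assumed promotion rule: anything narrower than MOCK_DI becomes MOCK_DI.
static mock_mode mock_promote_function_mode (mock_mode mode)
{
  return mode < MOCK_DI ? MOCK_DI : mode;
}

// Mirrors the order of checks in the quoted function.
static bool candidate_p (const mock_extension &ext)
{
  // a) Source must be an argument register and not a return register.
  if (!mock_arg_regno_p (ext.src_regno) || mock_return_regno_p (ext.src_regno))
    return false;

  // b) Source and destination modes must differ.
  if (ext.dst_mode == ext.src_mode)
    return false;

  // c) If promotion changes the mode, source and destination must be
  //    the same register.
  bool mode_unchanged
    = mock_promote_function_mode (ext.src_mode) == ext.src_mode;
  if (!mode_unchanged && ext.dst_regno != ext.src_regno)
    return false;

  return true;
}
```

Under these assumptions an SImode to DImode extension of r4 into r4 is
accepted, while the same extension on r3 (a return register), on r11 (not an
argument register), or from r4 into r5 is rejected.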


Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-24 Thread Ajit Agarwal
Hello Vineet:

On 24/10/23 12:02 am, Vineet Gupta wrote:
> 
> 
> On 10/22/23 23:46, Ajit Agarwal wrote:
>> Hello All:
>>
>> Addressed below review comments in the version 11 of the patch.
>> Please review and please let me know if its ok for trunk.
>>
>> Thanks & Regards
>> Ajit
> 
> Again you are not paying attention to prior comments about fixing your 
> submission practice and like some of the prior reviewers I'm starting to get 
> tired, despite potentially good technical content.
> 

Sorry for the inconvenience caused. I will make sure all the comments from
reviewers are addressed.

> 1. The commentary above is NOT part of changelog. Either use a separate cover 
> letter or add patch version change history between two "---" lines just 
> before the start of code diff. And keep accumulating those as you post new 
> versions. See [1]. This is so reviewers know what changed over 10 months and 
> automatically gets dropped when patch is eventually applied/merged into tree.
>

Sure I will do that.
 
> 2. Acknowledge (even if it is yes) each and every comment of the reviewer 
> explicitly inline below. That ensures you don't miss addressing a change 
> since this forces one to think about each of them.
> 

I will certainly acknowledge each and every comment inline.

> I do have some technical comments which I'll follow up with later.

I look forward to it.

> Just a short summary that v10 indeed bootstraps risc-v but I don't see any 
> improvements at all - as in, whenever the abi interfaces code identifies an 
> extension (say, one missing a definition), it is not able to eliminate any 
> extensions despite the patch.
>

Thanks for the summary and the check. 

Thanks & Regards
Ajit
 
> -Vineet
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632180.html
> 
>>
>> On 22/10/23 12:56 am, rep.dot@gmail.com wrote:
>>> On 21 October 2023 01:56:16 CEST, Vineet Gupta  wrote:
>>>> On 10/19/23 23:50, Ajit Agarwal wrote:
>>>>> Hello All:
>>>>>
>>>>> This version 9 of the patch uses abi interfaces to remove zero and sign 
>>>>> extension elimination.
>>>>> Bootstrapped and regtested on powerpc-linux-gnu.
>>>>>
>>>>> In this version (version 9) of the patch following review comments are 
>>>>> incorporated.
>>>>>
>>>>> a) Removal of hard code zero_extend and sign_extend  in abi interfaces.
>>>>> b) Source and destination with different registers are considered.
>>>>> c) Further enhancements.
>>>>> d) Added sign extension elimination using abi interfaces.
>>>> As has been trend in the past, I don't think all the review comments have 
>>>> been addressed.
>>> And apart from that, may I ask if this is just me, or does anybody else 
>>> think that it might be worthwhile to actually read a patch before 
>>> (re-)posting?
>>>
>>> Seeing e.g. the proposed abi_extension_candidate_p as written in a first 
>>> POC would deserve some manual CSE, if nothing more then for clarity and 
>>> conciseness?
>>>
>>> Just curious from a meta perspective..
>>>
>>> And:
>>>
>>>>> ree: Improve ree pass for rs6000 target using defined abi interfaces
>>> mentioning powerpc like this, and then changing generic code could be 
>>> interpreted as misleading, IMHO.
>>>
>>>>> For rs6000 target we see redundant zero and sign extension and done
>>>>> to improve ree pass to eliminate such redundant zero and sign extension
>>>>> using defined ABI interfaces.
>>> Mentioning powerpc in the body as one of the affected target(s) is of 
>>> course fine.
>>>
>>>
>>>>>    +/* Return TRUE if target mode is equal to source mode of zero_extend
>>>>> +   or sign_extend otherwise false.  */
>>> , false otherwise.
>>>
>>> But I'm not a native speaker
>>>
>>>
>>>>> +/* Return TRUE if the candidate insn is zero extend and regno is
>>>>> +   a return registers.  */
>>>>> +
>>>>> +static bool
>>>>> +abi_extension_candidate_return_reg_p (/*rtx_insn *insn, */int regno)
>>> Leftover debug comment.
>>>
>>>>> +{
>>>>> +  if (targetm.calls.function_value_regno_p (regno))
>>>>> +    return true;
>>>>> +
>>>>> +  return false;
>>>>> +}
>>>>> +

[PATCH V13] ree: Improve ree pass using defined abi interfaces

2023-10-24 Thread Ajit Agarwal
Hello Vineet, Jeff and Bernhard:

Version 13 of the patch uses abi interfaces to eliminate zero and sign
extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

In this version (version 13) of the patch, the following review comments are
incorporated.

a) Removal of hard-coded zero_extend and sign_extend in abi interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using abi interfaces.
e) Addressed remaining review comments from Vineet.
f) Addressed review comments from Bernhard.
g) Fixed aarch64 regression failures.

Please let me know if there is anything missing in this patch.

Ok for trunk?

Thanks & Regards
Ajit

ree: Improve ree pass using defined abi interfaces

For the rs6000 target we see zero and sign extensions with missing
definitions. Improve the ree pass to eliminate such zero and sign
extensions using defined ABI interfaces.

2023-10-24  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Eliminate zero_extend and sign_extend
using defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C: New test.
---
changes since v6:
  - Added missing abi interfaces.
  - Rearranging and restructuring the code.
  - Removal of hard coded zero extend and sign extend in abi interfaces.
  - Relaxed different registers with source and destination in abi interfaces.
  - Using CSE in abi interfaces.
  - Fix aarch64 regressions.
  - Add Sign extension removal in abi interfaces.
  - Modified comments as per coding convention.
  - Modified code as per coding convention.
  - Fix bug bootstrapping RISCV failures.
---
 gcc/ree.cc| 144 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 151 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..72e3b625a18 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+    return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,117 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode, false otherwise.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+    = targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+                                           NULL_TREE, 1);
+
+  return tgt_mode == mode;
+}
+
+/* Return TRUE if regno is a return register.  */
+
+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  if (targetm.calls.function_value_regno_p (regno))
+    return true;
+
+  return false;
+}
+
+/* Return TRUE if the following conditions are satisfied.
+
+  a) reg source operand is argument register and not return register.
+  b) mode of source and destination operand are different.
+  c) if not promoted REGNO of source and destination operand are same.  */
+
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+      || abi_extension_candidate_return_reg_p (REGNO (orig_src)))
+    return false;
+
+  /* Return FALSE if mode of destination and source is same.  */
+  if (dst_mode == GET_MODE (orig_src))
+    return false;
+
+  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
+  bool promote_p = abi_target_promote_function_mode (mode);
+
+  /* Return FALSE if promote is false and REGNO of source and destination
+     is different.  */
+  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
+    return false;
+
+  return true;
+}
+
+/* Return TRUE if regno is an argument register.  */
+
+static inline bool
+abi_extension_candidate_argno_p (int regno)
+{
+  return FUNCTION_ARG_REGNO_P (regno);
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+    return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+    return false;
+
+  for (df_link *use = uses; use; use = use->next)
+    {
+      if (!use->ref)
+        return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-24 Thread Ajit Agarwal



On 24/10/23 1:10 pm, Ajit Agarwal wrote:
> Hello Vineet:
> 
> On 24/10/23 12:02 am, Vineet Gupta wrote:
>>
>>
>> On 10/22/23 23:46, Ajit Agarwal wrote:
>>> Hello All:
>>>
>>> Addressed below review comments in the version 11 of the patch.
>>> Please review and please let me know if its ok for trunk.
>>>
>>> Thanks & Regards
>>> Ajit
>>
>> Again you are not paying attention to prior comments about fixing your 
>> submission practice and like some of the prior reviewers I'm starting to get 
>> tired, despite potentially good technical content.
>>
> 
> Sorry for the inconvenience caused. I will make sure all the comments from 
> reviewers
> are addressed.
> 
>> 1. The commentary above is NOT part of changelog. Either use a separate 
>> cover letter or add patch version change history between two "---" lines 
>> just before the start of code diff. And keep accumulating those as you post 
>> new versions. See [1]. This is so reviewers know what changed over 10 months 
>> and automatically gets dropped when patch is eventually applied/merged into 
>> tree.
>>
> 
> Sure I will do that.

I have made this change in version 13 of the patch, which now carries a
"changes since v6" section.

Thanks & Regards
Ajit
>  
>> 2. Acknowledge (even if it is yes) each and every comment of the reviewer 
>> explicitly inline below. That ensures you don't miss addressing a change 
>> since this forces one to think about each of them.
>>
> 
> Surely I will acknowledge each and every comments inline.
> 
>> I do have some technical comments which I'll follow up with later.
> 
> I look forward to it.
> 
>> Just a short summary that v10 indeed bootstraps risc-v but I don't see any 
>> improvements at all - as in, whenever the abi interfaces code identifies an 
>> extension (say, one missing a definition), it is not able to eliminate any 
>> extensions despite the patch.
>>
> 
> Thanks for the summary and the check. 
> 
> Thanks & Regards
> Ajit
>  
>> -Vineet
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632180.html
>>
>>>
>>> On 22/10/23 12:56 am, rep.dot@gmail.com wrote:
>>>> On 21 October 2023 01:56:16 CEST, Vineet Gupta  
>>>> wrote:
>>>>> On 10/19/23 23:50, Ajit Agarwal wrote:
>>>>>> Hello All:
>>>>>>
>>>>>> This version 9 of the patch uses abi interfaces to remove zero and sign 
>>>>>> extension elimination.
>>>>>> Bootstrapped and regtested on powerpc-linux-gnu.
>>>>>>
>>>>>> In this version (version 9) of the patch following review comments are 
>>>>>> incorporated.
>>>>>>
>>>>>> a) Removal of hard code zero_extend and sign_extend  in abi interfaces.
>>>>>> b) Source and destination with different registers are considered.
>>>>>> c) Further enhancements.
>>>>>> d) Added sign extension elimination using abi interfaces.
>>>>> As has been trend in the past, I don't think all the review comments have 
>>>>> been addressed.
>>>> And apart from that, may I ask if this is just me, or does anybody else 
>>>> think that it might be worthwhile to actually read a patch before 
>>>> (re-)posting?
>>>>
>>>> Seeing e.g. the proposed abi_extension_candidate_p as written in a first 
>>>> POC would deserve some manual CSE, if nothing more then for clarity and 
>>>> conciseness?
>>>>
>>>> Just curious from a meta perspective..
>>>>
>>>> And:
>>>>
>>>>>> ree: Improve ree pass for rs6000 target using defined abi interfaces
>>>> mentioning powerpc like this, and then changing generic code could be 
>>>> interpreted as misleading, IMHO.
>>>>
>>>>>> For rs6000 target we see redundant zero and sign extension and done
>>>>>> to improve ree pass to eliminate such redundant zero and sign extension
>>>>>> using defined ABI interfaces.
>>>> Mentioning powerpc in the body as one of the affected target(s) is of 
>>>> course fine.
>>>>
>>>>
>>>>>>    +/* Return TRUE if target mode is equal to source mode of zero_extend
>>>>>> +   or sign_extend otherwise false.  */
>>>> , false otherwise.
>>>>
>>>> But I'm not a native speaker
>>>

Re: [PATCH v6 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces.

2023-10-24 Thread Ajit Agarwal



On 19/09/23 1:57 am, Vineet Gupta wrote:
> Hi Ajit,
> 
> On 9/17/23 22:59, Ajit Agarwal wrote:
>> This new version of patch 6 use improve ree pass for rs6000 target using 
>> defined ABI interfaces.
>> Bootstrapped and regtested on power64-linux-gnu.
>>
>> Review comments incorporated.
>>
>> Thanks & Regards
>> Ajit
> 
> Nit: This seems to belong to "what changed in v6" between the two "---" lines 
> right before start of source diff.

Addressed in V13 of the patch.
> 
>> ree: Improve ree pass for rs6000 target using defined abi interfaces
>>
>> For rs6000 target we see redundant zero and sign extension and done to
>> improve ree pass to eliminate such redundant zero and sign extension
>> using defined ABI interfaces.
> 
> It seems you have redundant "redundant zero and sign extension" - pun 
> intended  ;-)
> 
> On a serious note, when debugging your code for a possible RISC-V benefit, it 
> seems what it is trying to do is address REE giving up due to "missing 
> definition(s)". Perhaps mentioning that in commitlog would give the reader 
> more context.

Addressed in V13 of the patch.
> 
>> +/* Return TRUE if target mode is equal to source mode of zero_extend
>> +   or sign_extend otherwise false.  */
>> +
>> +static bool
>> +abi_target_promote_function_mode (machine_mode mode)
>> +{
>> +  int unsignedp;
>> +  machine_mode tgt_mode =
>> +    targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
>> + NULL_TREE, 1);
>> +
>> +  if (tgt_mode == mode)
>> +    return true;
>> +  else
>> +    return false;
>> +}
>> +
>> +/* Return TRUE if the candidate insn is zero extend and regno is
>> +   an return  registers.  */
> 
> Additional whitespace and grammar
> s/an return  registers/a return register
> 

Addressed in V12 of the patch.

> Please *run* contrib/check_gnu_style on your patch before sending out on 
> mailing lists, saves reviewers time and they can focus more on technical 
> content.
> 
>> +
>> +static bool
>> +abi_extension_candidate_return_reg_p (rtx_insn *insn, int regno)
>> +{
>> +  rtx set = single_set (insn);
>> +
>> +  if (GET_CODE (SET_SRC (set)) != ZERO_EXTEND)
>> +    return false;
> 
> This still has ABI assumptions: RISC-V generates SIGN_EXTEND for functions 
> args and return reg.
> This is not a deficiency of patch per-se, but something we would like to 
> address - even if as an addon-patch.
>

Already addressed in V13 of the patch.
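
To make the point raised above concrete: a check that hard-codes ZERO_EXTEND
rejects the SIGN_EXTEND that RISC-V generates for argument and return
registers, whereas a check that accepts either extension code does not. The
following is a minimal sketch, assuming a hypothetical mock_rtx_code enum in
place of GCC's rtx_code; the helper names are invented for illustration and do
not appear in the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for the few rtx codes relevant here.
enum mock_rtx_code { MOCK_REG, MOCK_PLUS, MOCK_ZERO_EXTEND, MOCK_SIGN_EXTEND };

// Shape of the v6 check: only zero extension is treated as a candidate.
static bool zero_extend_only_p (mock_rtx_code code)
{
  return code == MOCK_ZERO_EXTEND;
}

// ABI-neutral variant: either kind of extension is treated as a candidate.
static bool any_extension_p (mock_rtx_code code)
{
  return code == MOCK_ZERO_EXTEND || code == MOCK_SIGN_EXTEND;
}
```

With the first helper a sign extension is filtered out before any ABI query is
made; with the second, both kinds reach the later argument-register and
return-register checks.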
 
>> +
>> +  if (FUNCTION_VALUE_REGNO_P (regno))
>> +    return true;
>> +
>> +  return false;
>> +}
>> +
>> +/* Return TRUE if reg source operand of zero_extend is argument registers
>> +   and not return registers and source and destination operand are same
>> +   and mode of source and destination operand are not same.  */
>> +
>> +static bool
>> +abi_extension_candidate_p (rtx_insn *insn)
>> +{
>> +  rtx set = single_set (insn);
>> +
>> +  if (GET_CODE (SET_SRC (set)) != ZERO_EXTEND)
>> +    return false;
> Ditto: ABI assumption.
> 

Already addressed in V12 of the patch.

>> +
>> +  machine_mode ext_dst_mode = GET_MODE (SET_DEST (set));
> 
> why not simply @dst_mode
> 
>> +  rtx orig_src = XEXP (SET_SRC (set),0);
>> +
>> +  bool copy_needed
>> +    = (REGNO (SET_DEST (set)) != REGNO (XEXP (SET_SRC (set), 0)));
> 
> Maybe use @orig_src here, rather than duplicating XEXP (SET_SRC (set),0)
>

Already addressed.
 
>> +  if (!copy_needed && ext_dst_mode != GET_MODE (orig_src)
> 
> The bailing out for copy_needed needs extra commentary, why ?
> 
>> +  && FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>> +  && !abi_extension_candidate_return_reg_p (insn, REGNO (orig_src)))
>> +    return true;
>> +
>> +  return false;
> 
> Consider this bike-shed but I would arrange this code differently. The main 
> case here is check for function args and then the not so imp reasons
> 
> +  rtx orig_src = XEXP (src, 0);
> +
> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
> +  || abi_extension_candidate_return_reg_p (insn, REGNO (orig_src)))
> +    return false;
> +
> +  /* commentary as to why  */
> +  if (dst_mode == GET_MODE (orig_src))
> +    return false;
> 
> -   bool copy_needed
> -    = (REGNO (SET_DEST (set)) != REGNO (XEXP (SET_SRC (set), 0)));
> +  /* copy needed  . */
> +  if (REGNO (SET_DEST (set)) != REGNO (orig_src))
> +    return false;

[PATCH V14 4/4] ree: Improve ree pass using defined abi interfaces

2023-10-24 Thread Ajit Agarwal
Hello Vineet, Jeff and Bernhard:

Version 14 of the patch uses abi interfaces to eliminate zero and sign
extensions.
This fixes the aarch64 regression failures seen with aggressive CSE.

Bootstrapped and regtested on powerpc-linux-gnu.

In this version (version 14) of the patch, the following review comments are
incorporated.

a) Removal of hard-coded zero_extend and sign_extend in abi interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using abi interfaces.
e) Addressed remaining review comments from Vineet.
f) Addressed review comments from Bernhard.
g) Fixed aarch64 regression failures.

Please let me know if there is anything missing in this patch.

Ok for trunk?

Thanks & Regards
Ajit

ree: Improve ree pass using defined abi interfaces

For the rs6000 target we see zero and sign extensions with missing
definitions. Improve the ree pass to eliminate such zero and sign
extensions using defined ABI interfaces.

2023-10-24  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Eliminate zero_extend and sign_extend
using defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C: New test.
---
changes since v6:
  - Added missing abi interfaces.
  - Rearranging and restructuring the code.
  - Removal of hard coded zero extend and sign extend in abi interfaces.
  - Relaxed different registers with source and destination in abi interfaces.
  - Using CSE in abi interfaces.
  - Fix aarch64 regressions.
  - Add Sign extension removal in abi interfaces.
  - Modified comments as per coding convention.
  - Modified code as per coding convention.
  - Fix bug bootstrapping RISCV failures.
---
 gcc/ree.cc| 147 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 154 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..f557b49b366 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+    return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,120 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode, false otherwise.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+    = targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+                                           NULL_TREE, 1);
+
+  if (tgt_mode == mode)
+    return true;
+  else
+    return false;
+}
+
+/* Return TRUE if regno is a return register.  */
+
+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  if (targetm.calls.function_value_regno_p (regno))
+    return true;
+
+  return false;
+}
+
+/* Return TRUE if the following conditions are satisfied.
+
+  a) reg source operand is argument register and not return register.
+  b) mode of source and destination operand are different.
+  c) if not promoted REGNO of source and destination operand are same.  */
+
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+      || abi_extension_candidate_return_reg_p (REGNO (orig_src)))
+    return false;
+
+  /* Return FALSE if mode of destination and source is same.  */
+  if (dst_mode == GET_MODE (orig_src))
+    return false;
+
+  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
+  bool promote_p = abi_target_promote_function_mode (mode);
+
+  /* Return FALSE if promote is false and REGNO of source and destination
+     is different.  */
+  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
+    return false;
+
+  return true;
+}
+
+/* Return TRUE if regno is an argument register.  */
+
+static inline bool
+abi_extension_candidate_argno_p (int regno)
+{
+  return FUNCTION_ARG_REGNO_P (regno);
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+    return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+    return false;
+
+  for (df_link *use = uses; use; use = use->n

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-25 Thread Ajit Agarwal



On 25/10/23 2:19 am, Vineet Gupta wrote:
> On 10/24/23 13:36, rep.dot@gmail.com wrote:
>> As said, I don't see why the below was not cleaned up before the V1 
>> submission.
>> Iff it breaks when manually CSEing, I'm curious why?
 The function below looks identical in v12 of the patch.
 Why didn't you use common subexpressions?
 ba
>>> Using CSE here breaks aarch64 regressions hence I have reverted it back
>>> not to use CSE,
>> Just for my own education, can you please paste your patch perusing common 
>> subexpressions and an assembly diff of the failing versus working aarch64 
>> testcase, along how you configured that failing (cross-?)compiler and the 
>> command-line of a typical testcase that broke when manually CSEing the 
>> function below?
> 
> I was meaning to ask this before, but what exactly is the CSE issue, manually 
> or whatever.
> 
Here is the abi interface that I CSE'd, after which I received a mail from the
automated regression run saying that an aarch64 test fails.

static inline bool
abi_extension_candidate_return_reg_p (int regno)
{
  if (targetm.calls.function_value_regno_p (regno))
    return true;

  return false;
}

+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  return targetm.calls.function_value_regno_p (regno);
+}


I have not produced an assembly diff, as I have not cross-compiled for aarch64
myself.
With the above CSE reverted, the tests pass in the automated Linaro build and
regression runs.
Linaro runs the tests on every patch submitted to gcc-patches and reports an
error if anything fails.

Thanks & Regards
Ajit

> -Vineet


Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-25 Thread Ajit Agarwal



On 25/10/23 2:06 am, rep.dot@gmail.com wrote:
> On 24 October 2023 09:36:22 CEST, Ajit Agarwal  wrote:
>> Hello Bernhard:
>>
>> On 23/10/23 7:40 pm, Bernhard Reutner-Fischer wrote:
>>> On Mon, 23 Oct 2023 12:16:18 +0530
>>> Ajit Agarwal  wrote:
>>>
>>>> Hello All:
>>>>
>>>> Addressed below review comments in the version 11 of the patch.
>>>> Please review and please let me know if its ok for trunk.
>>>
>>> s/satisified/satisfied/
>>>
>>
>> I will fix this.
> 
> thanks!
> 
>>
>>>>> As said, I don't see why the below was not cleaned up before the V1 
>>>>> submission.
>>>>> Iff it breaks when manually CSEing, I'm curious why?
>>>
>>> The function below looks identical in v12 of the patch.
>>> Why didn't you use common subexpressions?
>>> ba
>>
>> Using CSE here breaks aarch64 regressions hence I have reverted it back 
>> not to use CSE,
> 
> Just for my own education, can you please paste your patch perusing common 
> subexpressions and an assembly diff of the failing versus working aarch64 
> testcase, along how you configured that failing (cross-?)compiler and the 
> command-line of a typical testcase that broke when manually CSEing the 
> function below?
> 
> I might have not completely understood the subtile intricacies of RTL 
> re-entrancy, it seems?
> 

Here is the abi interface that I CSE'd, after which I received a mail from the
automated regression run saying that an aarch64 test fails.

static inline bool
abi_extension_candidate_return_reg_p (int regno)
{
  if (targetm.calls.function_value_regno_p (regno))
    return true;

  return false;
}

+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  return targetm.calls.function_value_regno_p (regno);
+}


I have not produced an assembly diff, as I have not cross-compiled for aarch64
myself.
With the above CSE reverted, the tests pass in the automated Linaro build and
regression runs.
Linaro runs the tests on every patch submitted to gcc-patches and reports an
error if anything fails.

Thanks & Regards
Ajit
> thanks
> 
>>>>>   
>>>>>>> +/* Return TRUE if reg source operand of zero_extend is argument 
>>>>>>> registers
>>>>>>> +   and not return registers and source and destination operand are same
>>>>>>> +   and mode of source and destination operand are not same.  */
>>>>>>> +
>>>>>>> +static bool
>>>>>>> +abi_extension_candidate_p (rtx_insn *insn)
>>>>>>> +{
>>>>>>> +  rtx set = single_set (insn);
>>>>>>> +  machine_mode dst_mode = GET_MODE (SET_DEST (set));
>>>>>>> +  rtx orig_src = XEXP (SET_SRC (set), 0);
>>>>>>> +
>>>>>>> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>>>>>>> +  || abi_extension_candidate_return_reg_p (/*insn,*/ REGNO 
>>>>>>> (orig_src)))  
>>>>>>> +return false;
>>>>>>> +
>>>>>>> +  /* Mode of destination and source should be different.  */
>>>>>>> +  if (dst_mode == GET_MODE (orig_src))
>>>>>>> +return false;
>>>>>>> +
>>>>>>> +  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
>>>>>>> +  bool promote_p = abi_target_promote_function_mode (mode);
>>>>>>> +
>>>>>>> +  /* REGNO of source and destination should be same if not
>>>>>>> +  promoted.  */
>>>>>>> +  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
>>>>>>> +return false;
>>>>>>> +
>>>>>>> +  return true;
>>>>>>> +}
>>>>>>> +  
>>>
>>>
>>>>>
>>>>> As said, please also rephrase the above (and everything else if it 
>>>>> obviously looks akin the above).
>>>
>>> thanks
> 


Re: [PATCH V14 4/4] ree: Improve ree pass using defined abi interfaces

2023-10-25 Thread Ajit Agarwal



On 24/10/23 11:47 pm, Vineet Gupta wrote:
> 
> 
> On 10/24/23 10:03, Ajit Agarwal wrote:
>> Hello Vineet, Jeff and Bernhard:
>>
>> This version 14 of the patch uses abi interfaces to remove zero and sign 
>> extension elimination.
>> This fixes aarch64 regressions failures with aggressive CSE.
> 
> Once again, this information belong between the two "---" lines that you 
> added for v6 and stopped updating.
> 
> And it seems the only code difference between v13 and v14 is
> 
> -  return tgt_mode == mode;
> +  if (tgt_mode == mode)
> +    return true;
> +  else
> +    return false;
> 
> How does that make any difference ?

In V14 of the patch I reverted the CSE done in V13 of the patch.
This is because I received a mail from Linaro reporting regression failures.
I then received an apology mail saying there had been some errors at their end
and asking me to ignore the report.

Please review V13 of the patch, which has the CSE applied, and let me know if
it is okay for trunk.
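
For reference, the two return forms that distinguish v13 from v14 can be
compared in isolation. This is a small sketch under stated assumptions, not
the patch itself: mock_mode and mock_promote are hypothetical stand-ins for
machine_mode and for the target's promote_function_mode hook, and the
promotion rule (everything narrower than MOCK_DI becomes MOCK_DI) is assumed
for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for machine_mode.
enum mock_mode { MOCK_QI, MOCK_HI, MOCK_SI, MOCK_DI };

// Assumed promotion rule for this sketch only.
static mock_mode mock_promote (mock_mode mode)
{
  return mode < MOCK_DI ? MOCK_DI : mode;
}

// v13 form: return the comparison directly.
static bool unchanged_v13 (mock_mode mode)
{
  mock_mode tgt_mode = mock_promote (mode);
  return tgt_mode == mode;
}

// v14 form: branch on the comparison and return a literal.
static bool unchanged_v14 (mock_mode mode)
{
  mock_mode tgt_mode = mock_promote (mode);
  if (tgt_mode == mode)
    return true;
  else
    return false;
}

// Checks that both forms agree on every mode in the mock enum.
static bool forms_agree ()
{
  const mock_mode modes[] = { MOCK_QI, MOCK_HI, MOCK_SI, MOCK_DI };
  for (mock_mode m : modes)
    if (unchanged_v13 (m) != unchanged_v14 (m))
      return false;
  return true;
}
```

Both forms return the same value for every input in this model, which matches
the expectation that the v13 to v14 change is not a functional one.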

Thanks & Regards
Ajit


> 
> -Vineet
> 
>>
>> Bootstrapped and regtested on powerpc-linux-gnu.
>>
>> In this version (version 14) of the patch following review comments are 
>> incorporated.
>>
>> a) Removal of hard code zero_extend and sign_extend  in abi interfaces.
>> b) Source and destination with different registers are considered.
>> c) Further enhancements.
>> d) Added sign extension elimination using abi interfaces.
>> d) Addressed remaining review comments from Vineet.
>> e) Addressed review comments from Bernhard.
>> f) Fix aarch64 regressions failure.
>>
>> Please let me know if there is anything missing in this patch.
>>
>> Ok for trunk?
>>
>> Thanks & Regards
>> Ajit
>>
>> ree: Improve ree pass using defined abi interfaces
>>
>> For rs6000 target we see zero and sign extend with missing
>> definitions. Improved to eliminate such zero and sign extension
>> using defined ABI interfaces.
>>
>> 2023-10-24  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * ree.cc (combine_reaching_defs): Eliminate zero_extend and 
>> sign_extend
>>  using defined abi interfaces.
>>  (add_removable_extension): Use of defined abi interfaces for no
>>  reaching defs.
>>  (abi_extension_candidate_return_reg_p): New function.
>>  (abi_extension_candidate_p): New function.
>>  (abi_extension_candidate_argno_p): New function.
>>  (abi_handle_regs): New function.
>>  (abi_target_promote_function_mode): New function.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * g++.target/powerpc/zext-elim-3.C
>> ---
>> changes since v6:
>>    - Added missing abi interfaces.
>>    - Rearranging and restructuring the code.
>>    - Removal of hard coded zero extend and sign extend in abi interfaces.
>>    - Relaxed different registers with source and destination in abi 
>> interfaces.
>>    - Using CSE in abi interfaces.
>>    - Fix aarch64 regressions.
>>    - Add Sign extension removal in abi interfaces.
>>    - Modified comments as per coding convention.
>>    - Modified code as per coding convention.
>>    - Fix bug bootstrapping RISCV failures.
>> ---
>>   gcc/ree.cc    | 147 +-
>>   .../g++.target/powerpc/zext-elim-3.C  |  13 ++
>>   2 files changed, 154 insertions(+), 6 deletions(-)
>>   create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C
>>
>> diff --git a/gcc/ree.cc b/gcc/ree.cc
>> index fc04249fa84..f557b49b366 100644
>> --- a/gcc/ree.cc
>> +++ b/gcc/ree.cc
>> @@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
>>   if (REGNO (DF_REF_REG (def)) == REGNO (reg))
>>     break;
>>   -  gcc_assert (def != NULL);
>> +  if (def == NULL)
>> +    return NULL;
>>       ref_chain = DF_REF_CHAIN (def);
>>   @@ -750,6 +751,120 @@ get_extended_src_reg (rtx src)
>>     return src;
>>   }
>>   +/* Return TRUE if target mode is equal to source mode, false otherwise.  
>> */
>> +
>> +static bool
>> +abi_target_promote_function_mode (machine_mode mode)
>> +{
>> +  int unsignedp;
>> +  machine_mode tgt_mode
>> +    = targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
>> +   NULL_TREE, 1);
>> +
>> +  if (tgt_mode == mode)
>> +    return true;
>> +  else
>> +    return false;
>> +}
>> +
>> +/* Return TRUE if regno is a return register.  */

[PATCH V15 4/4] ree: Improve ree pass using defined abi interfaces

2023-10-28 Thread Ajit Agarwal
Hello Vineet, Jeff and Bernhard:

Version 15 of the patch uses ABI interfaces to eliminate redundant zero and
sign extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

In this version (version 15) of the patch, the following review comments are
incorporated.

a) Removal of hard-coded zero_extend and sign_extend in abi interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using abi interfaces.
e) Addressed remaining review comments from Vineet.
f) Addressed review comments from Bernhard.
g) Fixed aarch64 regression failures.

Please let me know if there is anything missing in this patch.

Ok for trunk?

Thanks & Regards
Ajit

ree: Improve ree pass using defined abi interfaces

For rs6000 target we see zero and sign extend with missing
definitions. Improved to eliminate such zero and sign extension
using defined ABI interfaces.
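
As a rough illustration of the kind of extension this targets (a sketch only: this is not the contents of zext-elim-3.C, and the function name `widen` is made up), consider a narrow unsigned argument that is widened on return. Assuming an ABI in which the caller already passes narrow integer arguments extended to full register width, as the 64-bit PowerPC ELF ABIs specify, the zero_extend of the incoming argument register has no reaching definition inside the function and is a candidate for removal:

```cpp
#include <cstdint>

// Hypothetical example, not taken from the patch or its testcase.
// The argument arrives in an argument register in a 32-bit mode; the
// return value needs a 64-bit mode, so the compiler may emit a
// zero_extend whose source has no definition inside this function.
std::uint64_t widen (std::uint32_t x)
{
  return x;
}
```

Whether the extension is actually removed depends on the target hooks the patch consults; the source-level result is the same either way.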

2023-10-28  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Eliminate zero_extend and sign_extend
using defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C
---
changes since v6:
  - Added missing abi interfaces.
  - Rearranging and restructuring the code.
  - Removal of hard coded zero extend and sign extend in abi interfaces.
  - Relaxed different registers with source and destination in abi interfaces.
  - Using CSE in abi interfaces.
  - Fix aarch64 regressions.
  - Add Sign extension removal in abi interfaces.
  - Modified comments as per coding convention.
  - Modified code as per coding convention.
  - Fix bug bootstrapping RISCV failures
---
 gcc/ree.cc| 133 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 140 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..b0fb37102ae 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,106 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode, false otherwise.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+= targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+  NULL_TREE, 1);
+
+  return tgt_mode == mode;
+}
+
+/* Return TRUE if regno is a return register.  */
+
+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  if (targetm.calls.function_value_regno_p (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if
+   reg source operand is argument register and not return register,
+   mode of source and destination operand are different,
+   if not promoted REGNO of source and destination operand are the same.  */
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+  && !abi_extension_candidate_return_reg_p (REGNO (orig_src))
+  && dst_mode != GET_MODE (orig_src)
+  && abi_target_promote_function_mode (GET_MODE (orig_src))
+  && REGNO (SET_DEST (set)) == REGNO (orig_src))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if regno is an argument register.  */
+
+static inline bool
+abi_extension_candidate_argno_p (int regno)
+{
+  return FUNCTION_ARG_REGNO_P (regno);
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+return false;
+
+  for (df_link *use = uses; use; use = use->next)
+{
+  if (!use->ref)
+   return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (DF_REF_INSN (use->ref)))
+   return false;
+
+  rtx_insn *use_insn = DF_REF_INSN (use->ref);
+
+  if (GET_CODE (PATTERN (use_insn)) == SET)
+   {
+ rtx_code code = GET_CODE (SET_SRC (PATTERN (use_insn)));
+
+ if (GET_RTX_CLASS (code) == RTX_BIN_ARITH
+ || GET_RTX_CLASS (

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-28 Thread Ajit Agarwal



On 27/10/23 10:46 pm, Bernhard Reutner-Fischer wrote:
> On Wed, 25 Oct 2023 16:41:07 +0530
> Ajit Agarwal  wrote:
> 
>> On 25/10/23 2:19 am, Vineet Gupta wrote:
>>> On 10/24/23 13:36, rep.dot@gmail.com wrote:  
>>>>>>>> As said, I don't see why the below was not cleaned up before the V1 
>>>>>>>> submission.
>>>>>>>> Iff it breaks when manually CSEing, I'm curious why?  
>>>>>> The function below looks identical in v12 of the patch.
>>>>>> Why didn't you use common subexpressions?
>>>>>> ba  
>>>>> Using CSE here breaks aarch64 regressions hence I have reverted it back
>>>>> not to use CSE,  
>>>> Just for my own education, can you please paste your patch perusing common 
>>>> subexpressions and an assembly diff of the failing versus working aarch64 
>>>> testcase, along how you configured that failing (cross-?)compiler and the 
>>>> command-line of a typical testcase that broke when manually CSEing the 
>>>> function below?  
>>>
>>> I was meaning to ask this before, but what exactly is the CSE issue, 
>>> manually or whatever.
> 
> If nothing else it would hopefully improve the readability.
> 
>>>   
>> Here is the abi interface where I CSE'D and got a mail from automated 
>> regressions run that aarch64
>> test fails.
> 
> We already concluded that this failure was obviously a hiccup on the
> testers, no problem.

Thanks.
> 
>> +static inline bool
>> +abi_extension_candidate_return_reg_p (int regno)
>> +{
>> +  return targetm.calls.function_value_regno_p (regno);
>> +}
> 
> But i was referring to abi_extension_candidate_p :)
> 
> your v13 looks like this:
> 
> +static bool
> +abi_extension_candidate_p (rtx_insn *insn)
> +{
> +  rtx set = single_set (insn);
> +  machine_mode dst_mode = GET_MODE (SET_DEST (set));
> +  rtx orig_src = XEXP (SET_SRC (set), 0);
> +
> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
> +  || abi_extension_candidate_return_reg_p (REGNO (orig_src)))
> +return false;
> +
> +  /* Return FALSE if mode of destination and source is same.  */
> +  if (dst_mode == GET_MODE (orig_src))
> +return false;
> +
> +  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
> +  bool promote_p = abi_target_promote_function_mode (mode);
> +
> +  /* Return FALSE if promote is false and REGNO of source and destination
> + is different.  */
> +  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
> +return false;
> +
> +  return true;
> +}
> 
> and i suppose it would be easier to read if phrased something like
> 
> static bool
> abi_extension_candidate_p (rtx_insn *insn)
> {
>   rtx set = single_set (insn);
>   rtx orig_src = XEXP (SET_SRC (set), 0);
>   unsigned int src_regno = REGNO (orig_src);
> 
>   /* Not a function argument reg or is a function values return reg.  */
>   if (!FUNCTION_ARG_REGNO_P (src_regno)
>   || abi_extension_candidate_return_reg_p (src_regno))
> return false;
> 
>   rtx dst = SET_DEST (set);
>   machine_mode src_mode = GET_MODE (orig_src);
> 
>   /* Return FALSE if mode of destination and source is the same.  */
>   if (GET_MODE (dst) == src_mode)
> return false;
> 
>   /* Return FALSE if the FIX THE COMMENT and REGNO of source and destination
>  is different.  */
>   if (!abi_target_promote_function_mode_p (src_mode)
>   && REGNO (dst) != src_regno)
> return false;
> 
>   return true;
> }
> 
> so no, that's not exactly better.
> 
> Maybe just do what the function comment says (i did not check the "not
> promoted" part, but you get the idea):
> 
> ^L
> 
> /* Return TRUE if
>reg source operand is argument register and not return register,
>mode of source and destination operand are different,
>if not promoted REGNO of source and destination operand are the same.  */
> static bool
> abi_extension_candidate_p (rtx_insn *insn)
> {
>   rtx set = single_set (insn);
>   rtx orig_src = XEXP (SET_SRC (set), 0);
> 
>   if (FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>   && !abi_extension_candidate_return_reg_p (REGNO (orig_src))
>   && GET_MODE (SET_DEST (set)) != GET_MODE (orig_src)
>   && abi_target_promote_function_mode_p (GET_MODE (orig_src))
>   && REGNO (SET_DEST (set)) == REGNO (orig_src))
> return true;
> 
>   return false;
> }
> 
> I think this is much easier to actually read (and that's why good
> function comments are important). In the end it's not important and
> just personal preference.
> Either way, I did not check the plausibility of the logic therein.
> 
>>

Addressed in V15 of the patch. 
>>
>> I have not done an assembly diff, as I have not cross-compiled for
>> aarch64 myself.
> 
> fair enough.
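
The two phrasings of abi_extension_candidate_p quoted above are not logically equivalent, which is easy to miss when reading them side by side. The following is a minimal sketch, assuming each RTL query can be reduced to a boolean; the struct `ext_facts` and all function names are hypothetical and none of this is GCC code:

```cpp
// Toy stand-in for the facts the real predicate reads off the insn.
// All names here are hypothetical.
struct ext_facts
{
  bool src_is_arg_reg;     // FUNCTION_ARG_REGNO_P (REGNO (orig_src))
  bool src_is_return_reg;  // abi_extension_candidate_return_reg_p (...)
  bool modes_differ;       // destination mode != source mode
  bool promote_p;          // abi_target_promote_function_mode (src mode)
  bool same_regno;         // REGNO (dst) == REGNO (orig_src)
};

// Early-return form, quoted in the review as "your v13".
bool candidate_v13 (const ext_facts &f)
{
  if (!f.src_is_arg_reg || f.src_is_return_reg)
    return false;
  if (!f.modes_differ)
    return false;
  if (!f.promote_p && !f.same_regno)
    return false;
  return true;
}

// Single-condition form suggested at the end of the review.
bool candidate_conjunctive (const ext_facts &f)
{
  return f.src_is_arg_reg
         && !f.src_is_return_reg
         && f.modes_differ
         && f.promote_p
         && f.same_regno;
}

// Count the inputs (out of 32) on which the two forms disagree.
int count_disagreements ()
{
  int n = 0;
  for (int bits = 0; bits < 32; ++bits)
    {
      ext_facts f = { (bits & 1) != 0, (bits & 2) != 0, (bits & 4) != 0,
                      (bits & 8) != 0, (bits & 16) != 0 };
      if (candidate_v13 (f) != candidate_conjunctive (f))
        ++n;
    }
  return n;
}
```

The v13 form accepts a candidate when promote_p or same_regno holds, while the single-condition form requires both, so they differ whenever exactly one of the two is true. The V15 posting dated 2023-10-28 uses the single-condition form; the V15 posting dated 2023-10-29 uses a nested test with the v13 logic.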


Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-28 Thread Ajit Agarwal



On 28/10/23 4:09 am, Vineet Gupta wrote:
> 
> 
> On 10/27/23 10:16, Bernhard Reutner-Fischer wrote:
>> On Wed, 25 Oct 2023 16:41:07 +0530
>> Ajit Agarwal  wrote:
>>
>>> On 25/10/23 2:19 am, Vineet Gupta wrote:
>>>> On 10/24/23 13:36, rep.dot@gmail.com wrote:
>>>>>>>>> As said, I don't see why the below was not cleaned up before the V1 
>>>>>>>>> submission.
>>>>>>>>> Iff it breaks when manually CSEing, I'm curious why?
>>>>>>> The function below looks identical in v12 of the patch.
>>>>>>> Why didn't you use common subexpressions?
>>>>>>> ba
>>>>>> Using CSE here breaks aarch64 regressions hence I have reverted it back
>>>>>> not to use CSE,
>>>>> Just for my own education, can you please paste your patch perusing 
>>>>> common subexpressions and an assembly diff of the failing versus working 
>>>>> aarch64 testcase, along how you configured that failing (cross-?)compiler 
>>>>> and the command-line of a typical testcase that broke when manually 
>>>>> CSEing the function below?
>>>> I was meaning to ask this before, but what exactly is the CSE issue, 
>>>> manually or whatever.
>> If nothing else it would hopefully improve the readability.
>>
>>>>    
>>> Here is the abi interface where I CSE'D and got a mail from automated 
>>> regressions run that aarch64
>>> test fails.
>> We already concluded that this failure was obviously a hiccup on the
>> testers, no problem.
>>
>>> +static inline bool
>>> +abi_extension_candidate_return_reg_p (int regno)
>>> +{
>>> +  return targetm.calls.function_value_regno_p (regno);
>>> +}
>> But i was referring to abi_extension_candidate_p :)
>>
>> your v13 looks like this:
>>
>> +static bool
>> +abi_extension_candidate_p (rtx_insn *insn)
>> +{
>> +  rtx set = single_set (insn);
>> +  machine_mode dst_mode = GET_MODE (SET_DEST (set));
>> +  rtx orig_src = XEXP (SET_SRC (set), 0);
>> +
>> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>> +  || abi_extension_candidate_return_reg_p (REGNO (orig_src)))
>> +    return false;
>> +
>> +  /* Return FALSE if mode of destination and source is same.  */
>> +  if (dst_mode == GET_MODE (orig_src))
>> +    return false;
>> +
>> +  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
>> +  bool promote_p = abi_target_promote_function_mode (mode);
>> +
>> +  /* Return FALSE if promote is false and REGNO of source and destination
>> + is different.  */
>> +  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
>> +    return false;
>> +
>> +  return true;
>> +}
>>
>> and i suppose it would be easier to read if phrased something like
>>
>> static bool
>> abi_extension_candidate_p (rtx_insn *insn)
>> {
>>    rtx set = single_set (insn);
>>    rtx orig_src = XEXP (SET_SRC (set), 0);
>>    unsigned int src_regno = REGNO (orig_src);
>>
>>    /* Not a function argument reg or is a function values return reg.  */
>>    if (!FUNCTION_ARG_REGNO_P (src_regno)
>>    || abi_extension_candidate_return_reg_p (src_regno))
>>  return false;
>>
>>    rtx dst = SET_DEST (set);
>>    machine_mode src_mode = GET_MODE (orig_src);
>>
>>    /* Return FALSE if mode of destination and source is the same.  */
>>    if (GET_MODE (dst) == src_mode)
>>  return false;
>>
>>    /* Return FALSE if the FIX THE COMMENT and REGNO of source and destination
>>   is different.  */
>>    if (!abi_target_promote_function_mode_p (src_mode)
>>    && REGNO (dst) != src_regno)
>>  return false;
>>
>>    return true;
>> }
>>
>> so no, that's not exactly better.
>>
>> Maybe just do what the function comment says (i did not check the "not
>> promoted" part, but you get the idea):
>>
>> ^L
>>
>> /* Return TRUE if
>>     reg source operand is argument register and not return register,
>>     mode of source and destination operand are different,
>>     if not promoted REGNO of source and destination operand are the same.  */
>> static bool
>> abi_extension_candidate_p (rtx_insn *insn)
>> {
>>    rtx set = single_set (insn);
>>    rtx orig_src = XEXP (SET_SRC (set), 0);
>>
>>    if (FUNCTION_ARG_REG

[PATCH V15 4/4] ree: Improve ree pass using defined abi interfaces

2023-10-29 Thread Ajit Agarwal
Hello Vineet, Jeff and Bernhard:

Version 15 of the patch uses ABI interfaces to eliminate redundant zero and
sign extensions.
Bootstrapped and regtested on powerpc-linux-gnu.

In this version (version 15) of the patch, the following review comments are
incorporated.

a) Removal of hard-coded zero_extend and sign_extend in abi interfaces.
b) Source and destination with different registers are considered.
c) Further enhancements.
d) Added sign extension elimination using abi interfaces.
e) Addressed remaining review comments from Vineet.
f) Addressed review comments from Bernhard.
g) Fixed aarch64 regression failures.

Please let me know if there is anything missing in this patch.

Ok for trunk?

Thanks & Regards
Ajit

ree: Improve ree pass using defined abi interfaces

For rs6000 target we see zero and sign extend with missing
definitions. Improved to eliminate such zero and sign extension
using defined ABI interfaces.

2023-10-29  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (combine_reaching_defs): Eliminate zero_extend and sign_extend
using defined abi interfaces.
(add_removable_extension): Use of defined abi interfaces for no
reaching defs.
(abi_extension_candidate_return_reg_p): New function.
(abi_extension_candidate_p): New function.
(abi_extension_candidate_argno_p): New function.
(abi_handle_regs): New function.
(abi_target_promote_function_mode): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim-3.C
---
changes since v6:
  - Added missing abi interfaces.
  - Rearranging and restructuring the code.
  - Removal of hard coded zero extend and sign extend in abi interfaces.
  - Relaxed different registers with source and destination in abi interfaces.
  - Using CSE in abi interfaces.
  - Fix aarch64 regressions.
  - Add Sign extension removal in abi interfaces.
  - Modified comments as per coding convention.
  - Modified code as per coding convention.
  - Fix bug bootstrapping RISCV failures
---
 gcc/ree.cc| 136 +-
 .../g++.target/powerpc/zext-elim-3.C  |  13 ++
 2 files changed, 143 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-3.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index fc04249fa84..6af82093eaf 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -514,7 +514,8 @@ get_uses (rtx_insn *insn, rtx reg)
 if (REGNO (DF_REF_REG (def)) == REGNO (reg))
   break;
 
-  gcc_assert (def != NULL);
+  if (def == NULL)
+return NULL;
 
   ref_chain = DF_REF_CHAIN (def);
 
@@ -750,6 +751,109 @@ get_extended_src_reg (rtx src)
   return src;
 }
 
+/* Return TRUE if target mode is equal to source mode, false otherwise.  */
+
+static bool
+abi_target_promote_function_mode (machine_mode mode)
+{
+  int unsignedp;
+  machine_mode tgt_mode
+= targetm.calls.promote_function_mode (NULL_TREE, mode, &unsignedp,
+  NULL_TREE, 1);
+
+  return tgt_mode == mode;
+}
+
+/* Return TRUE if regno is a return register.  */
+
+static inline bool
+abi_extension_candidate_return_reg_p (int regno)
+{
+  if (targetm.calls.function_value_regno_p (regno))
+return true;
+
+  return false;
+}
+
+/* Return TRUE if
+   reg source operand is argument register and not return register,
+   mode of source and destination operand are different,
+   if not promoted REGNO of source and destination operand are the same.  */
+static bool
+abi_extension_candidate_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  machine_mode dst_mode = GET_MODE (SET_DEST (set));
+  rtx orig_src = XEXP (SET_SRC (set), 0);
+
+  if (FUNCTION_ARG_REGNO_P (REGNO (orig_src))
+  && !abi_extension_candidate_return_reg_p (REGNO (orig_src))
+  && dst_mode != GET_MODE (orig_src))
+ {
+   if (!abi_target_promote_function_mode (GET_MODE (orig_src))
+  && REGNO (SET_DEST (set)) != REGNO (orig_src))
+return false;
+
+   return true;
+ }
+  return false;
+}
+
+/* Return TRUE if regno is an argument register.  */
+
+static inline bool
+abi_extension_candidate_argno_p (int regno)
+{
+  return FUNCTION_ARG_REGNO_P (regno);
+}
+
+/* Return TRUE if the candidate insn doesn't have defs and have
+ * uses without RTX_BIN_ARITH/RTX_COMM_ARITH/RTX_UNARY rtx class.  */
+
+static bool
+abi_handle_regs (rtx_insn *insn)
+{
+  if (side_effects_p (PATTERN (insn)))
+return false;
+
+  struct df_link *uses = get_uses (insn, SET_DEST (PATTERN (insn)));
+
+  if (!uses)
+return false;
+
+  for (df_link *use = uses; use; use = use->next)
+{
+  if (!use->ref)
+   return false;
+
+  if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (DF_REF_INSN (use->ref)))
+   return false;
+
+  rtx_insn *use_insn = DF_REF_INSN (use->ref);
+
+  if (GET_CODE (PATTERN (use_insn)) == SET)
+   {
+ rtx_code code = GET_CODE (SET_SRC (PATTERN (use_insn)));
+
+ if (GET_RTX_CLASS (code)

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-29 Thread Ajit Agarwal



On 28/10/23 3:55 pm, Ajit Agarwal wrote:
> 
> 
> On 27/10/23 10:46 pm, Bernhard Reutner-Fischer wrote:
>> On Wed, 25 Oct 2023 16:41:07 +0530
>> Ajit Agarwal  wrote:
>>
>>> On 25/10/23 2:19 am, Vineet Gupta wrote:
>>>> On 10/24/23 13:36, rep.dot@gmail.com wrote:  
>>>>>>>>> As said, I don't see why the below was not cleaned up before the V1 
>>>>>>>>> submission.
>>>>>>>>> Iff it breaks when manually CSEing, I'm curious why?  
>>>>>>> The function below looks identical in v12 of the patch.
>>>>>>> Why didn't you use common subexpressions?
>>>>>>> ba  
>>>>>> Using CSE here breaks aarch64 regressions hence I have reverted it back
>>>>>> not to use CSE,  
>>>>> Just for my own education, can you please paste your patch perusing 
>>>>> common subexpressions and an assembly diff of the failing versus working 
>>>>> aarch64 testcase, along how you configured that failing (cross-?)compiler 
>>>>> and the command-line of a typical testcase that broke when manually 
>>>>> CSEing the function below?  
>>>>
>>>> I was meaning to ask this before, but what exactly is the CSE issue, 
>>>> manually or whatever.
>>
>> If nothing else it would hopefully improve the readability.
>>
>>>>   
>>> Here is the abi interface where I CSE'D and got a mail from automated 
>>> regressions run that aarch64
>>> test fails.
>>
>> We already concluded that this failure was obviously a hiccup on the
>> testers, no problem.
> 
> Thanks.
>>
>>> +static inline bool
>>> +abi_extension_candidate_return_reg_p (int regno)
>>> +{
>>> +  return targetm.calls.function_value_regno_p (regno);
>>> +}
>>
>> But i was referring to abi_extension_candidate_p :)
>>
>> your v13 looks like this:
>>
>> +static bool
>> +abi_extension_candidate_p (rtx_insn *insn)
>> +{
>> +  rtx set = single_set (insn);
>> +  machine_mode dst_mode = GET_MODE (SET_DEST (set));
>> +  rtx orig_src = XEXP (SET_SRC (set), 0);
>> +
>> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>> +  || abi_extension_candidate_return_reg_p (REGNO (orig_src)))
>> +return false;
>> +
>> +  /* Return FALSE if mode of destination and source is same.  */
>> +  if (dst_mode == GET_MODE (orig_src))
>> +return false;
>> +
>> +  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
>> +  bool promote_p = abi_target_promote_function_mode (mode);
>> +
>> +  /* Return FALSE if promote is false and REGNO of source and destination
>> + is different.  */
>> +  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
>> +return false;
>> +
>> +  return true;
>> +}
>>
>> and i suppose it would be easier to read if phrased something like
>>
>> static bool
>> abi_extension_candidate_p (rtx_insn *insn)
>> {
>>   rtx set = single_set (insn);
>>   rtx orig_src = XEXP (SET_SRC (set), 0);
>>   unsigned int src_regno = REGNO (orig_src);
>>
>>   /* Not a function argument reg or is a function values return reg.  */
>>   if (!FUNCTION_ARG_REGNO_P (src_regno)
>>   || abi_extension_candidate_return_reg_p (src_regno))
>> return false;
>>
>>   rtx dst = SET_DEST (set);
>>   machine_mode src_mode = GET_MODE (orig_src);
>>
>>   /* Return FALSE if mode of destination and source is the same.  */
>>   if (GET_MODE (dst) == src_mode)
>> return false;
>>
>>   /* Return FALSE if the FIX THE COMMENT and REGNO of source and destination
>>  is different.  */
>>   if (!abi_target_promote_function_mode_p (src_mode)
>>   && REGNO (dst) != src_regno)
>> return false;
>>
>>   return true;
>> }
>>
>> so no, that's not exactly better.
>>
>> Maybe just do what the function comment says (i did not check the "not
>> promoted" part, but you get the idea):
>>
>> ^L
>>
>> /* Return TRUE if
>>reg source operand is argument register and not return register,
>>mode of source and destination operand are different,
>>if not promoted REGNO of source and destination operand are the same.  */
>> static bool
>> abi_extension_candidate_p (rtx_insn *insn)
>> {
>>   rtx set = single_set (insn);
>>   rtx or

Re: [PATCH v9 4/4] ree: Improve ree pass for rs6000 target using defined ABI interfaces

2023-10-29 Thread Ajit Agarwal



On 28/10/23 3:56 pm, Ajit Agarwal wrote:
> 
> 
> On 28/10/23 4:09 am, Vineet Gupta wrote:
>>
>>
>> On 10/27/23 10:16, Bernhard Reutner-Fischer wrote:
>>> On Wed, 25 Oct 2023 16:41:07 +0530
>>> Ajit Agarwal  wrote:
>>>
>>>> On 25/10/23 2:19 am, Vineet Gupta wrote:
>>>>> On 10/24/23 13:36, rep.dot@gmail.com wrote:
>>>>>>>>>> As said, I don't see why the below was not cleaned up before the V1 
>>>>>>>>>> submission.
>>>>>>>>>> Iff it breaks when manually CSEing, I'm curious why?
>>>>>>>> The function below looks identical in v12 of the patch.
>>>>>>>> Why didn't you use common subexpressions?
>>>>>>>> ba
>>>>>>> Using CSE here breaks aarch64 regressions hence I have reverted it back
>>>>>>> not to use CSE,
>>>>>> Just for my own education, can you please paste your patch perusing 
>>>>>> common subexpressions and an assembly diff of the failing versus working 
>>>>>> aarch64 testcase, along how you configured that failing 
>>>>>> (cross-?)compiler and the command-line of a typical testcase that broke 
>>>>>> when manually CSEing the function below?
>>>>> I was meaning to ask this before, but what exactly is the CSE issue, 
>>>>> manually or whatever.
>>> If nothing else it would hopefully improve the readability.
>>>
>>>>>    
>>>> Here is the abi interface where I CSE'D and got a mail from automated 
>>>> regressions run that aarch64
>>>> test fails.
>>> We already concluded that this failure was obviously a hiccup on the
>>> testers, no problem.
>>>
>>>> +static inline bool
>>>> +abi_extension_candidate_return_reg_p (int regno)
>>>> +{
>>>> +  return targetm.calls.function_value_regno_p (regno);
>>>> +}
>>> But i was referring to abi_extension_candidate_p :)
>>>
>>> your v13 looks like this:
>>>
>>> +static bool
>>> +abi_extension_candidate_p (rtx_insn *insn)
>>> +{
>>> +  rtx set = single_set (insn);
>>> +  machine_mode dst_mode = GET_MODE (SET_DEST (set));
>>> +  rtx orig_src = XEXP (SET_SRC (set), 0);
>>> +
>>> +  if (!FUNCTION_ARG_REGNO_P (REGNO (orig_src))
>>> +  || abi_extension_candidate_return_reg_p (REGNO (orig_src)))
>>> +    return false;
>>> +
>>> +  /* Return FALSE if mode of destination and source is same.  */
>>> +  if (dst_mode == GET_MODE (orig_src))
>>> +    return false;
>>> +
>>> +  machine_mode mode = GET_MODE (XEXP (SET_SRC (set), 0));
>>> +  bool promote_p = abi_target_promote_function_mode (mode);
>>> +
>>> +  /* Return FALSE if promote is false and REGNO of source and destination
>>> + is different.  */
>>> +  if (!promote_p && REGNO (SET_DEST (set)) != REGNO (orig_src))
>>> +    return false;
>>> +
>>> +  return true;
>>> +}
>>>
>>> and i suppose it would be easier to read if phrased something like
>>>
>>> static bool
>>> abi_extension_candidate_p (rtx_insn *insn)
>>> {
>>>    rtx set = single_set (insn);
>>>    rtx orig_src = XEXP (SET_SRC (set), 0);
>>>    unsigned int src_regno = REGNO (orig_src);
>>>
>>>    /* Not a function argument reg or is a function values return reg.  */
>>>    if (!FUNCTION_ARG_REGNO_P (src_regno)
>>>    || abi_extension_candidate_return_reg_p (src_regno))
>>>  return false;
>>>
>>>    rtx dst = SET_DEST (set);
>>>    machine_mode src_mode = GET_MODE (orig_src);
>>>
>>>    /* Return FALSE if mode of destination and source is the same.  */
>>>    if (GET_MODE (dst) == src_mode)
>>>  return false;
>>>
>>>    /* Return FALSE if the FIX THE COMMENT and REGNO of source and 
>>> destination
>>>   is different.  */
>>>    if (!abi_target_promote_function_mode_p (src_mode)
>>>    && REGNO (dst) != src_regno)
>>>  return false;
>>>
>>>    return true;
>>> }
>>>
>>> so no, that's not exactly better.
>>>
>>> Maybe just do what the function comment says (i did not check the "not
>>> promoted" part, but you get the idea):
>>>
>>>

[PATCH V11] : tree-ssa-sink: Improve code sinking pass

2023-10-30 Thread Ajit Agarwal
Hello Richard:

Currently, code sinking sinks code to its use points when they are in a loop
of the same nesting depth. The following patch improves code sinking by placing
the sunk code in an immediate dominator with the same loop nest depth.

Review comments are incorporated.

For example :

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d +e + f;
  if (a != 5)
{
  bar();
  j = l;
}
}

Code Sinking does the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;

  if (a != 5)
{
  l = a + b + c + d +e + f;
  bar();
  j = l;
}
}
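
The dominator walk that the patch adjusts can be modelled in isolation. This is a minimal sketch, assuming the walk runs from the late block up the dominator tree and stops on reaching the early block; `toy_block`, `pick_block` and `straight_line_cfg` are hypothetical names, not GCC's, and the later profile-count, abnormal-edge and memory-operand checks are left out:

```cpp
#include <cassert>
#include <vector>

// Toy CFG node: loop depth plus the index of the immediate dominator.
// Hypothetical model, not GCC's basic_block.
struct toy_block
{
  int loop_depth;
  int idom;   // index into the vector; -1 for the entry block
};

// Walk from LATE up the dominator tree until EARLY is reached, tracking
// the preferred block.  With ALLOW_EQUAL false this mirrors the old `<`
// comparison; with ALLOW_EQUAL true it mirrors the patched `<=`.
// EARLY must dominate LATE, otherwise the assertion below fires.
int pick_block (const std::vector<toy_block> &cfg, int early, int late,
                bool allow_equal)
{
  int best = late;
  int temp = late;
  while (temp != early)
    {
      assert (temp >= 0 && temp < (int) cfg.size ());
      int d_temp = cfg[temp].loop_depth;
      int d_best = cfg[best].loop_depth;
      if (allow_equal ? d_temp <= d_best : d_temp < d_best)
        best = temp;
      temp = cfg[temp].idom;
    }
  return best;
}

// Three blocks at the same loop depth: 0 dominates 1, 1 dominates 2.
std::vector<toy_block> straight_line_cfg ()
{
  return { {0, -1}, {0, 0}, {0, 1} };
}
```

On `straight_line_cfg ()` with early block 0 and late block 2, the strict comparison keeps block 2, while the relaxed comparison moves up to block 1, the immediate dominator at the same loop depth.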

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


tree-ssa-sink: Improve code sinking pass

Currently, code sinking sinks code to its use points when they are in a loop
of the same nesting depth. The following patch improves code sinking by placing
the sunk code in an immediate dominator with the same loop nest depth.

2023-10-30  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR tree-optimization/81953
* tree-ssa-sink.cc (statement_sink_location): Move statements with
same loop nest depth.
(select_best_block): Add heuristics to select the best blocks in the
immediate dominator for same loop nest depth.

gcc/testsuite/ChangeLog:

PR tree-optimization/81953
* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
* gcc.dg/tree-ssa/ssa-sink-22.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 +++
 gcc/tree-ssa-sink.cc| 21 ++---
 3 files changed, 48 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..84e7938c54f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index a360c5cdd6e..0b823b81309 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -176,6 +176,9 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool 
*debug_stmts)
tree, return the best basic block between them (inclusive) to place
statements.
 
+   The best basic block should be an immediate dominator of
+   best basic block if we've moved to same loop nest.
+
We want the most control dependent block in the shallowest loop nest.
 
If the resulting block is in a shallower loop nest, then use it.  Else
@@ -201,14 +204,13 @@ select_best_block (basic_block early_bb,
 {
   /* If we've moved into a lower loop nest, then that becomes
 our best block.  */
-  if (bb_loop_depth (temp_bb) < bb_loop_depth (best_bb))
+  if (bb_loop_depth (temp_bb) <= bb_loop_depth (best_bb))
best_bb = temp_bb;
 
   /* Walk up the dominator tree, hopefully we'll find a shallower
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 }
-
   /* Placing a statement before a setjmp-like function would be invalid
  (it cannot be reevaluated when execution follows an abnormal edge).
  If we selected a block with abnormal predecessors, just punt.  */
@@ -250,7 +252,14 @@ select_best_block (basic_block early_bb,
   /* If result of comparsion is unknown, prefer EARLY_BB.
 Thus use !(...>=..) rather than (...<...)  */
   && !(best_bb->count * 100 >= early_bb->count * threshold))
-return best_bb;
+{
+ /* Avoid sinking to immediate dominator if the statement to be moved
+has memory operand and same loop nest.  */
+  if (best_bb != late_bb && gimple_vuse (stmt))
+   return late_bb;
+
+  return best_bb;
+}
 
   /* No better block found, so return EARLY_BB, which happens to be the
  statement's or

Re: [PATCH v8] tree-ssa-sink: Improve code sinking pass

2023-10-30 Thread Ajit Agarwal
Hello Richard:

On 17/10/23 2:47 pm, Richard Biener wrote:
> On Tue, Oct 17, 2023 at 10:53 AM Ajit Agarwal  wrote:
>>
>> Hello Richard:
>>
>> On 17/10/23 2:03 pm, Richard Biener wrote:
>>> On Thu, Oct 12, 2023 at 10:42 AM Ajit Agarwal  
>>> wrote:
>>>>
>>>> This patch improves code sinking pass to sink statements before call to 
>>>> reduce
>>>> register pressure.
>>>> Review comments are incorporated. Synced and modified with latest trunk 
>>>> sources.
>>>>
>>>> For example :
>>>>
>>>> void bar();
>>>> int j;
>>>> void foo(int a, int b, int c, int d, int e, int f)
>>>> {
>>>>   int l;
>>>>   l = a + b + c + d +e + f;
>>>>   if (a != 5)
>>>> {
>>>>   bar();
>>>>   j = l;
>>>> }
>>>> }
>>>>
>>>> Code Sinking does the following:
>>>>
>>>> void bar();
>>>> int j;
>>>> void foo(int a, int b, int c, int d, int e, int f)
>>>> {
>>>>   int l;
>>>>
>>>>   if (a != 5)
>>>> {
>>>>   l = a + b + c + d +e + f;
>>>>   bar();
>>>>   j = l;
>>>> }
>>>> }
>>>>
>>>> Bootstrapped regtested on powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>> tree-ssa-sink: Improve code sinking pass
>>>>
>>>> Currently, code sinking will sink code after function calls.  This 
>>>> increases
>>>> register pressure for callee-saved registers.  The following patch improves
>>>> code sinking by placing the sunk code before calls in the use block or in
>>>> the immediate dominator of the use blocks.
>>>
>>> The patch no longer does what the description above says.
>> Why you think so. Please let me know.
> 
> You talk about calls above but the patch doesn't do anything about calls.  You
> also don't do anything about register pressure, rather the effect of
> your changes
> are to move some stmts by a smaller "distance", whatever effect that has.
> 
>>>

I have incorporated the changes in version 11 of the patch.
>>> More comments below.
>>>
>>>> 2023-10-12  Ajit Kumar Agarwal  
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> PR tree-optimization/81953
>>>> * tree-ssa-sink.cc (statement_sink_location): Move statements 
>>>> before
>>>> calls.
>>>> (select_best_block): Add heuristics to select the best blocks in 
>>>> the
>>>> immediate post dominator.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> PR tree-optimization/81953
>>>> * gcc.dg/tree-ssa/ssa-sink-20.c: New test.
>>>> * gcc.dg/tree-ssa/ssa-sink-21.c: New test.
>>>> ---
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 ++
>>>>  gcc/tree-ssa-sink.cc| 39 -
>>>>  3 files changed, 56 insertions(+), 17 deletions(-)
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
>>>>
>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
>>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>> new file mode 100644
>>>> index 000..d3b79ca5803
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>> @@ -0,0 +1,15 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-O2 -fdump-tree-sink-stats" } */
>>>> +void bar();
>>>> +int j;
>>>> +void foo(int a, int b, int c, int d, int e, int f)
>>>> +{
>>>> +  int l;
>>>> +  l = a + b + c + d +e + f;
>>>> +  if (a != 5)
>>>> +{
>>>> +  bar();
>>>> +  j = l;
>>>> +}
>>>> +}
>>>> +/* { dg-final { scan-tree-dump 
>>>> {l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
>>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

Re: [PATCH v8] tree-ssa-sink: Improve code sinking pass

2023-10-30 Thread Ajit Agarwal



On 30/10/23 5:51 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 17/10/23 2:47 pm, Richard Biener wrote:
>> On Tue, Oct 17, 2023 at 10:53 AM Ajit Agarwal  wrote:
>>>
>>> Hello Richard:
>>>
>>> On 17/10/23 2:03 pm, Richard Biener wrote:
>>>> On Thu, Oct 12, 2023 at 10:42 AM Ajit Agarwal  
>>>> wrote:
>>>>>
>>>>> This patch improves code sinking pass to sink statements before call to 
>>>>> reduce
>>>>> register pressure.
>>>>> Review comments are incorporated. Synced and modified with latest trunk 
>>>>> sources.
>>>>>
>>>>> For example :
>>>>>
>>>>> void bar();
>>>>> int j;
>>>>> void foo(int a, int b, int c, int d, int e, int f)
>>>>> {
>>>>>   int l;
>>>>>   l = a + b + c + d +e + f;
>>>>>   if (a != 5)
>>>>> {
>>>>>   bar();
>>>>>   j = l;
>>>>> }
>>>>> }
>>>>>
>>>>> Code Sinking does the following:
>>>>>
>>>>> void bar();
>>>>> int j;
>>>>> void foo(int a, int b, int c, int d, int e, int f)
>>>>> {
>>>>>   int l;
>>>>>
>>>>>   if (a != 5)
>>>>> {
>>>>>   l = a + b + c + d +e + f;
>>>>>   bar();
>>>>>   j = l;
>>>>> }
>>>>> }
>>>>>
>>>>> Bootstrapped regtested on powerpc64-linux-gnu.
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>
>>>>> tree-ssa-sink: Improve code sinking pass
>>>>>
>>>>> Currently, code sinking will sink code after function calls.  This 
>>>>> increases
>>>>> register pressure for callee-saved registers.  The following patch 
>>>>> improves
>>>>> code sinking by placing the sunk code before calls in the use block or in
>>>>> the immediate dominator of the use blocks.
>>>>
>>>> The patch no longer does what the description above says.
>>> Why you think so. Please let me know.
>>
>> You talk about calls above but the patch doesn't do anything about calls.  
>> You
>> also don't do anything about register pressure, rather the effect of
>> your changes
>> are to move some stmts by a smaller "distance", whatever effect that has.
>>
>>>>
> 
> I have incorporated the changes in version 11 of the patch.
>>>> More comments below.
>>>>
>>>>> 2023-10-12  Ajit Kumar Agarwal  
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>> PR tree-optimization/81953
>>>>> * tree-ssa-sink.cc (statement_sink_location): Move statements 
>>>>> before
>>>>> calls.
>>>>> (select_best_block): Add heuristics to select the best blocks in 
>>>>> the
>>>>> immediate post dominator.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>> PR tree-optimization/81953
>>>>> * gcc.dg/tree-ssa/ssa-sink-20.c: New test.
>>>>> * gcc.dg/tree-ssa/ssa-sink-21.c: New test.
>>>>> ---
>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
>>>>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 ++
>>>>>  gcc/tree-ssa-sink.cc| 39 -
>>>>>  3 files changed, 56 insertions(+), 17 deletions(-)
>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
>>>>>
>>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
>>>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>>> new file mode 100644
>>>>> index 000..d3b79ca5803
>>>>> --- /dev/null
>>>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>>>>> @@ -0,0 +1,15 @@
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-options "-O2 -fdump-tree-sink-stats" } */
>>>>> +void bar();
>>>>> +int j;
>>>>> +void foo(int a, int b, int c, int d, int e, 

[PATCH] tree-optimization: Add register pressure heuristics

2023-11-02 Thread Ajit Agarwal
Hello All:

Currently code sinking heuristics are based on profile data like
basic block count and sink frequency threshold. We have removed
such heuristics and added register pressure heuristics based on
live-in and live-out of early blocks and immediate dominator of
use blocks of the same loop nesting depth.

Such heuristics reduce register pressure when code is sunk within the
same loop nesting depth.

High register pressure region is the region where there are live-in of
early blocks that has been modified by the early block. If there are
modification of the variables in best block that are live-in in early
block that are live-out of best block.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

tree-optimization: Add register pressure heuristics

Currently code sinking heuristics are based on profile data like
basic block count and sink frequency threshold. We have removed
such heuristics to add register pressure heuristics based on
live-in and live-out of early blocks and immediate dominator of
use blocks.

High register pressure region is the region where there are live-in of
early blocks that has been modified by the early block. If there are
modification of the variables in best block that are live-in in early
block that are live-out of best block.

2023-11-03  Ajit Kumar Agarwal  

gcc/ChangeLog:

* tree-ssa-sink.cc (statement_sink_location): Add tree_live_info_p
as parameters.
(sink_code_in_bb): Ditto.
(select_best_block): Add register pressure heuristics to select
the best blocks in the immediate dominator for same loop nest depth.
(execute): Add live range analysis.
(additional_var_map): New function.
* tree-ssa-live.cc (set_var_live_on_entry): Add virtual operand
tests on ssa_names.
(verify_live_on_entry): Ditto.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
* gcc.dg/tree-ssa/ssa-sink-22.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 +
 gcc/tree-ssa-live.cc| 11 ++-
 gcc/tree-ssa-sink.cc| 93 ++---
 4 files changed, 104 insertions(+), 34 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..84e7938c54f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-live.cc b/gcc/tree-ssa-live.cc
index f06daf23035..998fe588278 100644
--- a/gcc/tree-ssa-live.cc
+++ b/gcc/tree-ssa-live.cc
@@ -1141,7 +1141,8 @@ set_var_live_on_entry (tree ssa_name, tree_live_info_p 
live)
 def_bb = ENTRY_BLOCK_PTR_FOR_FN (cfun);
 
   /* An undefined local variable does not need to be very alive.  */
-  if (ssa_undefined_value_p (ssa_name, false))
+  if (virtual_operand_p (ssa_name)
+  || ssa_undefined_value_p (ssa_name, false))
 return;
 
   /* Visit each use of SSA_NAME and if it isn't in the same block as the def,
@@ -1540,7 +1541,6 @@ debug (tree_live_info_d *ptr)
 
 
 /* Verify that the info in LIVE matches the current cfg.  */
-
 static void
 verify_live_on_entry (tree_live_info_p live)
 {
@@ -1569,11 +1569,13 @@ verify_live_on_entry (tree_live_info_p live)
  tree d = NULL_TREE;
  bitmap loe;
  var = partition_to_var (map, i);
+ if (var == NULL_TREE)
+   continue;
  stmt = SSA_NAME_DEF_STMT (var);
  tmp = gimple_bb (stmt);
+
  if (SSA_NAME_VAR (var))
d = ssa_default_def (cfun, SSA_NAME_VAR (var));
-
  loe = live_on_entry (live, e->dest);
  if (loe && bitmap_bit_p (loe, i))
{
@@ -1614,7 +1616,8 @@ verify_live_on_entry (tree_live_info_p live)
  {
 

Re: [PATCH] tree-optimization: Add register pressure heuristics

2023-11-03 Thread Ajit Agarwal
Hello Richard:

On 03/11/23 12:51 pm, Richard Biener wrote:
> On Thu, Nov 2, 2023 at 9:50 PM Ajit Agarwal  wrote:
>>
>> Hello All:
>>
>> Currently code sinking heuristics are based on profile data like
>> basic block count and sink frequency threshold. We have removed
>> such heuristics and added register pressure heuristics based on
>> live-in and live-out of early blocks and immediate dominator of
>> use blocks of the same loop nesting depth.
>>
>> Such heuristics reduces register pressure when code sinking is
>> done with same loop nesting depth.
>>
>> High register pressure region is the region where there are live-in of
>> early blocks that has been modified by the early block. If there are
>> modification of the variables in best block that are live-in in early
>> block that are live-out of best block.
> 
> ?!  Parse error.
> 

I didn't understand what you meant here. Please suggest.

>> Bootstrapped and regtested on powerpc64-linux-gnu.
> 
> What's the effect on code generation?
> 
> Note that live is a quadratic problem while sinking was not.  You
> are effectively making the pass unfit for -O1.
> 
> You are computing "liveness" on GIMPLE where within EBBs there
> isn't really any particular order of stmts, so it's kind of a garbage
> heuristic.  Likewise you are not computing the effect that moving
> a stmt has on liveness as far as I can see but you are just identifying
> some odd metrics (I don't really understand them) to rank blocks,
> not even taking the register file size into account.


If the live-out of best_bb <= the live-out of early_bb, that shows
that there are modifications in best_bb. It is then
safer to move statements into best_bb, as there are fewer interfering
live variables in best_bb.

If there are fewer live-out variables in best_bb, there is less chance
of interfering live ranges, and hence moving statements into best_bb
will not increase register pressure.

If the live-out of best_bb is greater than the live-out of early_bb,
moving statements into best_bb increases the chance of interfering
live ranges and hence increases register pressure.

This is how the heuristic is defined.
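
The rule described above can be written down as a tiny predicate. This is only a sketch of the stated comparison, not the code in the patch: `BlockLiveness` and `should_sink` are hypothetical names, and a block is reduced to the number of variables live on exit.

```cpp
#include <cstddef>

// Hypothetical summary of a basic block for this sketch: only the number
// of variables that are live on exit from the block is recorded.
struct BlockLiveness {
  std::size_t live_out_count;
};

// Returns true when, under the rule stated in the mail, a statement would
// be sunk from early_bb into best_bb: best_bb must not have more live-out
// variables than early_bb.
inline bool should_sink(const BlockLiveness &early_bb,
                        const BlockLiveness &best_bb) {
  return best_bb.live_out_count <= early_bb.live_out_count;
}
```

Note that the predicate only compares two counts; it does not model how the counts change once the statement has moved, which is the point raised in the review.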


> 
> You are replacing the hot/cold heuristic.

> 
> IMHO the sinking pass is the totally wrong place to do anything
> about register pressure.  You are trying to solve a scheduling
> problem by just looking at a single stmt.
> 

bb->count values from profile.cc are prone to errors, as you have
mentioned in previous mails. The main bottleneck with code
motion is the increase in register pressure, as that leads to
spills in later phases of the compiler backend.

Calculation of best_bb based on the immediate dominator should
consider register pressure instead of hot/cold regions, as that
would affect code generation.

If register pressure increases with code motion and
we are moving into colder regions, that won't improve code generation.

Hot/cold should be a criterion, but it is not the better criterion for
code motion.

We should consider register pressure with code motion rather than hot/cold
regions.

Thanks & Regards
Ajit

> Richard.
> 
>> Thanks & Regards


>> Ajit
>>
>> tree-optimization: Add register pressure heuristics
>>
>> Currently code sinking heuristics are based on profile data like
>> basic block count and sink frequency threshold. We have removed
>> such heuristics to add register pressure heuristics based on
>> live-in and live-out of early blocks and immediate dominator of
>> use blocks.
>>
>> High register pressure region is the region where there are live-in of
>> early blocks that has been modified by the early block. If there are
>> modification of the variables in best block that are live-in in early
>> block that are live-out of best block.
>>
>> 2023-11-03  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>> * tree-ssa-sink.cc (statement_sink_location): Add tree_live_info_p
>> as paramters.
>> (sink_code_in_bb): Ditto.
>> (select_best_block): Add register pressure heuristics to select
>> the best blocks in the immediate dominator for same loop nest depth.
>> (execute): Add live range analysis.
>> (additional_var_map): New function.
>> * tree-ssa-live.cc (set_var_live_on_entry): Add virtual operand
>> tests on ssa_names.
>> (verify_live_on_entry): Ditto.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.dg/tree-ssa/ssa-sink-21.c: New test.
>> * gcc.dg/tree-ssa/ssa-sink-22.c: New test.
>> ---
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
>>  gcc/testsuite/gcc

Re: [PATCH] tree-optimization: Add register pressure heuristics

2023-11-03 Thread Ajit Agarwal
Hello Richard:


On 03/11/23 7:06 pm, Richard Biener wrote:
> On Fri, Nov 3, 2023 at 11:20 AM Ajit Agarwal  wrote:
>>
>> Hello Richard:
>>
>> On 03/11/23 12:51 pm, Richard Biener wrote:
>>> On Thu, Nov 2, 2023 at 9:50 PM Ajit Agarwal  wrote:
>>>>
>>>> Hello All:
>>>>
> [...]
>>>>
>>>> High register pressure region is the region where there are live-in of
>>>> early blocks that has been modified by the early block. If there are
>>>> modification of the variables in best block that are live-in in early
>>>> block that are live-out of best block.
>>>
>>> ?!  Parse error.
>>>
>>
>> I didnt understand what you meant here. Please suggest.
> 
> I can't even guess what that paragraph means.  It fails at a
> parsing level already, I can't even start to reason about what
> the sentences mean.

Sorry for that; I will modify it.

> 
>>>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>>
>>> What's the effect on code generation?
>>>
>>> Note that live is a quadratic problem while sinking was not.  You
>>> are effectively making the pass unfit for -O1.
>>>
>>> You are computing "liveness" on GIMPLE where within EBBs there
>>> isn't really any particular order of stmts, so it's kind of a garbage
>>> heuristic.  Likewise you are not computing the effect that moving
>>> a stmt has on liveness as far as I can see but you are just identifying
>>> some odd metrics (I don't really understand them) to rank blocks,
>>> not even taking the register file size into account.
>>
>>
>> if the live out of best_bb  <= live out of early_bb, that shows
>> that there are modification in best_bb.
> 
> Hm?  Do you maybe want to say that if live_out (bb) < live_in (bb)
> then some variables die during the execution of bb?

live_out (bb) < live_in (bb) means that in bb there may be KILL (Variables)
and there are more GEN (Variables).

>   Otherwise,
> if live_out (early) > live_out (best) then somewhere on the path
> from early to best some variables die.
> 

If live_out (early) > live_out (best), it means there are more GEN (Variables)
on the path from early to best.


>> Then it's
>> safer to move statements in best_bb as there are lesser interfering
>> live variables in best_bb.
> 
> consider a stmt
> 
>  a = b + c;
> 
> where b and c die at the definition of a.  Then moving the stmt
> down from early_bb means you increase live_out (early_bb) by
> one.  So why's that "safer" then?  Of course live_out (best_bb)
> also increases by two then.
> 

If b and c die at the definition of a, and the stmt generates a, then
live_in (early_bb) would be live_out (early_bb) - 2 + 1.

Moving the stmt from early_bb down to best_bb increases live_out (early_bb)
by one, and live_out (best_bb) depends on the LIVEIN of all successors of
best_bb, which may stay the same even if we move the stmt down.

There are chances that live_out (best_bb) is greater if there are more
GEN (Variables) in all successors of best_bb. If live_out (best_bb) is less,
it means there are more KILL (Variables) in the successors of best_bb.

With my heuristics, if live_out (best_bb) > live_out (early_bb) then we don't do
code motion, as there are chances of more interfering live ranges. If
live_out (best_bb) <= live_out (early_bb) then we do code motion, as there are
more KILL (Variables) for all successors of best_bb and there is less chance of
interfering live ranges.

Moving the above stmt down from early_bb to best_bb increases
live_out (early_bb) by one, but live_out (best_bb) may remain the same. If
live_out (early_bb) increases by 1 and becomes > live_out (best_bb), then we
don't do code motion if we have more GEN (Variables) in best_bb; otherwise it
is safer to do code motion.

For the above statement, a = b + c kills b and c and generates a in early_bb, so
live_out (early_bb) increases by 1. If live_out (best_bb) is 10 before moving
and live_out (early_bb) then becomes > 10, we don't do code motion; otherwise
we do code motion.
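
For reference, the textbook backward transfer function for a single statement can be sketched as below, applied to the `a = b + c` example from the review. This is the standard liveness equation, not code from the patch or from tree-ssa-live.cc; `Stmt`, `VarSet` and `live_before` are names made up for the illustration.

```cpp
#include <set>
#include <string>

using VarSet = std::set<std::string>;

// One statement: the variable it defines and the variables it reads.
struct Stmt {
  std::string def;
  VarSet uses;
};

// Standard backward transfer function for one statement:
//   live_before = (live_after minus {def}) union uses
VarSet live_before(const Stmt &s, VarSet live_after) {
  live_after.erase(s.def);                          // the definition kills def
  live_after.insert(s.uses.begin(), s.uses.end());  // the operands become live
  return live_after;
}
```

With `a` and some unrelated `x` live after the statement, `b`, `c` and `x` are live before it. So while the statement sits in early_bb only `a` has to survive the end of that block; once it is sunk, both `b` and `c` have to, which is the "increase by one" mentioned in the review.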





>> if there are lesser live out in best_bb, there is lesser chance
>> of interfering live ranges and hence moving statements in best_bb
>> will not increase register pressure.
>>
>> If the liveout of best_bb is greater than live-out of early_bb,
>> moving statements in best_bb will increase chances of more interfering
>> live ranges and hence increase in register pressure.
>>
>> This is how the heuristics is defined.
> 
> I don't think they will work.  Do you have any numbers?
>

My heuristics will work as mentioned above. I will  run the spec benchmarks
and will able 

Re: [PATCH] tree-optimization: Add register pressure heuristics

2023-11-04 Thread Ajit Agarwal
Hello Richard:

Below are the performance numbers on the SPEC CPU 2017 benchmarks with and
without the register pressure changes for code sinking.

INT Benchmarks:

With register pressure code sinking changes:


                                        Estimated                        Estimated
                    Base     Base       Base         Peak     Peak       Peak
Benchmarks          Copies   Run Time   Rate         Copies   Run Time   Rate
------------------  ------   --------   ---------    ------   --------   ---------

500.perlbench_r          1        363        4.39 *
502.gcc_r                1        225        6.30 *
505.mcf_r                1        289        5.59 *
520.omnetpp_r            1        315        4.17 *
523.xalancbmk_r          1        238        4.43 *
525.x264_r               1        180        9.75 *
531.deepsjeng_r          1        291        3.94 *
541.leela_r              1        463        3.58 *
548.exchange2_r          1        222       11.8  *
557.xz_r                 1        323        3.34 *
 Est. SPECrate(R)2017_int_base               5.24

Trunk without any register pressure code sinking changes:

Benchmarks          Copies   Run Time   Rate
------------------  ------   --------   ---------
500.perlbench_r          1        358        4.44 *
502.gcc_r                1        225        6.28 *
505.mcf_r                1        286        5.64 *
520.omnetpp_r            1        310        4.24 *
523.xalancbmk_r          1        235        4.50 *
525.x264_r               1        181        9.69 *
531.deepsjeng_r          1        291        3.94 *
541.leela_r              1        465        3.56 *
548.exchange2_r          1        219       11.9  *
557.xz_r                 1        325        3.32 *
 Est. SPECrate(R)2017_int_base               5.26

FP benchmarks:

With register pressure code sinking changes:

Benchmarks          Copies   Run Time   Rate
------------------  ------   --------   ---------
503.bwaves_r             1        187       53.5  *
507.cactuBSSN_r          1        235        5.39 *
508.namd_r               1        216        4.40 *
510.parest_r             1        340        7.69 *
511.povray_r             1        488        4.78 *
519.lbm_r                1        128        8.21 *
521.wrf_r                1        269        8.34 *
526.blender_r            1        311        4.89 *
527.cam4_r               1        314        5.56 *
538.imagick_r            1        228       10.9  *
544.nab_r                1        293        5.74 *
549.fotonik3d_r          1        237       16.4  *
554.roms_r               1        269        5.90 *
 Est. SPECrate(R)2017_fp_base                7.97

Trunk Without register pressure changes for code sinking:

Benchmarks          Copies   Run Time   Rate
------------------  ------   --------   ---------
503.bwaves_r             1        188       53.4  *
507.cactuBSSN_r          1        242        5.24 *
508.namd_r               1        215        4.42 *
510.parest_r             1        333        7.86 *
511.povray_r             1        481        4.85 *
519.lbm_r                1        128        8.22 *
521.wrf_r                1        269        8.34 *
526.blender_r            1        309        4.93 *
527.cam4_r               1        313        5.58 *
538.imagick_r            1        227       11.0  *
544.nab_r                1        291        5.79 *
549.fotonik3d_r          1        235       16.6  *
554.roms_r               1        268        5.92 *
 Est. SPECrate(R)2017_fp_base                8.00
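
The summary lines can be cross-checked from the per-benchmark rates, since the overall SPECrate figure is the geometric mean of the individual rates. The helper below is a small sketch for doing that and for expressing the difference between the two runs as a percentage; `geometric_mean` and `percent_change` are names chosen for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of a non-empty list of positive rates, computed through
// logarithms to avoid overflow in the running product.
double geometric_mean(const std::vector<double> &rates) {
  assert(!rates.empty());
  double log_sum = 0.0;
  for (double r : rates)
    log_sum += std::log(r);
  return std::exp(log_sum / static_cast<double>(rates.size()));
}

// Relative change of the patched result against the baseline, in percent.
double percent_change(double with_patch, double baseline) {
  return 100.0 * (with_patch - baseline) / baseline;
}
```

Feeding in the ten INT rates of the patched run reproduces the reported 5.24 up to rounding of the inputs, and `percent_change (5.24, 5.26)` and `percent_change (7.97, 8.00)` both come out slightly below zero, at roughly -0.4%.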


Thanks & Regards
Ajit

On 03/11/23 8:24 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> 
> On 03/11/23 7:06 pm, Richard Biener wrote:
>> On Fri, Nov 3, 2023 at 11:20 AM Ajit Agarwal  wrote:
>>>
>>> Hello Richard:
>>>
>>> On 03/11/23 12:51 pm, Richard Biener wrote:
>>>> On Thu, Nov 2, 2023 at 9:50 PM Ajit Agarwal  wrote:
>>>>>
>>>>> Hello All:
>>>>>
>> [...]
>>>>>
>>>>> High register pressure region is the region where there are live-in of
>>>>> early blocks that has been modified by the early block. If there are
>>>>> modification of the variables in best block that are live-in in early
>>>>> block that are live-out of best block.
>>>>
>>>> ?!  Parse error.
>>>>
>>>
>>> I didnt understand what you meant here. Please suggest.
>>
>> I can't even guess what that paragraph means.  It fails at a
>> parsing level already, I can't even start to reason about what
>> the sentences mean.
> 
> Sorry for that I will modify.
> 
>>
>>>>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>>>
>>>> What's the effect on code generation?
>>>>
>>>> Note that live is a quadratic problem while sinking was not.  You
>>>> are effectively making the pass unfit for -O1.
>>>>
>>>> You are computing "liveness" on GIMPLE where within EBBs there
>>>> isn't really any particular order of stmts, so it's kind of a garbage
>>>> heuristic.  Likewise you are not computing the effect that moving
>>>> a stmt has on liveness as fa

Re: [PATCH V3 2/2] rs6000: Load store fusion for rs6000 target using common infrastructure

2024-03-07 Thread Ajit Agarwal
Hello Segher:

On 01/03/24 3:02 am, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Feb 19, 2024 at 04:24:37PM +0530, Ajit Agarwal wrote:
>> --- a/gcc/config.gcc
>> +++ b/gcc/config.gcc
>> @@ -518,7 +518,7 @@ or1k*-*-*)
>>  ;;
>>  powerpc*-*-*)
>>  cpu_type=rs6000
>> -extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
>> rs6000-vecload-fusion.o"
> 
> Line too long.

I will incorporate this change.
> 
>> +  /* Pass to replace adjacent memory addresses lxv instruction with lxvp
>> + instruction.  */
>> +  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_analyze_vecload);
> 
> That is not such a great name.  Any pass name with "analyze" is not so
> good -- the pass does much more than just "analyze" things!
> 

I will change that and incorporate that.

>> --- /dev/null
>> +++ b/gcc/config/rs6000/rs6000-vecload-fusion.cc
>> @@ -0,0 +1,701 @@
>> +/* Subroutines used to replace lxv with lxvp
>> +   for TARGET_POWER10 and TARGET_VSX,
> 
> The pass filename is not good then, either.
>

I will change and incorporate it.
 
>> +   Copyright (C) 2020-2023 Free Software Foundation, Inc.
> 
> What in here is from 2020?
> 
> Most things will be from 2024, too.  First publication date is what
> counts.

Please let me know the second publication date.

> 
>> +   Contributed by Ajit Kumar Agarwal .
> 
> We don't say such things in the files normally.
>

Yes I will remove it.
 
>> +class rs6000_pair_fusion : public pair_fusion
>> +{
>> +public:
>> +  rs6000_pair_fusion (bb_info *bb) : pair_fusion (bb) {reg_ops = NULL;};
>> +  bool is_fpsimd_op_p (rtx reg_op, machine_mode mem_mode, bool load_p);
>> +  bool pair_mem_ok_policy (rtx first_mem, bool load_p, machine_mode mode)
>> +  {
>> +return !(first_mem || load_p || mode);
>> +  }
> 
> It is much more natural to write this as
>   return !first_mem && !load && !mode;
> 
> (_p is wrong, this is not a predicate, it is not a function at all!)
> 

Surely I will do that.
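
The two spellings are equivalent by De Morgan's law. The sketch below uses plain `bool` parameters as stand-ins for the real `rtx` and `machine_mode` arguments (an assumption made only to keep the example self-contained), and the function names are hypothetical.

```cpp
// Form used in the patch: negation of a disjunction.
inline bool policy_original(bool first_mem, bool load_p, bool mode) {
  return !(first_mem || load_p || mode);
}

// Form suggested in the review: conjunction of negations.
inline bool policy_rewritten(bool first_mem, bool load_p, bool mode) {
  return !first_mem && !load_p && !mode;
}
```

Both functions return true only when all three inputs are false, so the rewrite changes readability and not behaviour.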

> What is "!mode" for here?  How can VOIDmode happen here?  What does it
> mean?  This needs to be documented.
>
Yes I will document that.

 
>> +  bool pair_check_register_operand (bool load_p, rtx reg_op,
>> +machine_mode mem_mode)
>> +  {
>> +if (load_p || reg_op || mem_mode)
>> +  return false;
>> +else
>> +  return false;
>> +  }
> 
> The compiler will have warned for this.  Please look at all compiler
> (and other) warnings that you introduce.
>

As far as I can tell, I didn't see any extra warnings,
but I will surely cross-check and fix that.
 
>> +rs6000_pair_fusion::is_fpsimd_op_p (rtx reg_op, machine_mode mem_mode, bool 
>> load_p)
>> +{
>> +  return !((reg_op && mem_mode) || load_p);
>> +}
> 
> For more complex logic, split it up into two or more conditional
> returns.
> 

Surely I will do that.

>> +// alias_walker that iterates over stores.
>> +template
>> +class store_walker : public def_walker
> 
> That is not a good comment.  You should describe parameters and return
> values and that kind of thing.  That it walks over things is bloody
> obvious from the name already :-)
>

This part of the code is taken from the aarch64 load/store fusion
pass.  I have split aarch64-ldp-fusion.cc into target-independent code and
target-dependent code. The target-independent code is shared
across all architectures, in this case rs6000 and aarch64.
The target-dependent code is implemented through pure virtual functions.

While doing this, I have not changed the target-independent code; it is
taken as it is from aarch64-ldp-fusion.cc.

This is how they have added comments.

The target-dependent code lives in rs6000-vecload-fusion.cc and
aarch64-ldp-fusion.cc.

The target-independent code is spread across 3 files.

gcc/pair-fusion-base.h

This file has the declaration of the pair_fusion base class and
other class declarations, along with prototypes of the common
functions in gcc/pair-fusion-common.cc.

gcc/pair-fusion-common.cc

This file has the common helper functions, shared across all
architectures, that are used
inside the pair_fusion class member functions.

gcc/pair-fusion.cc

This file has the implementation of the member functions of the pair_fusion
class.

The architecture-dependent files are
rs6000-vecload-fusion-pass/aarch64-ldp-fusion.cc. These have the implementation
of the classes derived from the pair_fusion class,
with target-specific code added.
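
The split described above, with shared driver code in a base class and target hooks as pure virtual functions, can be sketched as follows. None of these names come from the patch: `pair_fusion_sketch`, `pair_ok`, `pair_insn_name` and the two derived classes are hypothetical, and the size checks are simplified for illustration.

```cpp
#include <string>

// Hypothetical stand-in for the target-independent base class.
class pair_fusion_sketch {
public:
  virtual ~pair_fusion_sketch() = default;

  // Target hooks, supplied by each architecture.
  virtual bool pair_ok(unsigned access_bytes) const = 0;
  virtual std::string pair_insn_name() const = 0;

  // Target-independent driver code, shared by every target.
  std::string describe(unsigned access_bytes) const {
    if (pair_ok(access_bytes))
      return "fuse into " + pair_insn_name();
    return std::string("keep separate");
  }
};

// Illustrative rs6000 hooks: pair two 16-byte vector loads into lxvp.
class rs6000_sketch final : public pair_fusion_sketch {
public:
  bool pair_ok(unsigned access_bytes) const override {
    return access_bytes == 16;
  }
  std::string pair_insn_name() const override { return "lxvp"; }
};

// Illustrative aarch64 hooks: pair 4-, 8- or 16-byte accesses into ldp.
class aarch64_sketch final : public pair_fusion_sketch {
public:
  bool pair_ok(unsigned access_bytes) const override {
    return access_bytes == 4 || access_bytes == 8 || access_bytes == 16;
  }
  std::string pair_insn_name() const override { return "ldp"; }
};
```

The driver in `describe` never names a target, so supporting another architecture only means adding one more derived class.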

>> +extern insn_info *
>> +find_trailing_add (insn_info *insn

[PATCH V12]: Improve code sinking pass

2024-03-13 Thread Ajit Agarwal
Hello All:

Currently, code sinking will sink code to the use points in loops with the same
nesting depth. The following patch improves code sinking by placing the sunk
code in the immediate dominator with the same loop nest depth.

Changes since v11:

Reorganization of the code.

For example :

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d +e + f;
  if (a != 5)
{
  bar();
  j = l;
}
}

Code Sinking does the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;

  if (a != 5)
{
  l = a + b + c + d +e + f;
  bar();
  j = l;
}
}
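
The selection step in the patch description above can be sketched on a toy dominator tree as follows. This is an illustration and not the code added to select_best_block: `Block` and `pick_same_depth_dominator` are hypothetical names, blocks are plain indices, and the sketch assumes early_bb dominates late_bb.

```cpp
#include <cassert>
#include <vector>

// Toy basic block: the index of its immediate dominator (-1 for the entry
// block) and its loop nesting depth.
struct Block {
  int idom;
  int loop_depth;
};

// Walk from late_bb up the dominator tree towards early_bb (exclusive) and
// return the block nearest to early_bb whose loop depth equals late_bb's.
int pick_same_depth_dominator(const std::vector<Block> &cfg,
                              int early_bb, int late_bb) {
  int best = late_bb;
  for (int bb = late_bb; bb != early_bb; bb = cfg[bb].idom) {
    assert(bb >= 0 && "early_bb must dominate late_bb");
    if (cfg[bb].loop_depth == cfg[best].loop_depth)
      best = bb;
  }
  return best;
}
```

For a chain 0 -> 1 -> 2 -> 3 with loop depths 0, 0, 1, 0, sinking from block 0 towards block 3 selects block 1: it is the dominator closest to the original block that is still at the same depth as the use block.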

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards


tree-ssa-sink: Improve code sinking pass

Currently, code sinking will sink code at the use points with loop having same
nesting depth. The following patch improves code sinking by placing the sunk
code in immediate dominator with same loop nest depth.

2024-03-13  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR tree-optimization/81953
* tree-ssa-sink.cc (statement_sink_location): Move statements with
same loop nest depth.
(select_best_block): Add heuristics to select the best blocks in the
immediate dominator for same loop nest depth.

gcc/testsuite/ChangeLog:

PR tree-optimization/81953
* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
* gcc.dg/tree-ssa/ssa-sink-22.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c | 19 +++
 gcc/tree-ssa-sink.cc| 26 +
 3 files changed, 55 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
new file mode 100644
index 000..84e7938c54f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-22.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_13\s+=\s+_4\s+\+\s+f_12\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 880d6f70a80..40f51e2f3b9 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -176,6 +176,9 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool 
*debug_stmts)
tree, return the best basic block between them (inclusive) to place
statements.
 
+   The best basic block should be an immediate dominator of
+   best basic block if we've moved to same loop nest.
+
We want the most control dependent block in the shallowest loop nest.
 
If the resulting block is in a shallower loop nest, then use it.  Else
@@ -209,6 +212,21 @@ select_best_block (basic_block early_bb,
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 }
 
+  temp_bb = best_bb;
+  /* Move sinking to immediate dominator if the statement to be moved
+ is not memory operand and same loop nest.  */
+  if (best_bb == late_bb
+  && !gimple_vuse (stmt))
+{
+  while (temp_bb != early_bb)
+   {
+ if (bb_loop_depth (temp_bb) == bb_loop_depth (best_bb))
+   best_bb = temp_bb;
+
+ temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
+   }
+ }
+
   /* Placing a statement before a setjmp-like function would be invalid
  (it cannot be reevaluated when execution follows an abnormal edge).
  If we selected a block with abnormal predecessors, just punt.  */
@@ -250,7 +268,7 @@ select_best_block (basic_block early_bb,
   /* If result of comparsion is unknown, prefer EARLY_BB.
 Thus use !(...>=..) rather than (...<...)  */
   && !(best_bb->count * 100 >= early_bb->count * threshold))
-return best_bb;
+ return best_bb;
 
   /* No better block found, so return EARLY_BB, which happens to be the
  statement's original block.  */
@@ -430,6 +448,7 @@ statement_sink_location (gimple *stmt, basic_block frombb,
continue;
  brea

[PATCH V3 3/4] ree: Improve ree pass.

2024-03-13 Thread Ajit Agarwal
Hello All:

On the rs6000 target we see redundant zero and sign extensions. This patch
improves the ree pass so that it eliminates them. It adds support for
zero_extend, sign_extend and AND, including AND used as an extension with
constants other than 1, such as 0x7/0x7F/0x7.
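
To illustrate why an AND with a suitable constant can stand in for a zero
extension, here is a minimal standalone sketch. It is not code from the patch:
the helper names are hypothetical, and the mask 0xFF is chosen only for
illustration (the constants the pass actually accepts are the ones listed in
the patch itself).

```cpp
#include <cstdint>

// Zero-extend the low 8 bits of X to 64 bits with an explicit narrowing cast.
constexpr std::uint64_t
zext8 (std::uint64_t x)
{
  return static_cast<std::uint64_t> (static_cast<std::uint8_t> (x));
}

// The same value computed as an AND with the all-ones low-byte mask.
constexpr std::uint64_t
and_mask8 (std::uint64_t x)
{
  return x & 0xFFu;
}

// Hypothetical helper (not from the patch): true if MASK has the form
// 2^n - 1, i.e. a contiguous run of set bits starting at bit 0.
constexpr bool
low_bit_mask_p (std::uint64_t mask)
{
  return mask != 0 && (mask & (mask + 1)) == 0;
}

// Both forms agree, so an extension applied after such an AND is redundant.
static_assert (zext8 (0x1234) == and_mask8 (0x1234),
	       "AND with 0xFF behaves like a zero extension from 8 bits");
static_assert (low_bit_mask_p (1) && low_bit_mask_p (0xFF),
	       "1 and 0xFF are low-bit masks");
static_assert (!low_bit_mask_p (0xF0), "0xF0 is not a low-bit mask");
```

Once the upper bits are known to be zero after the AND, a following
zero_extend of the same register adds no information, which is the kind of
redundancy the pass is meant to remove.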

Changes since v2:

- Added all constants 0x7/0x7F/0x7 other than 1 for machine modes.
- Improving coding conventions.
- Reorganization of the code.

Bootstrapped and regtested for powerpc64-linux-gnu.

contrib/check_GNU_style.sh looks good.

SPEC 2017 INT and FP benchmark runs look good.

Thanks & Regards
Ajit


ree: Improve ree pass for rs6000 target

On the rs6000 target we see redundant zero and sign extensions. This patch
improves the ree pass so that it eliminates them. It adds support for
zero_extend, sign_extend and AND, including AND used as an extension with
constants other than 1, such as 0x7/0x7F/0x7.

2024-03-13  Ajit Kumar Agarwal  

gcc/ChangeLog:

* ree.cc (eliminate_across_bbs_p): Add checks to enable extension
elimination across and within basic blocks.
(def_arith_p): New function to check definition has arithmetic
operation.
(combine_set_extension): Modification to incorporate AND
and current zero_extend and sign_extend instruction.
(merge_def_and_ext): Add calls to eliminate_across_bbs_p and
zero_extend sign_extend and AND instruction.
(rtx_is_zext_p): New function.
(feasible_cfg): New function.
* rtl.h (reg_used_set_between_p): Add prototype.
* rtlanal.cc (reg_used_set_between_p): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/zext-elim.C: New testcase.
* g++.target/powerpc/zext-elim-1.C: New testcase.
* g++.target/powerpc/zext-elim-2.C: New testcase.
* g++.target/powerpc/sext-elim.C: New testcase.
---
Changes since v2:

- Added all constants 0x7/0x7F/0x7 other than 1 for machine modes.
- Improving coding conventions.
- Reorganization of the code.
---
 gcc/ree.cc| 517 --
 gcc/rtl.h |   1 +
 gcc/rtlanal.cc|  15 +
 gcc/testsuite/g++.target/powerpc/sext-elim.C  |  16 +
 .../g++.target/powerpc/zext-elim-1.C  |  18 +
 .../g++.target/powerpc/zext-elim-2.C  |  10 +
 gcc/testsuite/g++.target/powerpc/zext-elim.C  |  29 +
 7 files changed, 558 insertions(+), 48 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/sext-elim.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim-2.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/zext-elim.C

diff --git a/gcc/ree.cc b/gcc/ree.cc
index bfc4b4b0412..43fed62d755 100644
--- a/gcc/ree.cc
+++ b/gcc/ree.cc
@@ -253,6 +253,77 @@ struct ext_cand
 
 static int max_insn_uid;
 
+/* Return TRUE if OP can be considered a zero extension from one or
+   more sub-word modes to larger modes up to a full word.
+
+   For example (and:DI (reg) (const_int X))
+
+   Depending on the value of X could be considered a zero extension
+   from QI, HI and SI to larger modes up to DImode.  */
+
+static bool
+rtx_is_zext_p (rtx insn)
+{
+  if (GET_CODE (insn) == AND)
+{
+  rtx set = XEXP (insn, 0);
+  if (REG_P (set))
+   {
+ rtx src = XEXP (insn, 1);
+ machine_mode m_mode = GET_MODE (set);
+
+ if (CONST_INT_P (src)
+ && (INTVAL (src) == 1
+ || (m_mode == QImode && INTVAL (src) == 0x7)
+ || (m_mode == QImode && INTVAL (src) == 0x007F)
+ || (m_mode == HImode && INTVAL (src) == 0x7FFF)
+ || (m_mode == SImode && INTVAL (src) == 0x007F)))
+   return true;
+
+   }
+  else
+   return false;
+}
+
+  return false;
+}
+/* Return TRUE if OP can be considered a zero extension from one or
+   more sub-word modes to larger modes up to a full word.
+
+   For example (and:DI (reg) (const_int X))
+
+   Depending on the value of X could be considered a zero extension
+   from QI, HI and SI to larger modes up to DImode.  */
+
+static bool
+rtx_is_zext_p (rtx_insn *insn)
+{
+  rtx body = single_set (insn);
+
+  if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) == AND)
+   {
+ rtx set = XEXP (SET_SRC (body), 0);
+
+ if (REG_P (set) && GET_MODE (SET_DEST (body)) == GET_MODE (set))
+   {
+ rtx src = XEXP (SET_SRC (body), 1);
+ machine_mode m_mode = GET_MODE (set);
+
+ if (CONST_INT_P (src)
+ && (INTVAL (src) == 1
+ || (m_mode == QImode && INTVAL (src) == 0x7)
+ || (m_mode == QImode && INTVAL (src) == 0x007F)
+ || (m_mode == HImode && INTVAL (src) == 0x7FFF)
+ || (m_mode == SImode && INTVAL (src) == 0x007F)))
+   return true;
+   }
+

[PATCH] tree-ssa-sink: Improve code sinking pass

2024-03-13 Thread Ajit Agarwal
Hello Richard:

Currently, code sinking sinks code to its use points when the loop nesting
depth is the same. The following patch improves code sinking by placing the
sunk code at the beginning of the block, after the labels.

For example :

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d +e + f;
  if (a != 5)
{
  bar();
  j = l;
}
}

With this patch, code sinking produces the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;

  if (a != 5)
{
  l = a + b + c + d +e + f;
  bar();
  j = l;
}
}
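
As a rough illustration of the placement rule, here is a toy model in
standalone C++. It does not use GCC's gimple iterators; toy_stmt, after_labels
and sink_into are hypothetical names that only mimic the idea of inserting at
the first position after any labels in the destination block.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy statement: either a label or an ordinary statement.
struct toy_stmt
{
  bool is_label;
  std::string text;
};

// Index of the first non-label statement in BLOCK, or the block size if
// the block holds only labels.
std::size_t
after_labels (const std::vector<toy_stmt> &block)
{
  std::size_t i = 0;
  while (i < block.size () && block[i].is_label)
    ++i;
  return i;
}

// Insert SUNK ahead of every non-label statement of BLOCK, rather than
// immediately before the statement that uses its result.
void
sink_into (std::vector<toy_stmt> &block, const toy_stmt &sunk)
{
  std::ptrdiff_t pos = static_cast<std::ptrdiff_t> (after_labels (block));
  block.insert (block.begin () + pos, sunk);
}
```

Applied to a block holding a label, the call to bar and the store to j, the
computation of l ends up right after the label and before the call, as in the
example above.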

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards

tree-ssa-sink: Improve code sinking pass

Currently, code sinking sinks code to its use points when the loop nesting
depth is the same. The following patch improves code sinking by placing the
sunk code at the beginning of the block, after the labels.

2024-03-13  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR tree-optimization/81953
* tree-ssa-sink.cc (statement_sink_location): Sink statements at
the beginning of the basic block after labels.

gcc/testsuite/ChangeLog:

PR tree-optimization/81953
* gcc.dg/tree-ssa/ssa-sink-21.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 15 +++
 gcc/tree-ssa-sink.cc|  7 ++-
 2 files changed, 17 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..d3b79ca5803
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump 
{l_12\s+=\s+_4\s+\+\s+f_11\(D\);\n\s+bar\s+\(\)} sink1 } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index 880d6f70a80..1ec5c048fe7 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -208,7 +208,6 @@ select_best_block (basic_block early_bb,
 loop nest.  */
   temp_bb = get_immediate_dominator (CDI_DOMINATORS, temp_bb);
 }
-
   /* Placing a statement before a setjmp-like function would be invalid
  (it cannot be reevaluated when execution follows an abnormal edge).
  If we selected a block with abnormal predecessors, just punt.  */
@@ -430,6 +429,7 @@ statement_sink_location (gimple *stmt, basic_block frombb,
continue;
  break;
}
+
   use = USE_STMT (one_use);
 
   if (gimple_code (use) != GIMPLE_PHI)
@@ -439,10 +439,7 @@ statement_sink_location (gimple *stmt, basic_block frombb,
  if (sinkbb == frombb)
return false;
 
- if (sinkbb == gimple_bb (use))
-   *togsi = gsi_for_stmt (use);
- else
-   *togsi = gsi_after_labels (sinkbb);
+ *togsi = gsi_after_labels (sinkbb);
 
  return true;
}
-- 
2.39.3



[PATCH V1 0/1] rs6000: Load store fusion for rs6000 target using common infrastructure

2024-03-13 Thread Ajit Agarwal
Hello All:

This adds common infrastructure, built from generic code, for load/store
fusion on the rs6000 target.

This is patch 0 of the split series. It implements and defines the generic
code that target-specific code for the aarch64 and rs6000 targets can use.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.
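
To make the split between generic and target code concrete, here is a minimal
sketch of the pattern. It is not the interface from the patch: the class and
hook names below are hypothetical, and the alignment rule in the derived class
is an assumption made only for illustration.

```cpp
#include <string>

// Hypothetical target-independent driver.  The generic logic lives in the
// base class and calls pure virtual hooks that each target must implement.
struct pair_fusion_sketch
{
  virtual ~pair_fusion_sketch () = default;

  // Target hooks (illustrative names, not taken from the patch).
  virtual bool pair_mem_ok_p (long offset, unsigned size) const = 0;
  virtual std::string pair_insn_name (bool load_p) const = 0;

  // Generic logic: two accesses of SIZE bytes can be fused if the second
  // one directly follows the first and the target accepts the first offset.
  bool can_fuse (long off1, long off2, unsigned size) const
  {
    return off2 - off1 == static_cast<long> (size)
	   && pair_mem_ok_p (off1, size);
  }
};

// Hypothetical target-specific part.
struct rs6000_sketch : pair_fusion_sketch
{
  bool pair_mem_ok_p (long offset, unsigned size) const override
  {
    // Assumption for this sketch only: 16-byte vector accesses whose
    // offset is a multiple of 16.
    return size == 16 && offset % 16 == 0;
  }

  std::string pair_insn_name (bool load_p) const override
  {
    return load_p ? "lxvp" : "stxvp";
  }
};
```

With this shape, a second target only has to provide its own derived class,
while the adjacency check in can_fuse stays shared.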

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

rs6000: Load store fusion for rs6000 target using common infrastructure

Common infrastructure, built from generic code, for load/store fusion on
the rs6000 target.

The generic code is implemented and defined so that target-specific code
for the aarch64 and rs6000 targets can use it.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.

2024-03-13  Ajit Kumar Agarwal  

gcc/ChangeLog:

* pair-fusion-base.h: Generic header code for load store fusion
that can be shared across different architectures.
* pair-fusion-common.cc: Generic source code for load store
fusion that can be shared across different architectures.
* pair-fusion.cc: Generic implementation of pair_fusion class
defined in pair-fusion-base.h.
* Makefile.in: Add new object files pair-fusion.o and
pair-fusion-common.o.
---
 gcc/Makefile.in   |2 +
 gcc/pair-fusion-base.h|  618 +++
 gcc/pair-fusion-common.cc | 1204 
 gcc/pair-fusion.cc| 1232 +
 4 files changed, 3056 insertions(+)
 create mode 100644 gcc/pair-fusion-base.h
 create mode 100644 gcc/pair-fusion-common.cc
 create mode 100644 gcc/pair-fusion.cc

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a74761b7ab3..df5061ddfe7 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1563,6 +1563,8 @@ OBJS = \
ipa-strub.o \
ipa.o \
ira.o \
+   pair-fusion-common.o \
+   pair-fusion.o \
ira-build.o \
ira-costs.o \
ira-conflicts.o \
diff --git a/gcc/pair-fusion-base.h b/gcc/pair-fusion-base.h
new file mode 100644
index 000..53393c1f823
--- /dev/null
+++ b/gcc/pair-fusion-base.h
@@ -0,0 +1,618 @@
+// Generic code for Pair MEM  fusion optimization pass.
+// Copyright (C) 2024 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it
+// under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 3, or (at your option)
+// any later version.
+//
+// GCC is distributed in the hope that it will be useful, but
+// WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+// General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+#ifndef GCC_PAIR_FUSION_H
+#define GCC_PAIR_FUSION_H
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-iter.h"
+#include "rtl-ssa.h"
+#include "cfgcleanup.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "tree-dfa.h"
+#include "fold-const.h"
+#include "tree-hash-traits.h"
+#include "print-tree.h"
+#include "insn-attr.h"
+using namespace rtl_ssa;
+// We pack these fields (load_p, fpsimd_p, and size) into an integer
+// (LFS) which we use as part of the key into the main hash tables.
+//
+// The idea is that we group candidates together only if they agree on
+// the fields below.  Candidates that disagree on any of these
+// properties shouldn't be merged together.
+struct lfs_fields
+{
+  bool load_p;
+  bool fpsimd_p;
+  unsigned size;
+};
+
+using insn_list_t = std::list<insn_info *>;
+using insn_iter_t = insn_list_t::iterator;
+
+// Information about the accesses at a given offset from a particular
+// base.  Stored in an access_group, see below.
+struct access_record
+{
+  poly_int64 offset;
+  std::list<insn_info *> cand_insns;
+  std::list<access_record>::iterator place;
+
+  access_record (poly_int64 off) : offset (off) {}
+};
+
+// A group of accesses where adjacent accesses could be ldp/stp
+// candidates.  The splay tree supports efficient insertion,
+// while the list supports efficient iteration.
+struct access_group
+{
+  splay_tree<access_record *> tree;
+  std::list<access_record> list;
+
+  template<typename Alloc>
+  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *
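
The header comment on lfs_fields says the three fields are packed into one
integer used as part of a hash key. Here is a minimal sketch of such a
packing. The bit layout is an assumption made for this example (bit 3 for
load_p, bit 2 for fpsimd_p, the low two bits for log2 of the size minus 2),
and encode_lfs and decode_lfs are written here as illustrative helpers rather
than copied from the patch.

```cpp
#include <cassert>

struct lfs_fields
{
  bool load_p;
  bool fpsimd_p;
  unsigned size;
};

// Assumed layout for this sketch: bit 3 = load_p, bit 2 = fpsimd_p,
// bits 1..0 = log2 (size) - 2, so SIZE must be 4, 8 or 16 bytes.
int
encode_lfs (lfs_fields f)
{
  assert (f.size == 4 || f.size == 8 || f.size == 16);
  int size_log2 = 0;
  for (unsigned s = f.size; s > 1; s >>= 1)
    ++size_log2;
  return (int (f.load_p) << 3) | (int (f.fpsimd_p) << 2) | (size_log2 - 2);
}

// Inverse of encode_lfs under the same assumed layout.
lfs_fields
decode_lfs (int lfs)
{
  lfs_fields f;
  f.load_p = (lfs & (1 << 3)) != 0;
  f.fpsimd_p = (lfs & (1 << 2)) != 0;
  f.size = 1u << ((lfs & 3) + 2);
  return f;
}
```

Two candidates then share a key only when all three fields agree, which is
the grouping property the comment describes.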

[PATCH V1 1/1] rs6000: Load store fusion for rs6000 target using common infrastructure

2024-03-13 Thread Ajit Agarwal
Hello All:

This adds common infrastructure, built from generic code, for load/store
fusion on the rs6000 target.

The generic code is implemented and defined so that target-specific code
for the aarch64 and rs6000 targets can use it.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

rs6000: Load store fusion for rs6000 target using common infrastructure

Common infrastructure, built from generic code, for load/store fusion on
the rs6000 target.

The generic code is implemented and defined so that target-specific code
for the aarch64 and rs6000 targets can use it.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.

2024-03-13  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: New mem fusion pass
before pass_early_remat.
* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
Add target specific implementation using pure virtual
functions.
* config.gcc: Add new object file.
* config/rs6000/rs6000-protos.h: Add new prototype for mem
fusion pass.
* config/rs6000/rs6000.cc: Add new prototype for mem fusion
pass.
* config/rs6000/t-rs6000: Add new rule.
* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
from private.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/mem-fusion.C: New test.
* g++.target/powerpc/mem-fusion-1.C: New test.
* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc|   2 +
 gcc/config/rs6000/rs6000-mem-fusion.cc| 704 ++
 gcc/config/rs6000/rs6000-passes.def   |   4 +-
 gcc/config/rs6000/rs6000-protos.h |   1 +
 gcc/config/rs6000/rs6000.cc   |   1 +
 gcc/config/rs6000/t-rs6000|   5 +
 gcc/rtl-ssa/accesses.h|   2 +-
 .../g++.target/powerpc/mem-fusion-1.C |  22 +
 gcc/testsuite/g++.target/powerpc/mem-fusion.C |  15 +
 .../gcc.target/powerpc/mma-builtin-1.c|   4 +-
 10 files changed, 756 insertions(+), 4 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-mem-fusion.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 624e0dae191..52ecd66dcc6 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -522,6 +522,7 @@ powerpc*-*-*)
extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
+   extra_objs="${extra_objs} rs6000-mem-fusion.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
@@ -558,6 +559,7 @@ rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt 
rs6000/rs6000-tables.opt"
extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
+   extra_objs="${extra_objs} rs6000-mem-fusion.o"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-logue.cc 
\$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
;;
diff --git a/gcc/config/rs6000/rs6000-mem-fusion.cc 
b/gcc/config/rs6000/rs6000-mem-fusion.cc
new file mode 100644
index 000..3a92d5be61c
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-mem-fusion.cc
@@ -0,0 +1,704 @@
+/* Subroutines used to replace lxv with lxvp
+   for TARGET_POWER10 and TARGET_VSX,
+
+   Copyright (C) 2020-2023 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal .
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */

[PATCH V2 0/1] rs6000: Load store fusion for rs6000 target using common infrastructure

2024-03-13 Thread Ajit Agarwal


Hello All:

This adds common infrastructure, built from generic code, for load/store
fusion on the rs6000 target.

This is patch 0 of the split series. It implements and defines the generic
code that target-specific code for the aarch64 and rs6000 targets can use.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Load store fusion for rs6000 target using common infrastructure

Common infrastructure, built from generic code, for load/store fusion on
the rs6000 target.

The generic code is implemented and defined so that target-specific code
for the aarch64 and rs6000 targets can use it.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.

2024-03-13  Ajit Kumar Agarwal  

gcc/ChangeLog:

* pair-fusion-base.h: Generic header code for load store fusion
that can be shared across different architectures.
* pair-fusion-common.cc: Generic source code for load store
fusion that can be shared across different architectures.
* pair-fusion.cc: Generic implementation of pair_fusion class
defined in pair-fusion-base.h.
* Makefile.in: Add new object files pair-fusion.o and
pair-fusion-common.o.
---
 gcc/Makefile.in   |2 +
 gcc/pair-fusion-base.h|  613 ++
 gcc/pair-fusion-common.cc | 1200 
 gcc/pair-fusion.cc| 1230 +
 4 files changed, 3045 insertions(+)
 create mode 100644 gcc/pair-fusion-base.h
 create mode 100644 gcc/pair-fusion-common.cc
 create mode 100644 gcc/pair-fusion.cc

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a74761b7ab3..df5061ddfe7 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1563,6 +1563,8 @@ OBJS = \
ipa-strub.o \
ipa.o \
ira.o \
+   pair-fusion-common.o \
+   pair-fusion.o \
ira-build.o \
ira-costs.o \
ira-conflicts.o \
diff --git a/gcc/pair-fusion-base.h b/gcc/pair-fusion-base.h
new file mode 100644
index 000..0d9b5db12be
--- /dev/null
+++ b/gcc/pair-fusion-base.h
@@ -0,0 +1,613 @@
+// Generic code for Pair MEM  fusion optimization pass.
+// Copyright (C) 2024 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it
+// under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 3, or (at your option)
+// any later version.
+//
+// GCC is distributed in the hope that it will be useful, but
+// WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+// General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+#ifndef GCC_PAIR_FUSION_H
+#define GCC_PAIR_FUSION_H
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-iter.h"
+#include "rtl-ssa.h"
+#include "cfgcleanup.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "tree-dfa.h"
+#include "fold-const.h"
+#include "tree-hash-traits.h"
+#include "print-tree.h"
+#include "insn-attr.h"
+using namespace rtl_ssa;
+// We pack these fields (load_p, fpsimd_p, and size) into an integer
+// (LFS) which we use as part of the key into the main hash tables.
+//
+// The idea is that we group candidates together only if they agree on
+// the fields below.  Candidates that disagree on any of these
+// properties shouldn't be merged together.
+struct lfs_fields
+{
+  bool load_p;
+  bool fpsimd_p;
+  unsigned size;
+};
+
+using insn_list_t = std::list<insn_info *>;
+using insn_iter_t = insn_list_t::iterator;
+
+// Information about the accesses at a given offset from a particular
+// base.  Stored in an access_group, see below.
+struct access_record
+{
+  poly_int64 offset;
+  std::list<insn_info *> cand_insns;
+  std::list<access_record>::iterator place;
+
+  access_record (poly_int64 off) : offset (off) {}
+};
+
+// A group of accesses where adjacent accesses could be ldp/stp
+// candidates.  The splay tree supports efficient insertion,
+// while the list supports efficient iteration.
+struct access_group
+{
+  splay_tree<access_record *> tree;
+  std::list<access_record> list;
+
+  template<typename Alloc>
+  inline void track (Alloc node_alloc, poly_int64 offset, insn_info

[PATCH V2 1/1] rs6000: Load store fusion for rs6000 target using common infrastructure

2024-03-13 Thread Ajit Agarwal
Hello All:

This adds common infrastructure, built from generic code, for load/store
fusion on the rs6000 target.

The generic code is implemented and defined so that target-specific code
for the aarch64 and rs6000 targets can use it.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

rs6000: Load store fusion for rs6000 target using common infrastructure

Common infrastructure, built from generic code, for load/store fusion on
the rs6000 target.

The generic code is implemented and defined so that target-specific code
for the aarch64 and rs6000 targets can use it.

The generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

It uses pure virtual functions to interface with the target code.

The target-specific code, which uses the generic code, is added in
rs6000-mem-fusion.cc.

2024-03-13  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: New mem fusion pass
before pass_early_remat.
* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
Add target specific implementation using pure virtual
functions.
* config.gcc: Add new object file.
* config/rs6000/rs6000-protos.h: Add new prototype for mem
fusion pass.
* config/rs6000/rs6000.cc: Add new prototype for mem fusion
pass.
* config/rs6000/t-rs6000: Add new rule.
* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
from private.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/mem-fusion.C: New test.
* g++.target/powerpc/mem-fusion-1.C: New test.
* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc|   2 +
 gcc/config/rs6000/rs6000-mem-fusion.cc| 697 ++
 gcc/config/rs6000/rs6000-passes.def   |   4 +-
 gcc/config/rs6000/rs6000-protos.h |   1 +
 gcc/config/rs6000/rs6000.cc   |   1 +
 gcc/config/rs6000/t-rs6000|   5 +
 gcc/rtl-ssa/accesses.h|   2 +-
 .../g++.target/powerpc/mem-fusion-1.C |  22 +
 gcc/testsuite/g++.target/powerpc/mem-fusion.C |  15 +
 .../gcc.target/powerpc/mma-builtin-1.c|   4 +-
 10 files changed, 749 insertions(+), 4 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-mem-fusion.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 624e0dae191..52ecd66dcc6 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -522,6 +522,7 @@ powerpc*-*-*)
extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
+   extra_objs="${extra_objs} rs6000-mem-fusion.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
@@ -558,6 +559,7 @@ rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt 
rs6000/rs6000-tables.opt"
extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
+   extra_objs="${extra_objs} rs6000-mem-fusion.o"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-logue.cc 
\$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
;;
diff --git a/gcc/config/rs6000/rs6000-mem-fusion.cc 
b/gcc/config/rs6000/rs6000-mem-fusion.cc
new file mode 100644
index 000..3522582f6fb
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-mem-fusion.cc
@@ -0,0 +1,697 @@
+/* Subroutines used to replace lxv with lxvp
+   for TARGET_POWER10 and TARGET_VSX,
+
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+#define INCLUDE_ALGORITH

[PING^0][PATCH V3 0/2] aarch64: Place target independent and dependent changed and unchanged code in one file.

2024-03-18 Thread Ajit Agarwal
Hello Richard/Alex:

Ping!

Please reply.

Thanks & Regards
Ajit

On 27/02/24 12:33 pm, Ajit Agarwal wrote:
> Hello Richard/Alex:
> 
> This patch has a better diff of the changed and unchanged code.
> The unchanged code and some of the changed code will be extracted
> into target independent headers and sources, while the target
> dependent code, changed and unchanged, will stay in a target
> dependent file such as aarch64-ldp-fusion.
> 
> Please review.
> 
> Thanks & Regards
> Ajit
> 
> On 23/02/24 4:41 pm, Ajit Agarwal wrote:
>> Hello Richard/Alex/Segher:
>>
>> This patch adds the changed code for target independent and
>> dependent code for load store fusion.
>>
>> Common infrastructure of load store pair fusion is
>> divided into target independent and target dependent
>> changed code.
>>
>> Target independent code is the generic code with
>> pure virtual functions to interface between target
>> independent and dependent code.
>>
>> Target dependent code is the implementation of pure
>> virtual function for aarch64 target and the call
>> to target independent code.
>>
>> Bootstrapped for aarch64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>> aarch64: Place target independent and dependent changed code in one file.
>>
>> Common infrastructure of load store pair fusion is
>> divided into target independent and target dependent
>> changed code.
>>
>> Target independent code is the generic code with
>> pure virtual functions to interface between target
>> independent and dependent code.
>>
>> Target dependent code is the implementation of pure
>> virtual function for aarch64 target and the call
>> to target independent code.
>>
>> 2024-02-23  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-ldp-fusion.cc: Place target
>>  independent and dependent changed code.
>> ---
>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 437 ---
>>  1 file changed, 305 insertions(+), 132 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
>> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> index 22ed95eb743..2ef22ff1e96 100644
>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> @@ -40,10 +40,10 @@
>>  
>>  using namespace rtl_ssa;
>>  
>> -static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
>> -static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
>> -static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
>> -static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << 
>> (PAIR_MEM_IMM_BITS - 1));
>> +static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;
>>  
>>  // We pack these fields (load_p, fpsimd_p, and size) into an integer
>>  // (LFS) which we use as part of the key into the main hash tables.
>> @@ -138,8 +138,18 @@ struct alt_base
>>poly_int64 offset;
>>  };
>>  
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> +  virtual bool conflict_p (int &budget) const = 0;
>> +  virtual insn_info *insn () const = 0;
>> +  virtual bool valid () const  = 0;
>> +  virtual void advance () = 0;
>> +};
>> +
>> +
>>  // State used by the pass for a given basic block.
>> -struct ldp_bb_info
>> +struct pair_fusion
>>  {
>>using def_hash = nofree_ptr_hash<def_info>;
>>using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
>> @@ -161,13 +171,13 @@ struct ldp_bb_info
>>static const size_t obstack_alignment = sizeof (void *);
>>bb_info *m_bb;
>>  
>> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>> +  pair_fusion (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>>{
>>  obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>>  obstack_alignment, obstack_chunk_alloc,
>>  obstack_chunk_free);
>>}
>> -  ~ldp_bb_info ()
>> +  ~pair_fusion ()
>>{
>>  obstack_free (&m_obstack, nullptr);
>>  
>> @@ -177,10 +187,50 @@ struct ldp_bb_info
>>  bitmap_obstack_release (&m_bitmap_obstack);
>>}
>>}
>> +  

[PATCH] rs6000: Stackoverflow in optimized code on PPC (PR100799)

2024-03-22 Thread Ajit Agarwal
Hello All:


When using FlexiBLAS with OpenBLAS we noticed corruption of
the parameters passed to OpenBLAS functions. FlexiBLAS
basically provides a BLAS interface where each function
is a stub that forwards the arguments to a real BLAS lib,
like OpenBLAS.

This patch fixes the corruption of the caller's frame by checking whether
the number of arguments, excluding unused hidden DECLs, is less than or
equal to GP_ARG_NUM_REG (8).

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Stackoverflow in optimized code on PPC (PR100799)

When using FlexiBLAS with OpenBLAS we noticed corruption of
the parameters passed to OpenBLAS functions. FlexiBLAS
basically provides a BLAS interface where each function
is a stub that forwards the arguments to a real BLAS lib,
like OpenBLAS.

This patch fixes the corruption of the caller's frame by checking whether
the number of arguments, excluding unused hidden DECLs, is less than or
equal to GP_ARG_NUM_REG (8).

2024-03-22  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR rtk-optimization/100799
* config/rs600/rs600-calls.cc (rs6000_function_arg): Don't
generate parameter save area if number of arguments passed
less than equal to GP_ARG_NUM_REG (8) excluding hidden
paramter.
* function.cc (assign_parms_initialize_all): Check for hidden
parameter in fortran code and set the flag hidden_string_length
and actual paramter passed excluding hidden unused DECLS.
* function.h: Add new field hidden_string_length and
actual_parm_length in function structure.
---
 gcc/config/rs6000/rs6000-call.cc | 11 ++-
 gcc/function.cc  | 26 ++
 gcc/function.h   | 10 ++
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
index 1f8f93a2ee7..8e6e3de6804 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -1857,7 +1857,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
 
  return rs6000_finish_function_arg (mode, rvec, k);
}
-  else if (align_words < GP_ARG_NUM_REG)
+ /* Workaround buggy C/C++ wrappers around Fortran routines with
+   character(len=constant) arguments if the hidden string length arguments
+   are passed on the stack; if the callers forget to pass those arguments,
+   attempting to tail call in such routines leads to stack corruption.
+   Avoid return stack space for parameters <= 8 excluding hidden string
+   length argument is passed (partially or fully) on the stack in the
+   caller and the callee needs to pass any arguments on the stack.  */
+  else if (align_words < GP_ARG_NUM_REG
+  || (cfun->hidden_string_length
+  && cfun->actual_parm_length <= GP_ARG_NUM_REG))
{
  if (TARGET_32BIT && TARGET_POWERPC64)
return rs6000_mixed_function_arg (mode, type, align_words);
diff --git a/gcc/function.cc b/gcc/function.cc
index 3cef6c17bce..1318564b466 100644
--- a/gcc/function.cc
+++ b/gcc/function.cc
@@ -2326,6 +2326,32 @@ assign_parms_initialize_all (struct assign_parm_data_all *all)
 #endif
   all->args_so_far = pack_cumulative_args (&all->args_so_far_v);
 
+  unsigned int num_args = 0;
+  unsigned int hidden_length = 0;
+
+  /* Workaround buggy C/C++ wrappers around Fortran routines with
+ character(len=constant) arguments if the hidden string length arguments
+ are passed on the stack; if the callers forget to pass those arguments,
+ attempting to tail call in such routines leads to stack corruption.
+ Avoid return stack space for parameters <= 8 excluding hidden string
+ length argument is passed (partially or fully) on the stack in the
+ caller and the callee needs to pass any arguments on the stack.  */
+  for (tree arg = DECL_ARGUMENTS (current_function_decl);
+   arg; arg = DECL_CHAIN (arg))
+{
+  num_args++;
+  if (DECL_HIDDEN_STRING_LENGTH (arg))
+   {
+ tree parmdef = ssa_default_def (cfun, arg);
+ if (parmdef == NULL || has_zero_uses (parmdef))
+   {
+ cfun->hidden_string_length = 1;
+ hidden_length++;
+   }
+   }
+   }
+
+  cfun->actual_parm_length = num_args - hidden_length;
 #ifdef INCOMING_REG_PARM_STACK_SPACE
   all->reg_parm_stack_space
 = INCOMING_REG_PARM_STACK_SPACE (current_function_decl);
diff --git a/gcc/function.h b/gcc/function.h
index 19e15bd63b0..5984f0007c2 100644
--- a/gcc/function.h
+++ b/gcc/function.h
@@ -346,6 +346,11 @@ struct GTY(()) function {
   /* Last assigned dependence info clique.  */
   unsigned short last_clique;
 
+  /* Actual parameter length ignoring hidden paramter.
+ This is done to C++ wrapper calling fortran module
+ which has hidden parameter that are not used.  */
+  unsigned int actual_parm_length;
+
   /* Collected bit flags.  */
 
   /* Number of units of general register

[PATCH v1] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-03-22 Thread Ajit Agarwal
Hello Jakub:

When using FlexiBLAS with OpenBLAS we noticed corruption of
the parameters passed to OpenBLAS functions. FlexiBLAS
basically provides a BLAS interface where each function
is a stub that forwards the arguments to a real BLAS lib,
like OpenBLAS.

This patch fixes the corruption of the caller's frame by checking whether
the number of arguments, excluding unused hidden DECLs, is less than or
equal to GP_ARG_NUM_REG (8).

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Stackoverflow in optimized code on PPC [PR100799]

When using FlexiBLAS with OpenBLAS we noticed corruption of
the parameters passed to OpenBLAS functions. FlexiBLAS
basically provides a BLAS interface where each function
is a stub that forwards the arguments to a real BLAS lib,
like OpenBLAS.

This patch fixes the corruption of the caller's frame by checking whether
the number of arguments, excluding unused hidden DECLs, is less than or
equal to GP_ARG_NUM_REG (8).

2024-03-22  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR rtk-optimization/100799
* config/rs6000/rs6000-calls.cc (rs6000_function_arg): Don't
generate parameter save area if number of arguments passed
less than equal to GP_ARG_NUM_REG (8) excluding hidden
parameter.
(init_cumulative_args): Check for hidden parameter in fortran
routine and set the flag hidden_string_length and actual
parameter passed excluding hidden unused DECLS.
* config/rs6000/rs6000.h (rs6000_args): Add new field
hidden_string_length and actual_parm_length.
---
 gcc/config/rs6000/rs6000-call.cc | 40 ++--
 gcc/config/rs6000/rs6000.h   |  8 +++
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
index 1f8f93a2ee7..2620ce16943 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -64,7 +64,7 @@
 #include "ppc-auxv.h"
 #include "targhooks.h"
 #include "opts.h"
-
+#include "tree-dfa.h"
 #include "rs6000-internal.h"
 
 #ifndef TARGET_PROFILE_KERNEL
@@ -584,6 +584,33 @@ init_cumulative_args (CUMULATIVE_ARGS *cum, tree fntype,
   if (incoming || cum->prototype)
 cum->nargs_prototype = n_named_args;
 
+  /* Workaround buggy C/C++ wrappers around Fortran routines with
+ character(len=constant) arguments if the hidden string length arguments
+ are passed on the stack; if the callers forget to pass those arguments,
+ attempting to tail call in such routines leads to stack corruption.
+ Avoid return stack space for parameters <= 8 excluding hidden string
+ length argument is passed (partially or fully) on the stack in the
+ caller and the callee needs to pass any arguments on the stack.  */
+  unsigned int num_args = 0;
+  unsigned int hidden_length = 0;
+
+  for (tree arg = DECL_ARGUMENTS (current_function_decl);
+   arg; arg = DECL_CHAIN (arg))
+{
+  num_args++;
+  if (DECL_HIDDEN_STRING_LENGTH (arg))
+   {
+ tree parmdef = ssa_default_def (cfun, arg);
+ if (parmdef == NULL || has_zero_uses (parmdef))
+   {
+ cum->hidden_string_length = 1;
+ hidden_length++;
+   }
+   }
+   }
+
+  cum->actual_parm_length = num_args - hidden_length;
+
   /* Check for a longcall attribute.  */
   if ((!fntype && rs6000_default_long_calls)
   || (fntype
@@ -1857,7 +1884,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
 
  return rs6000_finish_function_arg (mode, rvec, k);
}
-  else if (align_words < GP_ARG_NUM_REG)
+ /* Workaround buggy C/C++ wrappers around Fortran routines with
+   character(len=constant) arguments if the hidden string length arguments
+   are passed on the stack; if the callers forget to pass those arguments,
+   attempting to tail call in such routines leads to stack corruption.
+   Avoid return stack space for parameters <= 8 excluding hidden string
+   length argument is passed (partially or fully) on the stack in the
+   caller and the callee needs to pass any arguments on the stack.  */
+  else if (align_words < GP_ARG_NUM_REG
+  || (cum->hidden_string_length
+  && cum->actual_parm_length <= GP_ARG_NUM_REG))
{
  if (TARGET_32BIT && TARGET_POWERPC64)
return rs6000_mixed_function_arg (mode, type, align_words);
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 68bc45d65ba..a1d3ed00b14 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -1490,6 +1490,14 @@ typedef struct rs6000_args
   int named;   /* false for varargs params */
   int escapes; /* if function visible outside tu */
   int libcall; /* If this is a compiler generated call.  */
+  /* Actual parameter length ignoring hidden paramter.
+ This is done to C++ wrapper calling fortran module
+ which has hidden parameter 

Re: [PATCH] rs6000: Stackoverflow in optimized code on PPC (PR100799)

2024-03-22 Thread Ajit Agarwal
Hello Jakub:

I have addressed the comments below and sent version 1 of the patch
for review.

Thanks & Regards
Ajit

On 22/03/24 1:15 pm, Jakub Jelinek wrote:
> On Fri, Mar 22, 2024 at 01:00:21PM +0530, Ajit Agarwal wrote:
>> When using FlexiBLAS with OpenBLAS we noticed corruption of
>> the parameters passed to OpenBLAS functions. FlexiBLAS
>> basically provides a BLAS interface where each function
>> is a stub that forwards the arguments to a real BLAS lib,
>> like OpenBLAS.
>>
>> Fixes the corruption of caller frame checking number of
>> arguments is less than equal to GP_ARG_NUM_REG (8)
>> excluding hidden unused DECLS.
> 
> Thanks for working on this.
> 
>> 2024-03-22  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>> PR rtk-optimization/100799
>> * config/rs600/rs600-calls.cc (rs6000_function_arg): Don't
> 
> These 2 lines are 8 space indented rather than tab.
> 
>>  generate parameter save area if number of arguments passed
>>  less than equal to GP_ARG_NUM_REG (8) excluding hidden
>>  paramter.
>>  * function.cc (assign_parms_initialize_all): Check for hidden
>>  parameter in fortran code and set the flag hidden_string_length
>>  and actual paramter passed excluding hidden unused DECLS.
> 
> s/paramter/parameter/
> 
>>  * function.h: Add new field hidden_string_length and
>>  actual_parm_length in function structure.
> 
> Why do you need to change generic code for something that will only be
> used by a single target?
> I mean, why don't you add the extra members in rs6000.h (struct rs6000_args)
> and initialize them in rs6000-call.cc (init_cumulative_args) -
> the function.cc function you've modified is the only one which uses
> INIT_CUMULATIVE_INCOMING_ARGS and in that case init_cumulative_args is
> called with incoming == true, so move the stuff from function.cc there.
> 
>> --- a/gcc/config/rs6000/rs6000-call.cc
>> +++ b/gcc/config/rs6000/rs6000-call.cc
>> @@ -1857,7 +1857,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
>>  
>>return rs6000_finish_function_arg (mode, rvec, k);
>>  }
>> -  else if (align_words < GP_ARG_NUM_REG)
>> + /* Workaround buggy C/C++ wrappers around Fortran routines with
>> +character(len=constant) arguments if the hidden string length arguments
>> +are passed on the stack; if the callers forget to pass those arguments,
>> +attempting to tail call in such routines leads to stack corruption.
>> +Avoid return stack space for parameters <= 8 excluding hidden string
>> +length argument is passed (partially or fully) on the stack in the
>> +caller and the callee needs to pass any arguments on the stack.  */
>> +  else if (align_words < GP_ARG_NUM_REG
>> +   || (cfun->hidden_string_length
>> +   && cfun->actual_parm_length <= GP_ARG_NUM_REG))
>>  {
>>if (TARGET_32BIT && TARGET_POWERPC64)
>>  return rs6000_mixed_function_arg (mode, type, align_words);
>> diff --git a/gcc/function.cc b/gcc/function.cc
>> index 3cef6c17bce..1318564b466 100644
>> --- a/gcc/function.cc
>> +++ b/gcc/function.cc
>> @@ -2326,6 +2326,32 @@ assign_parms_initialize_all (struct assign_parm_data_all *all)
>>  #endif
>>all->args_so_far = pack_cumulative_args (&all->args_so_far_v);
>>  
>> +  unsigned int num_args = 0;
>> +  unsigned int hidden_length = 0;
>> +
>> +  /* Workaround buggy C/C++ wrappers around Fortran routines with
>> + character(len=constant) arguments if the hidden string length arguments
>> + are passed on the stack; if the callers forget to pass those arguments,
>> + attempting to tail call in such routines leads to stack corruption.
>> + Avoid return stack space for parameters <= 8 excluding hidden string
>> + length argument is passed (partially or fully) on the stack in the
>> + caller and the callee needs to pass any arguments on the stack.  */
>> +  for (tree arg = DECL_ARGUMENTS (current_function_decl);
>> +   arg; arg = DECL_CHAIN (arg))
>> +{
>> +  num_args++;
>> +  if (DECL_HIDDEN_STRING_LENGTH (arg))
>> +{
>> +  tree parmdef = ssa_default_def (cfun, arg);
>> +  if (parmdef == NULL || has_zero_uses (parmdef))
>> +{
>> +  cfun->hidden_string_length = 1;
>> +  hidden_length++;
>> +}
>> +}
>> +   }
>> +
>> +  cfun->actual_parm_length

[PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-03-22 Thread Ajit Agarwal
Hello All:

This is version-2 of the patch with review comments addressed.

When using FlexiBLAS with OpenBLAS we noticed corruption of
the parameters passed to OpenBLAS functions. FlexiBLAS
basically provides a BLAS interface where each function
is a stub that forwards the arguments to a real BLAS lib,
like OpenBLAS.

This patch fixes the corruption of the caller's frame by checking whether
the number of arguments, excluding unused hidden DECLs, is less than or
equal to GP_ARG_NUM_REG (8).

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Stackoverflow in optimized code on PPC [PR100799]

When using FlexiBLAS with OpenBLAS we noticed corruption of
the parameters passed to OpenBLAS functions. FlexiBLAS
basically provides a BLAS interface where each function
is a stub that forwards the arguments to a real BLAS lib,
like OpenBLAS.

This patch fixes the corruption of the caller's frame by checking whether
the number of arguments, excluding unused hidden DECLs, is less than or
equal to GP_ARG_NUM_REG (8).

2024-03-22  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR rtk-optimization/100799
* config/rs6000/rs6000-calls.cc (rs6000_function_arg): Don't
generate parameter save area if number of arguments passed
less than equal to GP_ARG_NUM_REG (8) excluding hidden
parameter.
(init_cumulative_args): Check for hidden parameter in fortran
routine and set the flag hidden_string_length and actual
parameter passed excluding hidden unused DECLS.
* config/rs6000/rs6000.h (rs6000_args): Add new field
hidden_string_length and actual_parm_length.
---
 gcc/config/rs6000/rs6000-call.cc | 36 ++--
 gcc/config/rs6000/rs6000.h   |  7 +++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
index 1f8f93a2ee7..fd823c66ea2 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -64,7 +64,7 @@
 #include "ppc-auxv.h"
 #include "targhooks.h"
 #include "opts.h"
-
+#include "tree-dfa.h"
 #include "rs6000-internal.h"
 
 #ifndef TARGET_PROFILE_KERNEL
@@ -584,6 +584,31 @@ init_cumulative_args (CUMULATIVE_ARGS *cum, tree fntype,
   if (incoming || cum->prototype)
 cum->nargs_prototype = n_named_args;
 
+  /* When the buggy C/C++ wrappers call the function with fewer arguments
+ than it actually has and doesn't expect the parameter save area on the
+ caller side because of that while the callee expects it and the callee
+ actually stores something in the parameter save area, it corrupts
+ whatever is in the caller stack frame at that location.  */
+  unsigned int num_args = 0;
+  unsigned int hidden_length = 0;
+
+  for (tree arg = DECL_ARGUMENTS (current_function_decl);
+   arg; arg = DECL_CHAIN (arg))
+{
+  num_args++;
+  if (DECL_HIDDEN_STRING_LENGTH (arg))
+   {
+ tree parmdef = ssa_default_def (cfun, arg);
+ if (parmdef == NULL || has_zero_uses (parmdef))
+   {
+ cum->hidden_string_length = 1;
+ hidden_length++;
+   }
+   }
+   }
+
+  cum->actual_parm_length = num_args - hidden_length;
+
   /* Check for a longcall attribute.  */
   if ((!fntype && rs6000_default_long_calls)
   || (fntype
@@ -1857,7 +1882,14 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
 
  return rs6000_finish_function_arg (mode, rvec, k);
}
-  else if (align_words < GP_ARG_NUM_REG)
+ /* When the buggy C/C++ wrappers call the function with fewer arguments
+   than it actually has and doesn't expect the parameter save area on the
+   caller side because of that while the callee expects it and the callee
+   actually stores something in the parameter save area, it corrupts
+   whatever is in the caller stack frame at that location.  */
+  else if (align_words < GP_ARG_NUM_REG
+  || (cum->hidden_string_length
+  && cum->actual_parm_length <= GP_ARG_NUM_REG))
{
  if (TARGET_32BIT && TARGET_POWERPC64)
return rs6000_mixed_function_arg (mode, type, align_words);
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 68bc45d65ba..60f23f33879 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -1490,6 +1490,13 @@ typedef struct rs6000_args
   int named;   /* false for varargs params */
   int escapes; /* if function visible outside tu */
   int libcall; /* If this is a compiler generated call.  */
+  /* Actual parameter length ignoring hidden parameter.
+ This is done to C++ wrapper calling fortran procedures
+ which has hidden parameter that are not used.  */
+  unsigned int actual_parm_length;
+  /* Set if there is hidden parameters while calling C++ wrapper to
+ fortran procedure.  */
+  unsigned int hidden_string_length : 1;
 } CUMULATIVE_ARGS;
 
 /* Initialize a va

Re: [PATCH v1] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-03-22 Thread Ajit Agarwal
Hello Jakub:

Thanks for the review. I have addressed the review comments below and sent
version 2 of the patch for review.

Thanks & Regards
Ajit

On 22/03/24 3:06 pm, Jakub Jelinek wrote:
> On Fri, Mar 22, 2024 at 02:55:43PM +0530, Ajit Agarwal wrote:
>> rs6000: Stackoverflow in optimized code on PPC [PR100799]
>>
>> When using FlexiBLAS with OpenBLAS we noticed corruption of
>> the parameters passed to OpenBLAS functions. FlexiBLAS
>> basically provides a BLAS interface where each function
>> is a stub that forwards the arguments to a real BLAS lib,
>> like OpenBLAS.
>>
>> Fixes the corruption of caller frame checking number of
>> arguments is less than equal to GP_ARG_NUM_REG (8)
>> excluding hidden unused DECLS.
> 
> Looks mostly good to me except some comment nits, but I'll defer
> the actual ack to the rs6000 maintainers.
> 
>> +  /* Workaround buggy C/C++ wrappers around Fortran routines with
>> + character(len=constant) arguments if the hidden string length arguments
>> + are passed on the stack; if the callers forget to pass those arguments,
>> + attempting to tail call in such routines leads to stack corruption.
> 
> I thought it isn't just tail calls, even normal calls.
> When the buggy C/C++ wrappers call the function with fewer arguments
> than it actually has and doesn't expect the parameter save area on the
> caller side because of that while the callee expects it and the callee
> actually stores something in the parameter save area, it corrupts whatever
> is in the caller stack frame at that location.
> 
>> + Avoid return stack space for parameters <= 8 excluding hidden string
>> + length argument is passed (partially or fully) on the stack in the
>> + caller and the callee needs to pass any arguments on the stack.  */
>> +  unsigned int num_args = 0;
>> +  unsigned int hidden_length = 0;
>> +
>> +  for (tree arg = DECL_ARGUMENTS (current_function_decl);
>> +   arg; arg = DECL_CHAIN (arg))
>> +{
>> +  num_args++;
>> +  if (DECL_HIDDEN_STRING_LENGTH (arg))
>> +{
>> +  tree parmdef = ssa_default_def (cfun, arg);
>> +  if (parmdef == NULL || has_zero_uses (parmdef))
>> +{
>> +  cum->hidden_string_length = 1;
>> +  hidden_length++;
>> +}
>> +}
>> +   }
>> +
>> +  cum->actual_parm_length = num_args - hidden_length;
>> +
>>/* Check for a longcall attribute.  */
>>if ((!fntype && rs6000_default_long_calls)
>>|| (fntype
>> @@ -1857,7 +1884,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
>>  
>>return rs6000_finish_function_arg (mode, rvec, k);
>>  }
>> -  else if (align_words < GP_ARG_NUM_REG)
>> + /* Workaround buggy C/C++ wrappers around Fortran routines with
>> +character(len=constant) arguments if the hidden string length arguments
>> +are passed on the stack; if the callers forget to pass those arguments,
>> +attempting to tail call in such routines leads to stack corruption.
>> +Avoid return stack space for parameters <= 8 excluding hidden string
>> +length argument is passed (partially or fully) on the stack in the
>> +caller and the callee needs to pass any arguments on the stack.  */
>> +  else if (align_words < GP_ARG_NUM_REG
>> +   || (cum->hidden_string_length
>> +   && cum->actual_parm_length <= GP_ARG_NUM_REG))
>>  {
>>if (TARGET_32BIT && TARGET_POWERPC64)
>>  return rs6000_mixed_function_arg (mode, type, align_words);
>> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
>> index 68bc45d65ba..a1d3ed00b14 100644
>> --- a/gcc/config/rs6000/rs6000.h
>> +++ b/gcc/config/rs6000/rs6000.h
>> @@ -1490,6 +1490,14 @@ typedef struct rs6000_args
>>int named;/* false for varargs params */
>>int escapes;  /* if function visible outside tu */
>>int libcall;  /* If this is a compiler generated call.  */
>> +  /* Actual parameter length ignoring hidden paramter.
> 
> s/paramter/parameter/
> 
>> + This is done to C++ wrapper calling fortran module
>> + which has hidden parameter that are not used.  */
>> +  unsigned int actual_parm_length;
>> +  /* Hidden parameters while calling C++ wrapper to fortran
>> + module. Set if there is hidden parameter in fortran
>> + module while called C++ wrapper.  */
> 
> modules in Fortran are something completely different.
> You should IMHO talk about procedures instead of modules
> in both of the above comments (multiple times even).
> 
>> +  unsigned int hidden_string_length : 1;
>>  } CUMULATIVE_ARGS;
>>  
>>  /* Initialize a variable CUM of type CUMULATIVE_ARGS
>> -- 
>> 2.39.3
> 
>   Jakub
> 


Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-03-23 Thread Ajit Agarwal
Hello Peter:

On 23/03/24 10:07 am, Peter Bergner wrote:
> On 3/22/24 5:15 AM, Ajit Agarwal wrote:
>> When using FlexiBLAS with OpenBLAS we noticed corruption of
>> the parameters passed to OpenBLAS functions. FlexiBLAS
>> basically provides a BLAS interface where each function
>> is a stub that forwards the arguments to a real BLAS lib,
>> like OpenBLAS.
>>
>> Fixes the corruption of caller frame checking number of
>> arguments is less than equal to GP_ARG_NUM_REG (8)
>> excluding hidden unused DECLS.
> 
> I think the git log entry commentary could be a little more descriptive
> of what the problem is. How about something like the following?
> 
>   When using FlexiBLAS with OpenBLAS, we noticed corruption of the caller
>   stack frame when calling OpenBLAS functions.  This was caused by the
>   FlexiBLAS C/C++ caller and OpenBLAS Fortran callee disagreeing on the
>   number of function parameters in the callee due to hidden Fortran
>   parameters. This can cause problems when the callee believes the caller
>   has allocated a parameter save area when the caller has not done so.
>   That means any writes by the callee into the non-existent parameter save
>   area will corrupt the caller stack frame.
> 
>   The workaround implemented here, is for the callee to determine whether
>   the caller has allocated a parameter save area or not, by ignoring any
>   unused hidden parameters when counting the number of parameters.
> 
> 
I will address this change in the new version of the patch.
> 
>>  PR rtk-optimization/100799
> 
> s/rtk/rtl/
> 
> 
> 
I will address this in the new version of the patch.
>>  * config/rs6000/rs6000-calls.cc (rs6000_function_arg): Don't
>>  generate parameter save area if number of arguments passed
>>  less than equal to GP_ARG_NUM_REG (8) excluding hidden
>>  parameter.
> 
> The callee doesn't generate or allocate the parameter save area, the
> caller does.  The code here is for the callee trying to determine
> whether the caller has done so.  How about saying the following instead?
> 
>   Don't assume a parameter save area has been allocated if the number of
>   formal parameters, excluding unused hidden parameters, is less than or
>   equal to GP_ARG_NUM_REG (8).
> 
> 

I will incorporate this change in the new version of the patch.
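
As a rough illustration of the rule in the suggested ChangeLog wording above, the condition added to rs6000_function_arg can be modelled on its own. This is only a sketch under stated assumptions: `cum_model` is a simplified stand-in for the two fields the patch adds to rs6000_args, `GP_ARG_NUM_REG` is hard-coded to the value 8 quoted in the ChangeLog, and `treat_as_register_arg` is a hypothetical name, not a GCC function.

```cpp
#include <cassert>

// Assumed value, taken from the "(8)" in the ChangeLog text.
constexpr unsigned GP_ARG_NUM_REG = 8;

// Simplified stand-in for the two fields the patch adds to rs6000_args.
struct cum_model
{
  unsigned actual_parm_length;  // parameters, minus unused hidden ones
  bool hidden_string_length;    // set if any unused hidden length was seen
};

// Hypothetical model of the patched "else if" condition: the argument is
// handled on the register path either because it still fits in the GPRs,
// or because the function has unused hidden string lengths and the
// remaining parameter count is at most GP_ARG_NUM_REG.
constexpr bool
treat_as_register_arg (unsigned align_words, const cum_model &cum)
{
  return align_words < GP_ARG_NUM_REG
         || (cum.hidden_string_length
             && cum.actual_parm_length <= GP_ARG_NUM_REG);
}
```

With eight ordinary parameters plus one unused hidden length, the ninth slot (align_words == 8) takes the register path in this model, which is the case the patch targets.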

> 
> 
>>  (init_cumulative_args): Check for hidden parameter in fortran
>>  routine and set the flag hidden_string_length and actual
>>  parameter passed excluding hidden unused DECLS.
> 
> Check for unused hidden Fortran parameters and set hidden_string_length
> and actual_parm_length.
> 
>

I will address this change in the new version of the patch.

 
>> +  /* When the buggy C/C++ wrappers call the function with fewer arguments
>> + than it actually has and doesn't expect the parameter save area on the
>> + caller side because of that while the callee expects it and the callee
>> + actually stores something in the parameter save area, it corrupts
>> + whatever is in the caller stack frame at that location.  */
> 
> The wrapper/caller is the one that allocates the parameter save area, so
> saying "...doesn't expect the parameter save area on the caller side..."
> doesn't make sense, since it knows whether it allocated it or not.
> How about saying something like the following instead?
> 
>   Check whether this function contains any unused hidden parameters and
>   record how many there are for use in rs6000_function_arg() to determine
>   whether its callers have allocated a parameter save area or not.
>   See PR100799 for details.
> 
> 

I will incorporate this change in the new version of the patch.

> 
>> +  unsigned int num_args = 0;
>> +  unsigned int hidden_length = 0;
>> +
>> +  for (tree arg = DECL_ARGUMENTS (current_function_decl);
>> +   arg; arg = DECL_CHAIN (arg))
>> +{
>> +  num_args++;
>> +  if (DECL_HIDDEN_STRING_LENGTH (arg))
>> +{
>> +  tree parmdef = ssa_default_def (cfun, arg);
>> +  if (parmdef == NULL || has_zero_uses (parmdef))
>> +{
>> +  cum->hidden_string_length = 1;
>> +  hidden_length++;
>> +}
>> +}
>> +   }
>> +
>> +  cum->actual_parm_length = num_args - hidden_length;
> 
> This code looks fine, but do we really need two new fields in rs6000_args?
> Can't we just get along with only cum->actual_parm_length by modifying
> the rs6000_function_arg() change from:
> 
>> +  else if (align_words < GP_ARG_NUM_REG
>> +   || (cum->hidden_s

[PATCH v3] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-03-23 Thread Ajit Agarwal
Hello All:

When using FlexiBLAS with OpenBLAS, we noticed corruption of the caller
stack frame when calling OpenBLAS functions.  This was caused by the
FlexiBLAS C/C++ caller and OpenBLAS Fortran callee disagreeing on the
number of function parameters in the callee due to hidden Fortran
parameters. This can cause problems when the callee believes the caller
has allocated a parameter save area when the caller has not done so.
That means any writes by the callee into the non-existent parameter save
area will corrupt the caller stack frame.

The workaround implemented here is for the callee to determine whether
the caller has allocated a parameter save area or not, by ignoring any
unused hidden parameters when counting the number of parameters.
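
The counting step described above can be sketched outside the compiler as follows. This is an assumption-laden model, not the patch itself: `parm_model` replaces GCC's parameter DECLs, its `has_uses` flag stands in for the ssa_default_def / has_zero_uses check, and `summarize_parms` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one formal parameter.
struct parm_model
{
  bool hidden_string_length;  // models DECL_HIDDEN_STRING_LENGTH (arg)
  bool has_uses;              // false models "no default def or zero uses"
};

// Models the two values the patch records in rs6000_args.
struct parm_summary
{
  unsigned actual_parm_length;
  bool hidden_string_length;
};

// Count all parameters, then subtract the hidden string lengths that
// the function body never uses.
inline parm_summary
summarize_parms (const std::vector<parm_model> &parms)
{
  unsigned num_args = 0;
  unsigned hidden_length = 0;
  bool saw_unused_hidden = false;

  for (const parm_model &p : parms)
    {
      num_args++;
      if (p.hidden_string_length && !p.has_uses)
        {
          saw_unused_hidden = true;
          hidden_length++;
        }
    }

  return { num_args - hidden_length, saw_unused_hidden };
}
```

A hidden length that is actually used still counts as a real parameter in this model, so only the unused ones reduce the total.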

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Stackoverflow in optimized code on PPC [PR100799]

When using FlexiBLAS with OpenBLAS, we noticed corruption of the caller
stack frame when calling OpenBLAS functions.  This was caused by the
FlexiBLAS C/C++ caller and OpenBLAS Fortran callee disagreeing on the
number of function parameters in the callee due to hidden Fortran
parameters. This can cause problems when the callee believes the caller
has allocated a parameter save area when the caller has not done so.
That means any writes by the callee into the non-existent parameter save
area will corrupt the caller stack frame.

The workaround implemented here is for the callee to determine whether
the caller has allocated a parameter save area or not, by ignoring any
unused hidden parameters when counting the number of parameters.

2024-03-23  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR rtl-optimization/100799
* config/rs6000/rs6000-calls.cc (rs6000_function_arg): Don't
assume a parameter save area has been allocated if the number of
formal parameters, excluding unused hidden parameters, is less
than or equal to GP_ARG_NUM_REG (8).
(init_cumulative_args): Check for unused hidden Fortran
parameters and set hidden_string_length and actual_parm_length.
* config/rs6000/rs6000.h (rs6000_args): Add new field
hidden_string_length and actual_parm_length.
---
 gcc/config/rs6000/rs6000-call.cc | 38 ++--
 gcc/config/rs6000/rs6000.h   |  4 
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
index 1f8f93a2ee7..656735aebaf 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -64,7 +64,7 @@
 #include "ppc-auxv.h"
 #include "targhooks.h"
 #include "opts.h"
-
+#include "tree-dfa.h"
 #include "rs6000-internal.h"
 
 #ifndef TARGET_PROFILE_KERNEL
@@ -584,6 +584,32 @@ init_cumulative_args (CUMULATIVE_ARGS *cum, tree fntype,
   if (incoming || cum->prototype)
 cum->nargs_prototype = n_named_args;
 
+  /* When the buggy C/C++ wrappers call the function with fewer arguments
+ than it actually has. Check whether this function contains any unused
+ hidden parameters and record how many there are for use in
+ rs6000_function_arg() to determine whether its callers
+ have allocated a parameter save area or not. See PR100799 for
+ details.  */
+  unsigned int num_args = 0;
+  unsigned int hidden_length = 0;
+
+  for (tree arg = DECL_ARGUMENTS (current_function_decl);
+   arg; arg = DECL_CHAIN (arg))
+{
+  num_args++;
+  if (DECL_HIDDEN_STRING_LENGTH (arg))
+   {
+ tree parmdef = ssa_default_def (cfun, arg);
+ if (parmdef == NULL || has_zero_uses (parmdef))
+   {
+ cum->hidden_string_length = 1;
+ hidden_length++;
+   }
+   }
+   }
+
+  cum->actual_parm_length = num_args - hidden_length;
+
   /* Check for a longcall attribute.  */
   if ((!fntype && rs6000_default_long_calls)
   || (fntype
@@ -1857,7 +1883,15 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
 
  return rs6000_finish_function_arg (mode, rvec, k);
}
-  else if (align_words < GP_ARG_NUM_REG)
+ /* When the buggy C/C++ wrappers call the function with fewer arguments
+   than it actually has. Check whether this function contains any unused
+   hidden parameters and record how many there are for use in
+   rs6000_function_arg() to determine whether its callers
+   have allocated a parameter save area or not. See PR100799 for
+   details.  */
+  else if (align_words < GP_ARG_NUM_REG
+  || (cum->hidden_string_length
+  && cum->actual_parm_length <= GP_ARG_NUM_REG))
{
  if (TARGET_32BIT && TARGET_POWERPC64)
return rs6000_mixed_function_arg (mode, type, align_words);
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 68bc45d65ba..a8f91301852 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -1490,6 +1490,10

Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-03-23 Thread Ajit Agarwal
Hello Peter:

I have sent version 3 of the patch, which addresses the review comments below.

Thanks & Regards
Ajit

On 23/03/24 3:03 pm, Ajit Agarwal wrote:
> Hello Peter:
> 
> On 23/03/24 10:07 am, Peter Bergner wrote:
>> On 3/22/24 5:15 AM, Ajit Agarwal wrote:
>>> When using FlexiBLAS with OpenBLAS we noticed corruption of
>>> the parameters passed to OpenBLAS functions. FlexiBLAS
>>> basically provides a BLAS interface where each function
>>> is a stub that forwards the arguments to a real BLAS lib,
>>> like OpenBLAS.
>>>
>>> Fixes the corruption of caller frame checking number of
>>> arguments is less than equal to GP_ARG_NUM_REG (8)
>>> excluding hidden unused DECLS.
>>
>> I think the git log entry commentary could be a little more descriptive
>> of what the problem is. How about something like the following?
>>
>>   When using FlexiBLAS with OpenBLAS, we noticed corruption of the caller
>>   stack frame when calling OpenBLAS functions.  This was caused by the
>>   FlexiBLAS C/C++ caller and OpenBLAS Fortran callee disagreeing on the
>>   number of function parameters in the callee due to hidden Fortran
>>   parameters. This can cause problems when the callee believes the caller
>>   has allocated a parameter save area when the caller has not done so.
>>   That means any writes by the callee into the non-existent parameter save
>>   area will corrupt the caller stack frame.
>>
>>   The workaround implemented here is for the callee to determine whether
>>   the caller has allocated a parameter save area or not, by ignoring any
>>   unused hidden parameters when counting the number of parameters.
>>
>>
> I will address this change in the new version of the patch.
>>
>>> PR rtk-optimization/100799
>>
>> s/rtk/rtl/
>>
>>
>>
> I will address this in the new version of the patch.
>>> * config/rs6000/rs6000-calls.cc (rs6000_function_arg): Don't
>>> generate parameter save area if number of arguments passed
>>> less than equal to GP_ARG_NUM_REG (8) excluding hidden
>>> parameter.
>>
>> The callee doesn't generate or allocate the parameter save area, the
>> caller does.  The code here is for the callee trying to determine
>> whether the caller has done so.  How about saying the following instead?
>>
>>   Don't assume a parameter save area has been allocated if the number of
>>   formal parameters, excluding unused hidden parameters, is less than or
>>   equal to GP_ARG_NUM_REG (8).
>>
>>
> 
> I will incorporate this change in the new version of the patch.
> 
>>
>>
>>> (init_cumulative_args): Check for hidden parameter in fortran
>>> routine and set the flag hidden_string_length and actual
>>> parameter passed excluding hidden unused DECLS.
>>
>> Check for unused hidden Fortran parameters and set hidden_string_length
>> and actual_parm_length.
>>
>>
> 
> I will address this change in the new version of the patch.
> 
>  
>>> +  /* When the buggy C/C++ wrappers call the function with fewer arguments
>>> + than it actually has and doesn't expect the parameter save area on the
>>> + caller side because of that while the callee expects it and the callee
>>> + actually stores something in the parameter save area, it corrupts
>>> + whatever is in the caller stack frame at that location.  */
>>
>> The wrapper/caller is the one that allocates the parameter save area, so
>> saying "...doesn't expect the parameter save area on the caller side..."
>> doesn't make sense, since it knows whether it allocated it or not.
>> How about saying something like the following instead?
>>
>>   Check whether this function contains any unused hidden parameters and
>>   record how many there are for use in rs6000_function_arg() to determine
>>   whether its callers have allocated a parameter save area or not.
>>   See PR100799 for details.
>>
>>
> 
> I will incorporate this change in the new version of the patch.
> 
>>
>>> +  unsigned int num_args = 0;
>>> +  unsigned int hidden_length = 0;
>>> +
>>> +  for (tree arg = DECL_ARGUMENTS (current_function_decl);
>>> +   arg; arg = DECL_CHAIN (arg))
>>> +{
>>> +  num_args++;
>>> +  if (DECL_HIDDEN_STRING_LENGTH (arg))
>>> +   {
>>> + tree parmdef = ssa_default_def (cfun, arg);
>>> + if (parmdef == NULL || has_zero_uses (parmdef))
>>> + 

Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-03-23 Thread Ajit Agarwal



On 23/03/24 9:33 pm, Peter Bergner wrote:
> On 3/23/24 4:33 AM, Ajit Agarwal wrote:
>>>> -  else if (align_words < GP_ARG_NUM_REG)
>>>> +  else if (align_words < GP_ARG_NUM_REG
>>>> + || (cum->hidden_string_length
>>>> + && cum->actual_parm_length <= GP_ARG_NUM_REG))
>>> {
>>>   if (TARGET_32BIT && TARGET_POWERPC64)
>>> return rs6000_mixed_function_arg (mode, type, align_words);
>>>
>>>   return gen_rtx_REG (mode, GP_ARG_MIN_REG + align_words);
>>> }
>>>   else
>>> return NULL_RTX;
>>>
>>> The old code for the unused hidden parameter (which was the 9th param) would
>>> fall thru to the "return NULL_RTX;" which would make the callee assume there
>>> was a parameter save area allocated.  Now instead, we'll return a reg rtx,
>>> probably of r11 (r3 thru r10 are our param regs) and I'm guessing we'll now
>>> see a copy of r11 into a pseudo like we do for the other param regs.
>>> Is that a problem? Given it's an unused parameter, it'll probably get deleted
>>> as dead code, but could it cause any issues?  What if we have more than one
>>> unused hidden parameter and we return r12 and r13 which have specific uses
>>> in our ABIs (eg, r13 is our TCB pointer), so it may not actually look dead.
>>> Have you verified what the callee RTL looks like after expand for these
>>> unused hidden parameters?  Is there a rtx we can return that isn't a NULL_RTX
>>> which triggers the assumption of a parameter save area, but isn't a reg rtx
>>> which might lead to some rtl being generated?  Would a (const_int 0) or
>>> something else work?
>>>
>>>
>> For the above use case it will return (reg:DI 5 %r5), so in the check
>> below entry_parm = (reg:DI 5 %r5). That check will not return true, and
>> hence a parameter save area will not be allocated.
> 
> Why r5?!?!   The 8th (integer) param would return r10, so I'd assume if
> the next param was a hidden param, then it'd get the next gpr, so r11.
> How does it jump back to r5 which may have been used by the 3rd param?
> 
> 
My mistake, it is indeed r11 for the hidden param.
> 
> 
> 
>> It will not generate any rtx in the callee RTL code; it is just used to
>> check whether to allocate a parameter save area or not when the number
>> of args > 8.
>>
>> /* If there is no incoming register, we need a stack.  */
>>   entry_parm = rs6000_function_arg (args_so_far, arg);
>>   if (entry_parm == NULL)
>> return true;
>>
>>   /* Likewise if we need to pass both in registers and on the stack.  */
>>   if (GET_CODE (entry_parm) == PARALLEL
>>   && XEXP (XVECEXP (entry_parm, 0, 0), 0) == NULL_RTX)
>> return true;
> 
> Yes, this code in rs6000_parm_needs_stack() uses the rs6000_function_arg()
> return value as a boolean to tell us whether a parameter save area is required
> so what we return is unimportant other than to know it's not NULL_RTX.
> 
> I'm more concerned about the use of the target hook targetm.calls.function_arg
> used in the generic parts of the compiler.  What will that code do differently
> now that we return a reg rtx rather than NULL_RTX?  Might that code use
> the reg rtx to emit something?  I'd feel better if you could verify what
> happens in that code when we return a reg rtx for that 9th hidden param which
> isn't really being passed in a register.
> 

From my understanding, and from debugging the OpenBLAS test case, I see that
the reg rtx returned inside the IF condition in question is used only to check
whether a parameter save area is needed or not.

In the generic code in calls.cc, where targetm.calls.function_arg is called,
the returned rtx is used in the PARALLEL case to check whether the argument
must be passed both in registers and on the stack; if so, a store is emitted
with respect to the returned rtx. If the argument needs only registers,
nothing is emitted.

Thanks & Regards
Ajit
> 
> Peter
> 
> 
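
The check discussed in this thread can be modelled outside GCC. The
following is a minimal standalone sketch, not the rs6000 code itself:
ParamModel, CumArgsModel, summarize_params and parm_needs_stack are
hypothetical stand-ins. Only the hidden_string_length and
actual_parm_length fields, GP_ARG_NUM_REG (8) and the shape of the patched
condition are taken from the patch. The vector and floating-point paths of
rs6000_function_arg are ignored.

```cpp
#include <cassert>
#include <vector>

// Stand-in for GP_ARG_NUM_REG: r3..r10 carry the first eight integer args.
constexpr unsigned GP_ARG_NUM_REG_MODEL = 8;

// Hypothetical description of one formal parameter of the callee.
struct ParamModel {
  bool hidden_string_length; // models DECL_HIDDEN_STRING_LENGTH (arg)
  bool used;                 // false models "no default def or zero uses"
};

// Models the two fields the patch adds to the cumulative args.
struct CumArgsModel {
  bool hidden_string_length = false;
  unsigned actual_parm_length = 0;
};

// Models the loop added to init_cumulative_args: count all formal
// parameters, then subtract the hidden string lengths that are unused.
inline CumArgsModel summarize_params(const std::vector<ParamModel> &params) {
  CumArgsModel cum;
  unsigned num_args = 0;
  unsigned hidden_length = 0;
  for (const ParamModel &p : params) {
    ++num_args;
    if (p.hidden_string_length && !p.used) {
      cum.hidden_string_length = true;
      ++hidden_length;
    }
  }
  cum.actual_parm_length = num_args - hidden_length;
  return cum;
}

// Models the patched condition in rs6000_function_arg together with the
// first check in rs6000_parm_needs_stack: a NULL result means "the callee
// assumes the caller allocated a parameter save area".
inline bool parm_needs_stack(const CumArgsModel &cum, unsigned align_words) {
  const bool gets_reg_rtx =
      align_words < GP_ARG_NUM_REG_MODEL
      || (cum.hidden_string_length
          && cum.actual_parm_length <= GP_ARG_NUM_REG_MODEL);
  return !gets_reg_rtx;
}
```

In this model, eight used parameters followed by one unused hidden string
length give actual_parm_length == 8, so parm_needs_stack is false for the
ninth parameter (align_words == 8). If the hidden length is used, the ninth
parameter still falls through to the NULL case.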


[PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-05 Thread Ajit Agarwal
Hello Alex/Richard:

All review comments are incorporated.

The common infrastructure for load/store pair fusion is divided into
target-independent and target-dependent code.

The target-independent code is the generic code, with pure virtual
functions serving as the interface between target-independent and
target-dependent code.

The target-dependent code is the implementation of those pure virtual
functions for the aarch64 target, plus the call into the
target-independent code.
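
As an illustration of this split, here is a minimal standalone sketch (it
is not the patch). The hook name pair_mem_in_range_p and the LDP_*
constants follow the diffs in this thread; generic_pair_fusion,
aarch64_like_fusion and count_in_range are hypothetical names, and
long long stands in for HOST_WIDE_INT. The hook is assumed to return true
when the offset is in range.

```cpp
#include <cassert>
#include <vector>

// Target-independent side: generic code only sees the pure virtual hook.
struct generic_pair_fusion {
  // Return true if OFF is a valid offset for a paired access.
  virtual bool pair_mem_in_range_p(long long off) const = 0;

  // Hypothetical generic helper: count the candidate offsets that the
  // target reports as encodable. It never looks at target constants.
  unsigned count_in_range(const std::vector<long long> &offsets) const {
    unsigned n = 0;
    for (long long off : offsets)
      if (pair_mem_in_range_p(off))
        ++n;
    return n;
  }

  virtual ~generic_pair_fusion() = default;
};

// Target-dependent side: a 7-bit signed immediate, as in the aarch64
// constants quoted in the review.
struct aarch64_like_fusion final : generic_pair_fusion {
  static constexpr long long LDP_IMM_BITS = 7;
  static constexpr long long LDP_IMM_SIGN_BIT = 1LL << (LDP_IMM_BITS - 1);
  static constexpr long long LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
  static constexpr long long LDP_MIN_IMM = -LDP_MAX_IMM - 1;

  bool pair_mem_in_range_p(long long off) const override {
    return off >= LDP_MIN_IMM && off <= LDP_MAX_IMM;
  }
};
```

The point of the design is that count_in_range compiles without knowing
any aarch64 limit; another target would only supply its own
pair_mem_in_range_p.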

Thanks & Regards
Ajit


aarch64: Place target independent and dependent changed code in one file

The common infrastructure for load/store pair fusion is divided into
target-independent and target-dependent code.

The target-independent code is the generic code, with pure virtual
functions serving as the interface between target-independent and
target-dependent code.

The target-dependent code is the implementation of those pure virtual
functions for the aarch64 target, plus the call into the
target-independent code.

2024-04-06  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/aarch64/aarch64-ldp-fusion.cc: Place target
independent and dependent changed code.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
 1 file changed, 249 insertions(+), 122 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 22ed95eb743..cb21b514ef7 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -138,8 +138,122 @@ struct alt_base
   poly_int64 offset;
 };
 
+// Virtual base class for load/store walkers used in alias analysis.
+struct alias_walker
+{
+  virtual bool conflict_p (int &budget) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const  = 0;
+  virtual void advance () = 0;
+};
+
+struct pair_fusion {
+
+  pair_fusion () {};
+  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
+  bool load_p) = 0;
+
+  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
+  virtual bool pair_trailing_writeback_p () = 0;
+  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
+ machine_mode mem_mode) = 0;
+  virtual int pair_mem_alias_check_limit () = 0;
+  virtual bool handle_writeback_opportunities () = 0 ;
+  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
+   machine_mode mode) = 0;
+  virtual rtx gen_mem_pair (rtx *pats,  rtx writeback,
+   bool load_p) = 0;
+  virtual bool pair_mem_promote_writeback_p (rtx pat) = 0;
+  virtual bool track_load_p () = 0;
+  virtual bool track_store_p () = 0;
+  virtual bool cand_insns_empty_p (std::list &insns) = 0;
+  virtual bool pair_mem_in_range_p (HOST_WIDE_INT off) = 0;
+  void ldp_fusion_bb (bb_info *bb);
+
+  ~pair_fusion() { }
+};
+
+struct aarch64_pair_fusion : public pair_fusion
+{
+public:
+  aarch64_pair_fusion () : pair_fusion () {};
+  bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
+   bool load_p) override final
+  {
+const bool fpsimd_op_p
+  = reload_completed
+  ? (REG_P (reg_op) && FP_REGNUM_P (REGNO (reg_op)))
+  : (GET_MODE_CLASS (mem_mode) != MODE_INT
+&& (load_p || !aarch64_const_zero_rtx_p (reg_op)));
+return fpsimd_op_p;
+  }
+
+  bool pair_mem_promote_writeback_p (rtx pat)
+  {
+ if (reload_completed
+&& aarch64_ldp_writeback > 1
+&& GET_CODE (pat) == PARALLEL
+&& XVECLEN (pat, 0) == 2)
+   return true;
+
+ return false;
+  }
+
+  bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
+   machine_mode mode)
+  {
+return aarch64_mem_ok_with_ldpstp_policy_model (first_mem,
+load_p,
+mode);
+  }
+  bool pair_operand_mode_ok_p (machine_mode mode);
+
+  rtx gen_mem_pair (rtx *pats, rtx writeback, bool load_p);
+
+  bool pair_trailing_writeback_p  ()
+  {
+return aarch64_ldp_writeback > 1;
+  }
+  bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
+ machine_mode mem_mode)
+  {
+return (load_p
+? aarch64_ldp_reg_operand (reg_op, mem_mode)
+: aarch64_stp_reg_operand (reg_op, mem_mode));
+  }
+  int pair_mem_alias_check_limit ()
+  {
+return aarch64_ldp_alias_check_limit;
+  }
+  bool handle_writeback_opportunities ()
+  {
+return aarch64_ldp_writeback;
+  }
+  bool track_load_p ()
+  {
+const bool track_loads
+  = aarch64_tune_params.ldp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
+return track_loads;
+  }
+  bool track_store_p ()
+  {
+const bool track_stores
+  = aarch64_tune_params.stp_policy_model != AARCH64_LDP_STP_POLICY_NEVER;
+return track_stores;
+  }
+  bool cand_insns_empty_p (std::list &insns)
+  {
+return insns.empty();
+  }
+  bool pair_mem_in_range_p (HOST_WIDE_INT off)
+  {
+return (off < LDP_MIN_IMM || off

Re: [PATCH V3 0/2] aarch64: Place target independent and dependent changed code in one file.

2024-04-05 Thread Ajit Agarwal
Hello Alex:

On 03/04/24 8:51 pm, Alex Coplan wrote:
> On 23/02/2024 16:41, Ajit Agarwal wrote:
>> Hello Richard/Alex/Segher:
> 
> Hi Ajit,
> 
> Sorry for the delay and thanks for working on this.
> 
> Generally this looks like the right sort of approach (IMO) but I've left
> some comments below.
> 
> I'll start with a meta comment: in the subject line you have marked this
> as 0/2, but usually 0/n is reserved for the cover letter of a patch
> series and wouldn't contain an actual patch.  I think this might have
> confused the Linaro CI suitably such that it didn't run regression tests
> on the patch.
> 
Sorry about that. I have changed that in the latest patch.
>>
>> This patch adds the changed code for target independent and
>> dependent code for load store fusion.
>>
>> Common infrastructure of load store pair fusion is
>> divided into target independent and target dependent
>> changed code.
>>
>> Target independent code is the Generic code with
>> pure virtual function to interface between target
>> independent and dependent code.
>>
>> Target dependent code is the implementation of pure
>> virtual function for aarch64 target and the call
>> to target independent code.
>>
>> Bootstrapped for aarch64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>> aarch64: Place target independent and dependent changed code in one file.
>>
>> Common infrastructure of load store pair fusion is
>> divided into target independent and target dependent
>> changed code.
>>
>> Target independent code is the Generic code with
>> pure virtual function to interface betwwen target
>> independent and dependent code.
>>
>> Target dependent code is the implementation of pure
>> virtual function for aarch64 target and the call
>> to target independent code.
>>
>> 2024-02-23  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-ldp-fusion.cc: Place target
>>  independent and dependent changed code.
>> ---
>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 437 ---
>>  1 file changed, 305 insertions(+), 132 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> index 22ed95eb743..2ef22ff1e96 100644
>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> @@ -40,10 +40,10 @@
>>  
>>  using namespace rtl_ssa;
>>  
>> -static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
>> -static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
>> -static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
>> -static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << (PAIR_MEM_IMM_BITS - 1));
>> +static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;
> 
> These constants shouldn't be renamed: they are specific to aarch64 so the
> original names should be preserved in this file.
> 
> I expect we want to introduce virtual functions to validate an offset
> for a paired access.  The aarch64 code could then implement it by
> comparing the offset against LDP_{MIN,MAX}_IMM, and the generic code
> could use that hook to replace the code that queries those constants
> directly (i.e. in find_trailing_add and get_viable_bases).
> 
>>  
>>  // We pack these fields (load_p, fpsimd_p, and size) into an integer
>>  // (LFS) which we use as part of the key into the main hash tables.
>> @@ -138,8 +138,18 @@ struct alt_base
>>poly_int64 offset;
>>  };
>>  
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> +  virtual bool conflict_p (int &budget) const = 0;
>> +  virtual insn_info *insn () const = 0;
>> +  virtual bool valid () const  = 0;
>> +  virtual void advance () = 0;
>> +};
>> +
>> +
>>  // State used by the pass for a given basic block.
>> -struct ldp_bb_info
>> +struct pair_fusion
> 
> As a comment on the high-level design, I think we want a generic class
> for the overall pass, not just for the BB-specific structure.
> 
> That is because naturally we want the ldp_fusion_bb function itself to
> be a member of such a class, so that it can access virtual functi

[PATCH] aarch64: Preparatory patch to place target independent and dependent changed code in one file

2024-04-09 Thread Ajit Agarwal
Hello Alex/Richard:

All review comments are addressed.

The common infrastructure for load/store pair fusion is divided into
target-independent and target-dependent code.

The target-independent code is the generic code, with pure virtual
functions serving as the interface between target-independent and
target-dependent code.

The target-dependent code is the implementation of those pure virtual
functions for the aarch64 target, plus the call into the
target-independent code.
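
To make the hook interface more concrete, the following standalone sketch
models two points raised in review: a single writeback hook that takes an
enum and returns a boolean, and hooks that have default implementations.
It is not the patch. The enum and hook names follow the diff in this mail;
pair_fusion_model, aarch64_model, ldp_writeback_level and stores_enabled
are hypothetical. The mapping of WRITEBACK_PAIR_P to "level > 1" and of
WRITEBACK to "level != 0" is an assumption derived from the two separate
hooks in the V4 patch.

```cpp
#include <cassert>

enum class writeback { WRITEBACK_PAIR_P, WRITEBACK };

struct pair_fusion_model {
  // One hook instead of two: the kind of writeback is passed in.
  virtual bool handle_writeback_opportunities(writeback wback) const = 0;

  // Hooks with defaults: a target overrides them only when needed.
  virtual bool track_loads_p() const { return true; }
  virtual bool track_stores_p() const { return true; }

  virtual ~pair_fusion_model() = default;
};

struct aarch64_model final : pair_fusion_model {
  int ldp_writeback_level; // stand-in for aarch64_ldp_writeback
  bool stores_enabled;     // stand-in for the stp policy check

  aarch64_model(int level, bool stores)
    : ldp_writeback_level(level), stores_enabled(stores) {}

  bool handle_writeback_opportunities(writeback wback) const override {
    // Assumed mapping, see the text above the code.
    if (wback == writeback::WRITEBACK_PAIR_P)
      return ldp_writeback_level > 1;
    return ldp_writeback_level != 0;
  }

  // Only the store policy is overridden; loads keep the default (true).
  bool track_stores_p() const override { return stores_enabled; }
};
```

A target that does not care about a hook with a default, such as
track_loads_p here, simply does not override it.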

Thanks & Regards
Ajit

aarch64: Place target independent and dependent changed code in one file

The common infrastructure for load/store pair fusion is divided into
target-independent and target-dependent code.

The target-independent code is the generic code, with pure virtual
functions serving as the interface between target-independent and
target-dependent code.

The target-dependent code is the implementation of those pure virtual
functions for the aarch64 target, plus the call into the
target-independent code.

2024-04-09  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/aarch64/aarch64-ldp-fusion.cc: Place target
independent and dependent changed code.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 493 +++
 1 file changed, 333 insertions(+), 160 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 365dcf48b22..f840432d32a 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -138,6 +138,196 @@ struct alt_base
   poly_int64 offset;
 };
 
+// Virtual base class for load/store walkers used in alias analysis.
+struct alias_walker
+{
+  virtual bool conflict_p (int &budget) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const = 0;
+  virtual void advance () = 0;
+};
+
+// Forward declaration to be used inside the aarch64_pair_fusion class.
+bool ldp_operand_mode_ok_p (machine_mode mode);
+rtx aarch64_destructure_load_pair (rtx regs[2], rtx pattern);
+rtx aarch64_destructure_store_pair (rtx regs[2], rtx pattern);
+rtx aarch64_gen_writeback_pair (rtx wb_effect, rtx pair_mem, rtx regs[2],
+   bool load_p);
+enum class writeback{
+  WRITEBACK_PAIR_P,
+  WRITEBACK
+};
+
+struct pair_fusion {
+
+  pair_fusion ()
+  {
+calculate_dominance_info (CDI_DOMINATORS);
+df_analyze ();
+crtl->ssa = new rtl_ssa::function_info (cfun);
+  };
+  // Return true if the access is an FP/SIMD access, given the register
+  // operand rtx, the machine mode of the memory and load_p.
+  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
+  {
+return false;
+  }
+  // Return true if MODE is a suitable operand mode for a paired
+  // memory access.
+  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
+  // Return true if REG_OP is a suitable register operand for a paired
+  // access of mode MEM_MODE. LOAD_P is true for loads.
+  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
+ machine_mode mem_mode) = 0;
+  // Return alias check limit.
+  virtual int pair_mem_alias_check_limit () = 0;
+  // Return true if writeback opportunities of the kind given by
+  // WBACK should be handled.
+  virtual bool handle_writeback_opportunities (enum writeback wback) = 0 ;
+  // Return true if FIRST_MEM is ok under the ldp/stp policy model,
+  // given LOAD_P and MODE.
+  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
+   machine_mode mode) = 0;
+  // Generate the load/store pair. Return the pair rtx, given
+  // the load/store patterns PATS, the writeback rtx WRITEBACK
+  // and LOAD_P.
+  virtual rtx gen_mem_pair (rtx *pats, rtx writeback,
+   bool load_p) = 0;
+  // Return true if memory writeback can be promoted, given
+  // the insn and its rtx pattern. The bool reference (load_p)
+  // is set by this hook.
+  virtual bool pair_mem_promote_writeback_p (insn_info *, rtx, bool &)
+  {
+ return false;
+  }
+  // Return true if we track loads.
+  virtual bool track_loads_p ()
+  {
+return true;
+  }
+  // Return true if we track stores.
+  virtual bool track_stores_p ()
+  {
+return true;
+  }
+  // Return true if the candidate list INSNS is empty.
+  // This is needed to decide whether load/store pairs are skipped.
+  virtual bool cand_insns_empty_p (std::list &insns) = 0;
+  // Return true if offset is out of range.
+  virtual bool pair_mem_out_of_range_p (HOST_WIDE_INT off) = 0;
+  // Destructure a pair and return the resulting rtx, given the
+  // register rtxes REGS, the insn pattern RTI and LOAD_P.
+  virtual rtx gen_destructure_pair (rtx regs[2], rtx rti, bool load_p) = 0;
+  // Generate and return a writeback pair, given the writeback effect
+  // WB_EFFECT, the mem rtx MEM, the register rtxes REGS and LOAD_P.
+  virtual rtx gen_writeback_pair (rtx wb_effect, rtx mem,
+ rtx regs[2], bool load_p) = 0;
+  void ldp_fusion_bb (bb_info *bb);
+  insn_info * find_trailing_add (insn_info *insns[2],
+const insn_range_info &pair_range,
+int initial_writeback,
+rtx *writ

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-09 Thread Ajit Agarwal



On 05/04/24 10:03 pm, Alex Coplan wrote:
> On 05/04/2024 13:53, Ajit Agarwal wrote:
>> Hello Alex/Richard:
>>
>> All review comments are incorporated.
> 
> Thanks, I was kind-of expecting you to also send the renaming patch as a
> preparatory patch as we discussed.
> 
> Sorry for another meta comment, but: I think the reason that the Linaro
> CI isn't running tests on your patches is actually because you're
> sending 1/3 of a series but not sending the rest of the series.
> 
> So please can you either send this as an individual preparatory patch
> (not marked as a series) or if you're going to send a series (e.g. with
> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
> series when you make updates.  That way the CI should test your patches,
> which would be helpful.
>

Addressed.
 
>>
>> Common infrastructure of load store pair fusion is divided into target
>> independent and target dependent changed code.
>>
>> Target independent code is the Generic code with pure virtual function
>> to interface between target independent and dependent code.
>>
>> Target dependent code is the implementation of pure virtual function for
>> aarch64 target and the call to target independent code.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> aarch64: Place target independent and dependent changed code in one file
>>
>> Common infrastructure of load store pair fusion is divided into target
>> independent and target dependent changed code.
>>
>> Target independent code is the Generic code with pure virtual function
>> to interface between target independent and dependent code.
>>
>> Target dependent code is the implementation of pure virtual function for
>> aarch64 target and the call to target independent code.
>>
>> 2024-04-06  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-ldp-fusion.cc: Place target
>>  independent and dependent changed code.
> 
> You're going to need a proper ChangeLog eventually, but I guess there's
> no need for that right now.
> 
>> ---
>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> index 22ed95eb743..cb21b514ef7 100644
>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> @@ -138,8 +138,122 @@ struct alt_base
>>poly_int64 offset;
>>  };
>>  
>> +// Virtual base class for load/store walkers used in alias analysis.
>> +struct alias_walker
>> +{
>> +  virtual bool conflict_p (int &budget) const = 0;
>> +  virtual insn_info *insn () const = 0;
>> +  virtual bool valid () const  = 0;
> 
> Heh, looking at this made me realise there is a whitespace bug here in
> the existing code (double space after const).  Sorry about that!  I'll
> push an obvious fix for that.
> 
>> +  virtual void advance () = 0;
>> +};
>> +
>> +struct pair_fusion {
>> +
>> +  pair_fusion () {};
> 
> This ctor looks pointless at the moment.  Perhaps instead we could put
> the contents of ldp_fusion_init in here and then delete that function?
> 

Addressed.

>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>> +   bool load_p) = 0;
> 
> Please can we have comments above each of these virtual functions
> describing any parameters, what the purpose of the hook is, and the
> interpretation of the return value?  This will serve as the
> documentation for other targets that want to make use of the pass.
> 
> It might make sense to have a default-false implementation for
> fpsimd_op_p, especially if you don't want to make use of this bit for
> rs6000.
>

Addressed.
 
>> +
>> +  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
>> +  virtual bool pair_trailing_writeback_p () = 0;
> 
> Sorry for the run-around, but: I think this and
> handle_writeback_opportunities () should be the same function, either
> returning an enum or taking an enum and returning a boolean.
> 
> At a minimum they should have more similar sounding names.
> 

Addressed.

>> +  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
>> +  machine_mode mem_mode) = 0;
>> +  virtual int pair_mem_alias_check_limit () = 0;
>> +  virtual bool handle_writeback_opportunities () = 0 ;
>> +  virtual bool pair_mem_ok_

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-09 Thread Ajit Agarwal
Hello Alex:

On 09/04/24 7:29 pm, Alex Coplan wrote:
> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>
>>
>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>> Hello Alex/Richard:
>>>>
>>>> All review comments are incorporated.
>>>
>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
>>> preparatory patch as we discussed.
>>>
>>> Sorry for another meta comment, but: I think the reason that the Linaro
>>> CI isn't running tests on your patches is actually because you're
>>> sending 1/3 of a series but not sending the rest of the series.
>>>
>>> So please can you either send this as an individual preparatory patch
>>> (not marked as a series) or if you're going to send a series (e.g. with
>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
>>> series when you make updates.  That way the CI should test your patches,
>>> which would be helpful.
>>>
>>
>> Addressed.
>>  
>>>>
>>>> Common infrastructure of load store pair fusion is divided into target
>>>> independent and target dependent changed code.
>>>>
>>>> Target independent code is the Generic code with pure virtual function
>>>> to interface between target independent and dependent code.
>>>>
>>>> Target dependent code is the implementation of pure virtual function for
>>>> aarch64 target and the call to target independent code.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>>
>>>> aarch64: Place target independent and dependent changed code in one file
>>>>
>>>> Common infrastructure of load store pair fusion is divided into target
>>>> independent and target dependent changed code.
>>>>
>>>> Target independent code is the Generic code with pure virtual function
>>>> to interface between target independent and dependent code.
>>>>
>>>> Target dependent code is the implementation of pure virtual function for
>>>> aarch64 target and the call to target independent code.
>>>>
>>>> 2024-04-06  Ajit Kumar Agarwal  
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>* config/aarch64/aarch64-ldp-fusion.cc: Place target
>>>>independent and dependent changed code.
>>>
>>> You're going to need a proper ChangeLog eventually, but I guess there's
>>> no need for that right now.
>>>
>>>> ---
>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>> index 22ed95eb743..cb21b514ef7 100644
>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>> @@ -138,8 +138,122 @@ struct alt_base
>>>>poly_int64 offset;
>>>>  };
>>>>  
>>>> +// Virtual base class for load/store walkers used in alias analysis.
>>>> +struct alias_walker
>>>> +{
>>>> +  virtual bool conflict_p (int &budget) const = 0;
>>>> +  virtual insn_info *insn () const = 0;
>>>> +  virtual bool valid () const  = 0;
>>>
>>> Heh, looking at this made me realise there is a whitespace bug here in
>>> the existing code (double space after const).  Sorry about that!  I'll
>>> push an obvious fix for that.
>>>
>>>> +  virtual void advance () = 0;
>>>> +};
>>>> +
>>>> +struct pair_fusion {
>>>> +
>>>> +  pair_fusion () {};
>>>
>>> This ctor looks pointless at the moment.  Perhaps instead we could put
>>> the contents of ldp_fusion_init in here and then delete that function?
>>>
>>
>> Addressed.
>>
>>>> +  virtual bool fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
>>>> + bool load_p) = 0;
>>>
>>> Please can we have comments above each of these virtual functions
>>> describing any parameters, what the purpose of the hook is, and the
>>> interpretation of the return value?  This will serve as the
>>> documentation for other targets that want to make use of the pass.
>>>

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-09 Thread Ajit Agarwal
Hello Alex:

On 09/04/24 8:39 pm, Alex Coplan wrote:
> On 09/04/2024 20:01, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>
>>>>
>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>> Hello Alex/Richard:
>>>>>>
>>>>>> All review comments are incorporated.
>>>>>
>>>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
>>>>> preparatory patch as we discussed.
>>>>>
>>>>> Sorry for another meta comment, but: I think the reason that the Linaro
>>>>> CI isn't running tests on your patches is actually because you're
>>>>> sending 1/3 of a series but not sending the rest of the series.
>>>>>
>>>>> So please can you either send this as an individual preparatory patch
>>>>> (not marked as a series) or if you're going to send a series (e.g. with
>>>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
>>>>> series when you make updates.  That way the CI should test your patches,
>>>>> which would be helpful.
>>>>>
>>>>
>>>> Addressed.
>>>>  
>>>>>>
>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>> independent and target dependent changed code.
>>>>>>
>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>> to interface between target independent and dependent code.
>>>>>>
>>>>>> Target dependent code is the implementation of pure virtual function for
>>>>>> aarch64 target and the call to target independent code.
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>
>>>>>>
>>>>>> aarch64: Place target independent and dependent changed code in one file
>>>>>>
>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>> independent and target dependent changed code.
>>>>>>
>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>> to interface between target independent and dependent code.
>>>>>>
>>>>>> Target dependent code is the implementation of pure virtual function for
>>>>>> aarch64 target and the call to target independent code.
>>>>>>
>>>>>> 2024-04-06  Ajit Kumar Agarwal  
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>>  * config/aarch64/aarch64-ldp-fusion.cc: Place target
>>>>>>  independent and dependent changed code.
>>>>>
>>>>> You're going to need a proper ChangeLog eventually, but I guess there's
>>>>> no need for that right now.
>>>>>
>>>>>> ---
>>>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
>>>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>>>>>
>>>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
>>>>>> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>> index 22ed95eb743..cb21b514ef7 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>> @@ -138,8 +138,122 @@ struct alt_base
>>>>>>poly_int64 offset;
>>>>>>  };
>>>>>>  
>>>>>> +// Virtual base class for load/store walkers used in alias analysis.
>>>>>> +struct alias_walker
>>>>>> +{
>>>>>> +  virtual bool conflict_p (int &budget) const = 0;
>>>>>> +  virtual insn_info *insn () const = 0;
>>>>>> +  virtual bool valid () const  = 0;
>>>>>
>>>>> Heh, looking at this made me realise there is a whitespace bug here in
>>>>> the existing code (double space after const).  Sorry about that!  I'll
>>>>> push an obvious fix for that.
>>>>>
>>>>>> +  virtual void advance () = 0;
>>>>>> +};
>&

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-10 Thread Ajit Agarwal
Hello Alex:

On 10/04/24 1:42 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 09/04/2024 20:59, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>
>>>>>>
>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>> Hello Alex/Richard:
>>>>>>>>
>>>>>>>> All review comments are incorporated.
>>>>>>>
>>>>>>> Thanks, I was kind-of expecting you to also send the renaming patch as a
>>>>>>> preparatory patch as we discussed.
>>>>>>>
>>>>>>> Sorry for another meta comment, but: I think the reason that the Linaro
>>>>>>> CI isn't running tests on your patches is actually because you're
>>>>>>> sending 1/3 of a series but not sending the rest of the series.
>>>>>>>
>>>>>>> So please can you either send this as an individual preparatory patch
>>>>>>> (not marked as a series) or if you're going to send a series (e.g. with
>>>>>>> a preparatory rename patch as 1/2 and this as 2/2) then send the entire
>>>>>>> series when you make updates.  That way the CI should test your patches,
>>>>>>> which would be helpful.
>>>>>>>
>>>>>>
>>>>>> Addressed.
>>>>>>  
>>>>>>>>
>>>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>>>> independent and target dependent changed code.
>>>>>>>>
>>>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>>>> to interface between target independent and dependent code.
>>>>>>>>
>>>>>>>> Target dependent code is the implementation of pure virtual function 
>>>>>>>> for
>>>>>>>> aarch64 target and the call to target independent code.
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Ajit
>>>>>>>>
>>>>>>>>
>>>>>>>> aarch64: Place target independent and dependent changed code in one 
>>>>>>>> file
>>>>>>>>
>>>>>>>> Common infrastructure of load store pair fusion is divided into target
>>>>>>>> independent and target dependent changed code.
>>>>>>>>
>>>>>>>> Target independent code is the Generic code with pure virtual function
>>>>>>>> to interface between target independent and dependent code.
>>>>>>>>
>>>>>>>> Target dependent code is the implementation of pure virtual function 
>>>>>>>> for
>>>>>>>> aarch64 target and the call to target independent code.
>>>>>>>>
>>>>>>>> 2024-04-06  Ajit Kumar Agarwal  
>>>>>>>>
>>>>>>>> gcc/ChangeLog:
>>>>>>>>
>>>>>>>>* config/aarch64/aarch64-ldp-fusion.cc: Place target
>>>>>>>>independent and dependent changed code.
>>>>>>>
>>>>>>> You're going to need a proper ChangeLog eventually, but I guess there's
>>>>>>> no need for that right now.
>>>>>>>
>>>>>>>> ---
>>>>>>>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 371 +++
>>>>>>>>  1 file changed, 249 insertions(+), 122 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
>>>>>>>> b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>>>> index 22ed95eb743..cb21b514ef7 100644
>>>>>>>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>>>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>>>>>>>> @@ -138,8 +138,122 @@ struct

[PATCH v1] aarch64: Preparatory Patch to place target independent and dependent changed code in one file

2024-04-10 Thread Ajit Agarwal
Hello Alex/Richard:

All comments are addressed in this version-1 of the patch.

Common infrastructure of load store pair fusion is divided into target
independent and target dependent changed code.

Target independent code is the generic code, with pure virtual functions
as the interface between target independent and dependent code.

Target dependent code is the implementation of the pure virtual functions for
the aarch64 target and the call to target independent code.

Thanks & Regards
Ajit


aarch64: Place target independent and dependent changed code in one file

Common infrastructure of load store pair fusion is divided into target
independent and target dependent changed code.

Target independent code is the generic code, with pure virtual functions
as the interface between target independent and dependent code.

Target dependent code is the implementation of the pure virtual functions for
the aarch64 target and the call to target independent code.

2024-04-10  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/aarch64/aarch64-ldp-fusion.cc: Place target
independent and dependent changed code.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 497 +++
 1 file changed, 337 insertions(+), 160 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 365dcf48b22..03e8572ebfd 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -138,6 +138,198 @@ struct alt_base
   poly_int64 offset;
 };
 
+// Virtual base class for load/store walkers used in alias analysis.
+struct alias_walker
+{
+  virtual bool conflict_p (int &budget) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const = 0;
+  virtual void advance () = 0;
+};
+
+// Forward declaration to be used inside the aarch64_pair_fusion class.
+bool ldp_operand_mode_ok_p (machine_mode mode);
+rtx aarch64_destructure_load_pair (rtx regs[2], rtx pattern);
+rtx aarch64_destructure_store_pair (rtx regs[2], rtx pattern);
+rtx aarch64_gen_writeback_pair (rtx wb_effect, rtx pair_mem, rtx regs[2],
+   bool load_p);
+enum class writeback{
+  WRITEBACK_PAIR_P,
+  WRITEBACK
+};
+
+struct pair_fusion {
+
+  pair_fusion ()
+  {
+calculate_dominance_info (CDI_DOMINATORS);
+df_analyze ();
+crtl->ssa = new rtl_ssa::function_info (cfun);
+  };
+  // Return true if GPR is FP or SIMD accesses, passed
+  // with GPR reg_op rtx, machine mode and load_p.
+  virtual bool fpsimd_op_p (rtx, machine_mode, bool)
+  {
+return false;
+  }
+  // Return true if pair operand mode is ok. Passed with
+  // machine mode.
+  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
+  // Return true if reg operand is ok, passed with load_p,
+  // reg_op rtx and machine mode.
+  virtual bool pair_reg_operand_ok_p (bool load_p, rtx reg_op,
+ machine_mode mem_mode) = 0;
+  // Return alias check limit.
+  virtual int pair_mem_alias_check_limit () = 0;
+  // Return true if there is writeback opportunities. Passed
+  // with enum writeback.
+  virtual bool handle_writeback_opportunities (enum writeback wback) = 0 ;
+  // Return true if mem ok ldp stp policy model passed with
+  // rtx mem, load_p and machine mode.
+  virtual bool pair_mem_ok_with_policy (rtx first_mem, bool load_p,
+   machine_mode mode) = 0;
+  // Gen load store mem pair. Return load store rtx passed
+  // with arguments load store pattern, writeback rtx and
+  // load_p.
+  virtual rtx gen_mem_pair (rtx *pats, rtx writeback,
+   bool load_p) = 0;
+  // Return true if memory writeback can be promoted, passed
+  // with insn, rtx pattern and load_p. load_p is set by this
+  // hook.
+  virtual bool pair_mem_promote_writeback_p (insn_info *, rtx, bool &)
+  {
+ return false;
+  }
+  // Return true if we track loads.
+  virtual bool track_loads_p ()
+  {
+return true;
+  }
+  // Return true if we track stores.
+  virtual bool track_stores_p ()
+  {
+return true;
+  }
+  // Return true if offset is out of range.
+  virtual bool pair_mem_out_of_range_p (HOST_WIDE_INT off) = 0;
+  // Return destructure pair. Passed with rtx reg, insn pattern
+  // and load_p.
+  virtual rtx gen_destructure_pair (rtx regs[2], rtx rti, bool load_p) = 0;
+  // Return writeback pair. Passed with rtx writeback effect, mem rtx
+  // regs rtx and load_p.
+  virtual rtx gen_writeback_pair (rtx wb_effect, rtx mem,
+ rtx regs[2], bool load_p) = 0;
+  // Return true if offset is aligned and multiple of 32.
+  // Passed with offset and access_size to check multiple of 32.
+  virtual bool pair_offset_alignment_ok_p (poly_int64 offset,
+  unsigned access_size) = 0;
+  void ldp_fusion_bb (bb_info *bb);
+  insn_info * find_trailing_add (insn_info *insns[2],
+const insn_range_info &pai

Re: [PATCH V4 1/3] aarch64: Place target independent and dependent changed code in one file

2024-04-10 Thread Ajit Agarwal
Hello Alex:

On 10/04/24 7:52 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 10/04/2024 15:31, Ajit Agarwal wrote:
>> Hello Alex:
>>
>> On 10/04/24 1:42 pm, Alex Coplan wrote:
>>> Hi Ajit,
>>>
>>> On 09/04/2024 20:59, Ajit Agarwal wrote:
>>>> Hello Alex:
>>>>
>>>> On 09/04/24 8:39 pm, Alex Coplan wrote:
>>>>> On 09/04/2024 20:01, Ajit Agarwal wrote:
>>>>>> Hello Alex:
>>>>>>
>>>>>> On 09/04/24 7:29 pm, Alex Coplan wrote:
>>>>>>> On 09/04/2024 17:30, Ajit Agarwal wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/04/24 10:03 pm, Alex Coplan wrote:
>>>>>>>>> On 05/04/2024 13:53, Ajit Agarwal wrote:
>>>>>>>>>> Hello Alex/Richard:
>>>>>>>>>>
>>>>>>>>>> All review comments are incorporated.
> 
>>>>>>>>>> @@ -2890,8 +3018,8 @@ ldp_bb_info::merge_pairs (insn_list_t 
>>>>>>>>>> &left_list,
>>>>>>>>>>  // of accesses.  If we find two sets of adjacent accesses, call
>>>>>>>>>>  // merge_pairs.
>>>>>>>>>>  void
>>>>>>>>>> -ldp_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> - access_group &group)
>>>>>>>>>> +pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>>>>>>>>>> + access_group &group)
>>>>>>>>>>  {
>>>>>>>>>>const auto lfs = decode_lfs (encoded_lfs);
>>>>>>>>>>const unsigned access_size = lfs.size;
>>>>>>>>>> @@ -2909,7 +3037,7 @@ ldp_bb_info::transform_for_base (int 
>>>>>>>>>> encoded_lfs,
>>>>>>>>>> access.cand_insns,
>>>>>>>>>> lfs.load_p,
>>>>>>>>>> access_size);
>>>>>>>>>> -  skip_next = access.cand_insns.empty ();
>>>>>>>>>> +  skip_next = bb_state->cand_insns_empty_p (access.cand_insns);
>>>>>>>>>
>>>>>>>>> As above, why is this needed?
>>>>>>>>
>>>>>>>> For rs6000 we always want to return true, because the load/store
>>>>>>>> pairs being merged occur as 8/16, 16/32, 32/64 for rs6000, and we
>>>>>>>> want the pairs to be 8/16 and 32/64. That's why we want to always
>>>>>>>> return true for rs6000, to skip pairs as above.
>>>>>>>
>>>>>>> Hmm, sorry, I'm not sure I follow.  Are you saying that for rs6000 you 
>>>>>>> have
>>>>>>> load/store pair instructions where the two arms of the access are 
>>>>>>> storing
>>>>>>> operands of different sizes?  Or something else?
>>>>>>>
>>>>>>> As it stands the logic is to skip the next iteration only if we
>>>>>>> exhausted all the candidate insns for the current access.  In the case
>>>>>>> that we didn't exhaust all such candidates, then the idea is that when
>>>>>>> access becomes prev_access, we can attempt to use those candidates as
>>>>>>> the "left-hand side" of a pair in the next iteration since we failed to
>>>>>>> use them as the "right-hand side" of a pair in the current iteration.
>>>>>>> I don't see why you wouldn't want that behaviour.  Please can you
>>>>>>> explain?
>>>>>>>
>>>>>>
>>>>>> In merge_pairs we get the 2 load candidates: one load from offset 0 and
>>>>>> the other load from offset 16. Then in the next iteration we get a load
>>>>>> from offset 16 and the other load from offset 32. In the next iteration
>>>>>> we get a load from offset 32 and the other load from offset 48.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> Currently we get the load candidates as follows.
>>>>>>
>>>>>> pairs:
>>>>>>
>>>>>> load from 0
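
The skip_next behaviour discussed above can be modelled with a small
standalone sketch. This is not the pass itself: access_sketch and
pair_adjacent are hypothetical names, candidates are reduced to a simple
count per offset, and merging is assumed to always succeed for as many
candidates as both sides have.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model: one entry per distinct offset (sorted ascending),
// holding how many candidate insns are still unpaired at that offset.
struct access_sketch
{
  int offset;
  int num_cands;
};

// Walk neighbouring accesses and pair prev with cur when they are exactly
// access_size apart.  Returns the (left, right) offsets that were paired.
inline std::vector<std::pair<int, int>>
pair_adjacent (std::vector<access_sketch> accesses, int access_size)
{
  std::vector<std::pair<int, int>> pairs;
  bool skip_next = true;            // The first access has no predecessor.
  access_sketch *prev = nullptr;

  for (auto &cur : accesses)
    {
      if (skip_next)
        skip_next = false;
      else if (prev->offset + access_size == cur.offset)
        {
          // Assumption: merging consumes one candidate from each side
          // for every pair formed.
          int n = std::min (prev->num_cands, cur.num_cands);
          if (n > 0)
            {
              prev->num_cands -= n;
              cur.num_cands -= n;
              pairs.push_back ({prev->offset, cur.offset});
            }
          // Only skip the next iteration if cur has nothing left to
          // offer as the left-hand side of a later pair.
          skip_next = (cur.num_cands == 0);
        }
      prev = &cur;
    }
  return pairs;
}
```

With one candidate at each of the offsets 0, 16, 32 and 48, this model
pairs 0/16 and 32/48. A 16/32 pair only appears when offset 16 still has
an unused candidate after the 0/16 merge.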

Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Ajit Agarwal
Hello Alex:

On 24/01/24 10:13 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 21/01/2024 19:57, Ajit Agarwal wrote:
>>
>> Hello All:
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
> 
> Thanks for this, it would be nice to see the load/store pair pass
> generalized to multiple targets.
> 
> I assume you are targeting GCC 15 for this, as we are in stage 4 at
> the moment?
> 
>>
>> Common routines are refactored in fusion-common.h.
>>
>> AARCH64 load/store fusion pass is not changed with the 
>> common infrastructure.
> 
> I think any patch to generalize the load/store pair fusion pass should
> update the aarch64 code at the same time to use the generic
> infrastructure, instead of duplicating the code.
> 
> As a general comment, I think we should move as much of the code as
> possible to target-independent code, with only the bits that are truly
> target-specific (e.g. deciding which modes to allow for a load/store
> pair operand) in target code.
> 
> In terms of structuring the interface between generic code and target
> code, I think it would be pragmatic to use a class with (in some cases,
> pure) virtual functions that can be overridden by targets to implement
> any target-specific behaviour.
> 
> IMO the generic class should be implemented in its own .cc instead of
> using a header-only approach.  The target code would then define a
> derived class which overrides the virtual functions (where necessary)
> declared in the generic class, and then instantiate the derived class to
> create a target-customized instance of the pass.

Incorporated the above comments in the recent patch sent.
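
As a rough illustration of the structure suggested above (a generic class
with virtual hooks, and a derived class per target), here is a minimal
standalone sketch. All names (operand_mode, pair_fusion_sketch,
aarch64_like_fusion, rs6000_like_fusion) and the accepted sizes are
hypothetical and chosen only for illustration; they are not taken from
the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for a machine mode: just an operand size in bytes.
struct operand_mode
{
  unsigned size_bytes;
};

// Generic, target-independent part of the pass.
struct pair_fusion_sketch
{
  virtual ~pair_fusion_sketch () = default;

  // Pure virtual hook: each target must say which operand modes can pair.
  virtual bool pair_operand_mode_ok_p (operand_mode mode) const = 0;

  // Virtual hook with a default: targets override it only when needed.
  virtual bool track_stores_p () const { return true; }

  // Generic driver logic, written purely in terms of the hooks.
  bool can_fuse (operand_mode mode, bool load_p) const
  {
    if (!load_p && !track_stores_p ())
      return false;
    return pair_operand_mode_ok_p (mode);
  }
};

// Illustrative target that accepts 4-, 8- and 16-byte operands.
struct aarch64_like_fusion : pair_fusion_sketch
{
  bool pair_operand_mode_ok_p (operand_mode mode) const override
  {
    return mode.size_bytes == 4 || mode.size_bytes == 8
           || mode.size_bytes == 16;
  }
};

// Illustrative target that pairs only 16-byte loads (an assumption).
struct rs6000_like_fusion : pair_fusion_sketch
{
  bool pair_operand_mode_ok_p (operand_mode mode) const override
  {
    return mode.size_bytes == 16;
  }
  bool track_stores_p () const override { return false; }
};

// Usage: the generic code only ever sees the base class.
inline void example_usage ()
{
  aarch64_like_fusion a;
  rs6000_like_fusion r;
  const pair_fusion_sketch &pass_a = a;
  const pair_fusion_sketch &pass_r = r;
  assert (pass_a.can_fuse ({8}, /*load_p=*/false));
  assert (!pass_r.can_fuse ({16}, /*load_p=*/false));
  assert (pass_r.can_fuse ({16}, /*load_p=*/true));
}
```

The point of the sketch is that both sides can keep their own state in
the class hierarchy, which is harder to arrange with free-standing target
hooks.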
> 
> A more traditional GCC approach would be to use optabs and target hooks
> to customize the behaviour of the pass to handle target-specific
> aspects, but:
>  - Target hooks are quite heavyweight, and we'd potentially have to add
>quite a few hooks just for one pass that (at least initially) will
>only be used by a couple of targets.
>  - Using classes allows both sides to easily maintain their own state
>and share that state where appropriate.
> 
> Nit on naming: I understand you want to move away from ldp_fusion, but
> how about pair_fusion or mem_pair_fusion instead of just "fusion" as a
> base name?  IMO just "fusion" isn't very clear as to what the pass is
> trying to achieve.
> 

I have made it pair_fusion.

> In general the code could do with a lot more commentary to explain the
> rationale for various things / explain the high-level intent of the
> code.
> 
> Unfortunately I'm not familiar with the DF framework (I've only really
> worked with RTL-SSA for the aarch64 pass), so I haven't commented on the
> use of that framework, but it would be nice if what you're trying to do
> could be done using RTL-SSA instead of using DF directly.
>

I have used RTL-SSA def-use information in many places in the recent patch.
But the DF framework is useful because it provides a pointer to the rtx
through DF_REF_LOC, which we can then easily modify. This is missing in
RTL-SSA, so wherever a LOC needs to be changed I have used the DF
framework in the recent patch.

 
> Hopefully Richard S can chime in on those aspects.
> 
> My main concerns with the patch at the moment (apart from the code
> duplication) is that it looks like:
> 
>  - The patch removes alias analysis from try_fuse_pair, which is unsafe.
>  - The patch tries to make its own RTL changes inside
>rs6000_gen_load_pair, but it should let fuse_pair make those changes
>using RTL-SSA instead.
>

It was my mistake to remove alias analysis from try_fuse_pair.
In the recent patch I kept all the code in aarch64-ldp-fusion
intact, apart from organizing the generic and target dependent code
through pure virtual functions.
 
> I've left some more specific (but still mostly high-level) comments below.
> 
>>
>> For AARCH64 architectures just include "fusion-common.h"
>> and target dependent code can be added to that.
>>
>>
>> Alex/Richard:
>>
>> If you would like me to add for AARCH64 I can do that for AARCH64.
>>
>> If you would like to do that is fine with me.
>>
>> Bootstrapped and regtested with powerpc64-linux-gnu.
>>
>> Improvement in performance is seen with Spec 2017 spec FP benchmarks.
>>
>> Thanks & Regards
>> Ajit
>>
>> rs6000: New  pass for replacement of adjacent lxv with lxvp.
> 
> Are you looking to handle stores eventually, out of interest?  Looking
> at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling
>

Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal
Hello Richard:


On 14/02/24 4:03 pm, Richard Sandiford wrote:
> Hi,
> 
> Thanks for working on this.
> 
> You posted a version of this patch on Sunday too.  If you need to repost
> to fix bugs or make other improvements, could you describe the changes
> that you've made since the previous version?  It makes things easier
> to follow.

Sure. Sorry, I forgot to add that.

> 
> Also, sorry for starting with a meta discussion about reviews, but
> there are multiple types of review comment, including:
> 
> (1) Suggestions for changes that are worded as suggestions.
> 
> (2) Suggestions for changes that are worded as questions ("Wouldn't it be
> better to do X?", etc).
> 
> (3) Questions asking for an explanation or for more information.
> 
> Just sending a new patch makes sense when the previous review comments
> were all like (1), and arguably also (1)+(2).  But Alex's previous review
> included (3) as well.  Could you go back and respond to his questions there?
> It would help understand some of the design choices.
>

I have responded to Alex's comments on the previous patches.
I have incorporated all of his comments in this patch.

 
> A natural starting point when reviewing a patch like this is to diff
> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc.  This shows
> many of the kind of changes that I'd expect.  But it also seems to include
> some code reordering, such as putting fuse_pair after try_fuse_pair.
> If some reordering is necessary, could you try to organise the patch as
> a series in which the reordering is a separate step?  It's a bit hard
> to review at the moment.  (Reordering for cosmetic reasons is also OK,
> but again please separate it out for ease of review.)
> 
> Maybe one way of making the review easier would be to split the aarch64
> pass into the "target-dependent" and "target-independent" pieces
> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
> (as separate patches) move the target-independent pieces outside
> config/aarch64.
> 
Sure I will do that.

> The patch includes:
> 
>>  * emit-rtl.cc: Modify ge with gt on PolyINT data structure.
>>  * dce.cc: Add changes not to delete the load store pair.
>>  * rtl-ssa/changes.cc: Modified assert code.
>>  * var-tracking.cc: Modified assert code.
>>  * df-problems.cc: Not to generate REG_UNUSED for multi
>>  word registers that is required for rs6000 target.
> 
> Please submit these separately, as independent preparatory patches,
> with an explanation for why they're needed & correct.  But:
> 
Sure I will do that.

>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>> index 88ee0dd67fc..a8d0ee7c4db 100644
>> --- a/gcc/df-problems.cc
>> +++ b/gcc/df-problems.cc
>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>> df_mw_hardreg *mws,
>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>  {
>>unsigned int regno = mws->start_regno;
>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>dead_debug_insert_temp (debug, regno, insn, 
>> DEBUG_TEMP_AFTER_WITH_REG);
>>  
>>if (REG_DEAD_DEBUGGING)
>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>> df_mw_hardreg *mws,
>>  if (!bitmap_bit_p (live, r)
>>  && !bitmap_bit_p (artificial_uses, r))
>>{
>> -df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>> +   // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>  dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>>  if (REG_DEAD_DEBUGGING)
>>df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>  || bitmap_bit_p (artificial_uses, dregno)
>>  || df_ignore_stack_reg (dregno)))
>>  {
>> -  rtx reg = (DF_REF_LOC (def))
>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>> -  df_set_note (REG_UNUSED, insn, reg);
>> +  //rtx reg = (DF_REF_LOC (def))
>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>> +  //df_set_note (REG_UNUSED, insn, reg);
>>dead_debug_insert_temp (debug, dregno, insn, 
>> DEBUG_TEMP_AFTER_WITH_REG);
>>if (REG_DEAD_DEBUGGING)
>>  df_print_note ("adding 3: ", insn, REG_NOTES (insn));
> 
> I don't think this can be right.  The last hunk of the var-tracking.cc
> patch also seems to be reverting a correct change.
> 

We generate sequential registers using (subreg V16QI (reg OOmode) 16)
and (reg OOmode 0),
where OOmode is 256 bits and V16QI is 128 bits, in order to generate a
sequential register pair. If I keep the above REG_UNUSED notes, IRA generates
REG_UNUSED, and the cprop_hardreg and dce passes then delete the store pairs,
so we get incorrect code.

With the REG_UNUSED notes commented out, the notes are not generated, we get
the correct store pair fusion, and cprop_hardreg and dce don't delete the pairs.

Ple

Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal



On 14/02/24 7:22 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> 
> On 14/02/24 4:03 pm, Richard Sandiford wrote:
>> Hi,
>>
>> Thanks for working on this.
>>
>> You posted a version of this patch on Sunday too.  If you need to repost
>> to fix bugs or make other improvements, could you describe the changes
>> that you've made since the previous version?  It makes things easier
>> to follow.
> 
> Sure. Sorry, I forgot to add that.

There were certain asserts that I had removed in the earlier
patch that I sent on Sunday; I forgot to keep them, and
I have restored them in this patch.
My rtl_dce changes were also not deleting some of the unwanted
moves, so I changed the code to address that in this patch.

Thanks & Regards
Ajit
> 
>>
>> Also, sorry for starting with a meta discussion about reviews, but
>> there are multiple types of review comment, including:
>>
>> (1) Suggestions for changes that are worded as suggestions.
>>
>> (2) Suggestions for changes that are worded as questions ("Wouldn't it be
>> better to do X?", etc).
>>
>> (3) Questions asking for an explanation or for more information.
>>
>> Just sending a new patch makes sense when the previous review comments
>> were all like (1), and arguably also (1)+(2).  But Alex's previous review
>> included (3) as well.  Could you go back and respond to his questions there?
>> It would help understand some of the design choices.
>>
> 
> I have responded to Alex's comments on the previous patches.
> I have incorporated all of his comments in this patch.
> 
>  
>> A natural starting point when reviewing a patch like this is to diff
>> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc.  This shows
>> many of the kind of changes that I'd expect.  But it also seems to include
>> some code reordering, such as putting fuse_pair after try_fuse_pair.
>> If some reordering is necessary, could you try to organise the patch as
>> a series in which the reordering is a separate step?  It's a bit hard
>> to review at the moment.  (Reordering for cosmetic reasons is also OK,
>> but again please separate it out for ease of review.)
>>
>> Maybe one way of making the review easier would be to split the aarch64
>> pass into the "target-dependent" and "target-independent" pieces
>> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
>> (as separate patches) move the target-independent pieces outside
>> config/aarch64.
>>
> Sure I will do that.
> 
>> The patch includes:
>>
>>> * emit-rtl.cc: Modify ge with gt on PolyINT data structure.
>>> * dce.cc: Add changes not to delete the load store pair.
>>> * rtl-ssa/changes.cc: Modified assert code.
>>> * var-tracking.cc: Modified assert code.
>>> * df-problems.cc: Not to generate REG_UNUSED for multi
>>> word registers that is required for rs6000 target.
>>
>> Please submit these separately, as independent preparatory patches,
>> with an explanation for why they're needed & correct.  But:
>>
> Sure I will do that.
> 
>>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>>> index 88ee0dd67fc..a8d0ee7c4db 100644
>>> --- a/gcc/df-problems.cc
>>> +++ b/gcc/df-problems.cc
>>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>>  {
>>>unsigned int regno = mws->start_regno;
>>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>dead_debug_insert_temp (debug, regno, insn, 
>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>  
>>>if (REG_DEAD_DEBUGGING)
>>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>> df_mw_hardreg *mws,
>>> if (!bitmap_bit_p (live, r)
>>> && !bitmap_bit_p (artificial_uses, r))
>>>   {
>>> -   df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>> +  // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>>> if (REG_DEAD_DEBUGGING)
>>>   df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>> || bitmap_bit_p (artific

Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal
Hello Sam:

On 14/02/24 10:50 pm, Sam James wrote:
> 
> Ajit Agarwal  writes:
> 
>> Hello Richard:
>>
>>
>> On 14/02/24 4:03 pm, Richard Sandiford wrote:
>>> Hi,
>>>
>>> Thanks for working on this.
>>>
>>> You posted a version of this patch on Sunday too.  If you need to repost
>>> to fix bugs or make other improvements, could you describe the changes
>>> that you've made since the previous version?  It makes things easier
>>> to follow.
>>
>> Sure. Sorry, I forgot to add that.
>>
>>>
>>> Also, sorry for starting with a meta discussion about reviews, but
>>> there are multiple types of review comment, including:
>>>
>>> (1) Suggestions for changes that are worded as suggestions.
>>>
>>> (2) Suggestions for changes that are worded as questions ("Wouldn't it be
>>> better to do X?", etc).
>>>
>>> (3) Questions asking for an explanation or for more information.
>>>
>>> Just sending a new patch makes sense when the previous review comments
>>> were all like (1), and arguably also (1)+(2).  But Alex's previous review
>>> included (3) as well.  Could you go back and respond to his questions there?
>>> It would help understand some of the design choices.
>>>
>>
>> I have responded to Alex's comments on the previous patches.
>> I have incorporated all of his comments in this patch.
>>
>>  
>>> A natural starting point when reviewing a patch like this is to diff
>>> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc.  This shows
>>> many of the kind of changes that I'd expect.  But it also seems to include
>>> some code reordering, such as putting fuse_pair after try_fuse_pair.
>>> If some reordering is necessary, could you try to organise the patch as
>>> a series in which the reordering is a separate step?  It's a bit hard
>>> to review at the moment.  (Reordering for cosmetic reasons is also OK,
>>> but again please separate it out for ease of review.)
>>>
>>> Maybe one way of making the review easier would be to split the aarch64
>>> pass into the "target-dependent" and "target-independent" pieces
>>> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
>>> (as separate patches) move the target-independent pieces outside
>>> config/aarch64.
>>>
>> Sure I will do that.
>>
>>> The patch includes:
>>>
>>>>* emit-rtl.cc: Modify ge with gt on PolyINT data structure.
>>>>* dce.cc: Add changes not to delete the load store pair.
>>>>* rtl-ssa/changes.cc: Modified assert code.
>>>>* var-tracking.cc: Modified assert code.
>>>>* df-problems.cc: Not to generate REG_UNUSED for multi
>>>>word registers that is required for rs6000 target.
>>>
>>> Please submit these separately, as independent preparatory patches,
>>> with an explanation for why they're needed & correct.  But:
>>>
>> Sure I will do that.
>>
>>>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>>>> index 88ee0dd67fc..a8d0ee7c4db 100644
>>>> --- a/gcc/df-problems.cc
>>>> +++ b/gcc/df-problems.cc
>>>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>>> df_mw_hardreg *mws,
>>>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>>>  {
>>>>unsigned int regno = mws->start_regno;
>>>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>>dead_debug_insert_temp (debug, regno, insn, 
>>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>>  
>>>>if (REG_DEAD_DEBUGGING)
>>>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>>> df_mw_hardreg *mws,
>>>>if (!bitmap_bit_p (live, r)
>>>>&& !bitmap_bit_p (artificial_uses, r))
>>>>  {
>>>> -  df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>>> + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
> 
> I just want to emphasise here:
> a) adding commented-out code is very unusual (I know a reviewer picked
> up on that already);
> 
> b) if you are going to comment something out as a hack / you need help,
> please *clearly flag that* (apologies if I missed it), and possib

Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal



On 14/02/24 10:56 pm, Richard Sandiford wrote:
> Ajit Agarwal  writes:
>>>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>>>> index 88ee0dd67fc..a8d0ee7c4db 100644
>>>> --- a/gcc/df-problems.cc
>>>> +++ b/gcc/df-problems.cc
>>>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>>> df_mw_hardreg *mws,
>>>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>>>  {
>>>>unsigned int regno = mws->start_regno;
>>>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>>dead_debug_insert_temp (debug, regno, insn, 
>>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>>  
>>>>if (REG_DEAD_DEBUGGING)
>>>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>>> df_mw_hardreg *mws,
>>>>if (!bitmap_bit_p (live, r)
>>>>&& !bitmap_bit_p (artificial_uses, r))
>>>>  {
>>>> -  df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>>> + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>>>dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG);
>>>>if (REG_DEAD_DEBUGGING)
>>>>  df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>>>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>>>|| bitmap_bit_p (artificial_uses, dregno)
>>>>|| df_ignore_stack_reg (dregno)))
>>>>  {
>>>> -  rtx reg = (DF_REF_LOC (def))
>>>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>>> -  df_set_note (REG_UNUSED, insn, reg);
>>>> +  //rtx reg = (DF_REF_LOC (def))
>>>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>>> +  //df_set_note (REG_UNUSED, insn, reg);
>>>>dead_debug_insert_temp (debug, dregno, insn, 
>>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>>if (REG_DEAD_DEBUGGING)
>>>>df_print_note ("adding 3: ", insn, REG_NOTES (insn));
>>>
>>> I don't think this can be right.  The last hunk of the var-tracking.cc
>>> patch also seems to be reverting a correct change.
>>>
>>
>> We generate sequential registers using (subreg V16QI (reg OOmode) 16)
>> and (reg OOmode 0)
>> where OOmode is 256 bit and V16QI is 128 bits in order to generate
>> sequential register pair.
> 
> OK.  As I mentioned in the other message I just sent, it seems pretty
> standard to use a 256-bit mode to represent a pair of 128-bit values.
> In that case:
> 
> - (reg:OO R) always refers to both registers in the pair, and any assignment
>   to it modifies both registers in the pair
> 
> - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can
>   be modified independently of the second register
> 
> - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can
>   be modified independently of the first register
> 
> Is that how you're using it?
> 

This is how I use it.
(insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])

(insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)

to generate sequential registers. With the above, sequential registers
are generated by RA.
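
The register-pair view discussed here can be pictured with a small standalone model. This is only an illustrative sketch in plain C++, not GCC code: `PairReg`, `Half`, `write_half`, and `read_half` are hypothetical names, and the 32-byte buffer merely stands in for an OO-mode pseudo whose two 16-byte halves are addressed by byte offsets 0 and 16.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 256-bit (32-byte) OO-mode pseudo register that
// the register allocator would map onto two consecutive 128-bit registers.
struct PairReg
{
  static constexpr std::size_t half_size = 16;  // bytes in one 128-bit value
  std::array<std::uint8_t, 2 * half_size> bytes{};
};

using Half = std::array<std::uint8_t, PairReg::half_size>;

// Models a store to (subreg:V16QI (reg:OO R) OFFSET): only the addressed
// half changes, the other half keeps its previous contents.
inline void write_half (PairReg &r, std::size_t offset, const Half &value)
{
  assert (offset == 0 || offset == PairReg::half_size);
  std::memcpy (r.bytes.data () + offset, value.data (), value.size ());
}

// Models a read of (subreg:V16QI (reg:OO R) OFFSET).
inline Half read_half (const PairReg &r, std::size_t offset)
{
  assert (offset == 0 || offset == PairReg::half_size);
  Half out{};
  std::memcpy (out.data (), r.bytes.data () + offset, out.size ());
  return out;
}
```

In this model a write at offset 16 leaves the half at offset 0 untouched, which mirrors the statement that each subreg can be modified independently of the other.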


> One thing to be wary of is that it isn't possible to assign to two
> subregs of the same reg in a single instruction (at least AFAIK).
> So any operation that wants to store to both registers in the pair
> must store to (reg:OO R) itself, not to the two subregs.
> 
>> If I keep the above REG_UNUSED notes, ira generates REG_UNUSED, and the
>> cprop_hardreg and dce passes delete the store pairs, so we get
>> incorrect code.
>>
>> With the REG_UNUSED notes commented out, they are not generated, we get
>> the correct store pair fusion, and cprop_hardreg and dce don't delete
>> them.
>>
>> Please let me know if there are better ways to address this instead of
>> commenting out the above generation of REG_UNUSED notes.
> 
> Could you quote an example rtl sequence that includes incorrect notes?
> It might help to understand the problem a bit more.
>

Here is the rtl code:

(insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)
(plus:V2DI (reg:V2DI 153 [ vect__4.10_72 ])
(reg:V2DI 154 [ _63 ]))) 
"/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18
 1706 {addv2di3}
 (expr_list:REG_DEAD (reg:V2

Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Ajit Agarwal
Hello Richard:


On 14/02/24 10:45 pm, Richard Sandiford wrote:
> Ajit Agarwal  writes:
>>>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>>>> index 1856fa4884f..ffc47a6eaa0 100644
>>>> --- a/gcc/emit-rtl.cc
>>>> +++ b/gcc/emit-rtl.cc
>>>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode 
>>>> imode,
>>>>  return false;
>>>>  
>>>>/* The subreg offset cannot be outside the inner object.  */
>>>> -  if (maybe_ge (offset, isize))
>>>> +  if (maybe_gt (offset, isize))
>>>>  return false;
>>>
>>> Can you explain why this change is needed?
>>>
>>
>> This is required in rs6000 target where we generate the subreg
>> with offset 16 from OO mode (256 bit) to 128 bit vector modes.
>> Otherwise it segfaults.
> 
> Could you go into more detail?  Why does that subreg lead to a segfault?
> 
> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
> standard.  AArch64 uses this too for its vector load/store pairs (and for
> structure pairs more generally).
> 

If we want to create (subreg V16QI (reg OO R) 16), imode is V16QI (isize = 16)
and offset is 16. maybe_ge (offset, isize) returns true and validate_subreg
returns false.

Hence the above subreg is not generated and we generate incorrect code.

That's why I have changed the check to maybe_gt (offset, isize).

Thanks & Regards
Ajit
> Thanks,
> Richard


Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target

2024-02-14 Thread Ajit Agarwal
Hello Richard:

On 15/02/24 1:14 am, Richard Sandiford wrote:
> Ajit Agarwal  writes:
>> On 14/02/24 10:56 pm, Richard Sandiford wrote:
>>> Ajit Agarwal  writes:
>>>>>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
>>>>>> index 88ee0dd67fc..a8d0ee7c4db 100644
>>>>>> --- a/gcc/df-problems.cc
>>>>>> +++ b/gcc/df-problems.cc
>>>>>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>>>>> df_mw_hardreg *mws,
>>>>>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses))
>>>>>>  {
>>>>>>unsigned int regno = mws->start_regno;
>>>>>> -  df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>>>> +  //df_set_note (REG_UNUSED, insn, mws->mw_reg);
>>>>>>dead_debug_insert_temp (debug, regno, insn, 
>>>>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>>>>  
>>>>>>if (REG_DEAD_DEBUGGING)
>>>>>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct 
>>>>>> df_mw_hardreg *mws,
>>>>>>  if (!bitmap_bit_p (live, r)
>>>>>>  && !bitmap_bit_p (artificial_uses, r))
>>>>>>{
>>>>>> -df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>>>>> +   // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]);
>>>>>>  dead_debug_insert_temp (debug, r, insn, 
>>>>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>>>>  if (REG_DEAD_DEBUGGING)
>>>>>>df_print_note ("adding 2: ", insn, REG_NOTES (insn));
>>>>>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def,
>>>>>>  || bitmap_bit_p (artificial_uses, dregno)
>>>>>>  || df_ignore_stack_reg (dregno)))
>>>>>>  {
>>>>>> -  rtx reg = (DF_REF_LOC (def))
>>>>>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>>>>> -  df_set_note (REG_UNUSED, insn, reg);
>>>>>> +  //rtx reg = (DF_REF_LOC (def))
>>>>>> +  //  ? *DF_REF_REAL_LOC (def): DF_REF_REG (def);
>>>>>> +  //df_set_note (REG_UNUSED, insn, reg);
>>>>>>dead_debug_insert_temp (debug, dregno, insn, 
>>>>>> DEBUG_TEMP_AFTER_WITH_REG);
>>>>>>if (REG_DEAD_DEBUGGING)
>>>>>>  df_print_note ("adding 3: ", insn, REG_NOTES (insn));
>>>>>
>>>>> I don't think this can be right.  The last hunk of the var-tracking.cc
>>>>> patch also seems to be reverting a correct change.
>>>>>
>>>>
>>>> We generate sequential registers using (subreg V16QI (reg OOmode) 16)
>>>> and (reg OOmode 0)
>>>> where OOmode is 256 bit and V16QI is 128 bits in order to generate
>>>> sequential register pair.
>>>
>>> OK.  As I mentioned in the other message I just sent, it seems pretty
>>> standard to use a 256-bit mode to represent a pair of 128-bit values.
>>> In that case:
>>>
>>> - (reg:OO R) always refers to both registers in the pair, and any assignment
>>>   to it modifies both registers in the pair
>>>
>>> - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can
>>>   be modified independently of the second register
>>>
>>> - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can
>>>   be modified independently of the first register
>>>
>>> Is that how you're using it?
>>>
>>
>> This is how I use it.
>> (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ])
>>
>> (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16)
>>
>> to generate sequential registers. With the above sequential registers
>> are generated by RA.
>>
>>
>>> One thing to be wary of is that it isn't possible to assign to two
>>> subregs of the same reg in a single instruction (at least AFAIK).
>>> So any operation that wants to store to both registers in the pair
>>> must store to (reg:OO R) itself, not to the two subregs.
>>>
>>>> If I keep the above REG_UNUSED notes, ira generates
>>>> REG_UNUSED, and the cprop_hardreg and dce passes delete the store pairs and
>

Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-02-14 Thread Ajit Agarwal
Hello Richard:

On 15/02/24 2:21 am, Richard Sandiford wrote:
> Ajit Agarwal  writes:
>> Hello Richard:
>>
>>
>> On 14/02/24 10:45 pm, Richard Sandiford wrote:
>>> Ajit Agarwal  writes:
>>>>>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
>>>>>> index 1856fa4884f..ffc47a6eaa0 100644
>>>>>> --- a/gcc/emit-rtl.cc
>>>>>> +++ b/gcc/emit-rtl.cc
>>>>>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode 
>>>>>> imode,
>>>>>>  return false;
>>>>>>  
>>>>>>/* The subreg offset cannot be outside the inner object.  */
>>>>>> -  if (maybe_ge (offset, isize))
>>>>>> +  if (maybe_gt (offset, isize))
>>>>>>  return false;
>>>>>
>>>>> Can you explain why this change is needed?
>>>>>
>>>>
>>>> This is required in rs6000 target where we generate the subreg
>>>> with offset 16 from OO mode (256 bit) to 128 bit vector modes.
>>>> Otherwise it segfaults.
>>>
>>> Could you go into more detail?  Why does that subreg lead to a segfault?
>>>
>>> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty
>>> standard.  AArch64 uses this too for its vector load/store pairs (and for
>>> structure pairs more generally).
>>>
>>
>> If we want to create (subreg V16QI (reg OO R) 16), imode is V16QI (isize = 16)
>> and offset is 16. maybe_ge (offset, isize) returns true and validate_subreg
>> returns false.
> 
> isize is supposed to be the size of the "inner mode", which in this
> case is OO.  Since OO is a 32-byte mode, I would expect isize to be 32
> rather than 16.  Is that not the case?
> 
> Or is the problem that something is trying to take a subreg of a subreg?
> If so, that is only valid in certain cases.  It isn't for example valid
> to use a subreg operation to move between (subreg:V16QI (reg:OO X) 16)
> and (subreg:V16QI (reg:OO X) 0).
>

The above changes are not required. The emit-rtl.cc changes are no longer
needed because I have fixed this in the rs6000 target fusion code by making
the modes of src and dest the same for the SET rtx, as you suggested for the
REG_UNUSED issues.
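
For reference, the bound check under discussion can be sketched in isolation. This is a simplified, hypothetical stand-in rather than the real `validate_subreg`: sizes are plain byte counts instead of `poly_uint64` values, `subreg_offset_ok` is an invented name, and the final "fits entirely" test is an extra assumption added for the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified version of the offset check quoted from
// validate_subreg.  OFFSET is the subreg byte offset, OSIZE the size of the
// outer (narrower) mode and ISIZE the size of the inner (wider) mode, all
// in bytes.  For a 256-bit pair mode ISIZE is 32.
inline bool subreg_offset_ok (std::uint64_t offset, std::uint64_t osize,
                              std::uint64_t isize)
{
  // The subreg offset cannot be outside the inner object.
  if (offset >= isize)
    return false;

  // Extra sanity check for this sketch only: the outer value must lie
  // entirely inside the inner object.
  return offset + osize <= isize;
}
```

With `isize` equal to 32, offset 16 passes the unmodified `>=` check. The offset is rejected only when `isize` is 16, that is, when the inner mode seen by the check is itself a 128-bit mode rather than the 256-bit pair mode.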

Thanks a lot for your help.

Thanks & Regards
Ajit
 
> Thanks,
> Richard


[PATCH 0/1 V2] Target independent code for common infrastructure of load,store fusion for rs6000 and aarch64 target.

2024-02-15 Thread Ajit Agarwal
Hello Richard:

As per your suggestion I have divided the patch into target independent
and target dependent for aarch64 target. I kept aarch64-ldp-fusion same
and did not change that.

Common infrastructure of load store pair fusion is divided into
target independent and target dependent code for rs6000 and aarch64
target.

Target independent code is structured in the following files.
gcc/pair-fusion-base.h
gcc/pair-fusion-common.cc
gcc/pair-fusion.cc

Target independent code is the generic code, with pure virtual
functions to interface between target independent and target dependent
code.
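
The split described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names, the two hooks, and the offset range of the toy target are hypothetical and are not taken from pair-fusion-base.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical generic driver.  The pure virtual functions are the only
// points where target knowledge enters.
class pair_fusion_sketch
{
public:
  virtual ~pair_fusion_sketch () = default;

  // Target hooks (pure virtual): size in bytes of one access, and whether
  // the lower offset of a pair can be encoded by the target.
  virtual unsigned access_size () const = 0;
  virtual bool offset_in_range (std::int64_t offset) const = 0;

  // Target-independent logic: two accesses from the same base can be
  // paired when they are exactly adjacent and the lower offset is
  // encodable.
  bool can_pair (std::int64_t off1, std::int64_t off2) const
  {
    std::int64_t lo = off1 < off2 ? off1 : off2;
    std::int64_t hi = off1 < off2 ? off2 : off1;
    return hi - lo == static_cast<std::int64_t> (access_size ())
           && offset_in_range (lo);
  }
};

// Hypothetical target: 16-byte accesses with an arbitrary, made-up offset
// range that must be a multiple of the access size.
class toy_target_fusion final : public pair_fusion_sketch
{
public:
  unsigned access_size () const override { return 16; }

  bool offset_in_range (std::int64_t offset) const override
  {
    return offset >= -1024 && offset <= 1008 && offset % 16 == 0;
  }
};
```

The design point is that `can_pair` never mentions a specific target. A second target would only add another derived class that overrides the hooks.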

Thanks & Regards
Ajit

Target independent code for common infrastructure of load
store fusion for rs6000 and aarch64 target.

Common infrastructure of load store pair fusion is divided into
target independent and target dependent code for rs6000 and aarch64
target.

Target independent code is structured in the following files.
gcc/pair-fusion-base.h
gcc/pair-fusion-common.cc
gcc/pair-fusion.cc

Target independent code is the generic code, with pure virtual
functions to interface between target independent and target dependent
code.

2024-02-15  Ajit Kumar Agarwal  

gcc/ChangeLog:

* pair-fusion-base.h: Generic header code for load store fusion
that can be shared across different architectures.
* pair-fusion-common.cc: Generic source code for load store
fusion that can be shared across different architectures.
* pair-fusion.cc: Generic implementation of pair_fusion class
defined in pair-fusion-base.h
* Makefile.in: Add new executable pair-fusion.o and
pair-fusion-common.o.
---
 gcc/Makefile.in   |2 +
 gcc/pair-fusion-base.h|  586 ++
 gcc/pair-fusion-common.cc | 1202 
 gcc/pair-fusion.cc| 1225 +
 4 files changed, 3015 insertions(+)
 create mode 100644 gcc/pair-fusion-base.h
 create mode 100644 gcc/pair-fusion-common.cc
 create mode 100644 gcc/pair-fusion.cc

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a74761b7ab3..df5061ddfe7 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1563,6 +1563,8 @@ OBJS = \
ipa-strub.o \
ipa.o \
ira.o \
+   pair-fusion-common.o \
+   pair-fusion.o \
ira-build.o \
ira-costs.o \
ira-conflicts.o \
diff --git a/gcc/pair-fusion-base.h b/gcc/pair-fusion-base.h
new file mode 100644
index 000..fdaf4fd743d
--- /dev/null
+++ b/gcc/pair-fusion-base.h
@@ -0,0 +1,586 @@
+// Generic code for Pair MEM  fusion optimization pass.
+// Copyright (C) 2023-2024 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it
+// under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 3, or (at your option)
+// any later version.
+//
+// GCC is distributed in the hope that it will be useful, but
+// WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+// General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+#ifndef GCC_PAIR_FUSION_H
+#define GCC_PAIR_FUSION_H
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-iter.h"
+#include "rtl-ssa.h"
+#include "cfgcleanup.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "tree-dfa.h"
+#include "fold-const.h"
+#include "tree-hash-traits.h"
+#include "print-tree.h"
+#include "insn-attr.h"
+using namespace rtl_ssa;
+// We pack these fields (load_p, fpsimd_p, and size) into an integer
+// (LFS) which we use as part of the key into the main hash tables.
+//
+// The idea is that we group candidates together only if they agree on
+// the fields below.  Candidates that disagree on any of these
+// properties shouldn't be merged together.
+struct lfs_fields
+{
+  bool load_p;
+  bool fpsimd_p;
+  unsigned size;
+};
+
+using insn_list_t = std::list<insn_info *>;
+using insn_iter_t = insn_list_t::iterator;
+
+// Information about the accesses at a given offset from a particular
+// base.  Stored in an access_group, see below.
+struct access_record
+{
+  poly_int64 offset;
+  std::list<insn_info *> cand_insns;
+  std::list<access_record>::iterator place;
+
+  access_record (poly_int64 off) : offset (off) {}
+};
+
+// A group of accesses where adjacent accesses could be ldp/stp
+// candidates.  The splay tree supports efficient insertion,
+// while the list supports efficient iteration.
+struct access_group
+{
+  splay_tree<access_record *> tree;
+  std::list<access_record> list;
+
+  template<typename Alloc>
+  inline void track (Alloc node_alloc, poly_int64 offset, insn_info *insn
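
The comment on `lfs_fields` in the header above says the three fields are packed into one integer used as part of a hash key. A minimal sketch of such a packing follows. The bit layout, the restriction to sizes 4, 8 and 16, and the names ending in `_sketch` are assumptions for illustration; the encoding itself is not visible in the text quoted here.

```cpp
#include <cassert>

// Hypothetical copy of the three fields described in the header comment.
struct lfs_fields_sketch
{
  bool load_p;
  bool fpsimd_p;
  unsigned size;   // access size in bytes; assumed to be 4, 8 or 16
};

// Returns log2 of SIZE for the sizes assumed above, or -1 otherwise.
inline int size_log2 (unsigned size)
{
  switch (size)
    {
    case 4: return 2;
    case 8: return 3;
    case 16: return 4;
    default: return -1;
    }
}

// Assumed layout: bit 3 = load_p, bit 2 = fpsimd_p, bits 1..0 = log2(size) - 2.
inline int encode_lfs_sketch (lfs_fields_sketch f)
{
  int log2 = size_log2 (f.size);
  assert (log2 >= 2 && log2 <= 4);
  return (int (f.load_p) << 3) | (int (f.fpsimd_p) << 2) | (log2 - 2);
}

// Inverse of encode_lfs_sketch.
inline lfs_fields_sketch decode_lfs_sketch (int lfs)
{
  return { (lfs & 8) != 0, (lfs & 4) != 0, 1u << ((lfs & 3) + 2) };
}
```

Because every field occupies its own bits, two candidates that disagree on any field get different keys and therefore never land in the same group.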

Re: [PATCH 0/1 V2] Target independent code for common infrastructure of load,store fusion for rs6000 and aarch64 target.

2024-02-15 Thread Ajit Agarwal
Hello Alex:

On 15/02/24 10:12 pm, Alex Coplan wrote:
> On 15/02/2024 21:24, Ajit Agarwal wrote:
>> Hello Richard:
>>
>> As per your suggestion I have divided the patch into target independent
>> and target dependent for aarch64 target. I kept aarch64-ldp-fusion same
>> and did not change that.
> 
> I'm not sure this was what Richard suggested doing, though.
> He said (from
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645545.html):
> 
>> Maybe one way of making the review easier would be to split the aarch64
>> pass into the "target-dependent" and "target-independent" pieces
>> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then
>> (as separate patches) move the target-independent pieces outside
>> config/aarch64.
> 
> but this adds the target-independent parts separately instead of
> splitting it out within config/aarch64 (which I agree should make the
> review easier).

I am sorry, I didn't follow. Could you kindly elaborate on this?

Thanks & Regards
Ajit
> 
> Thanks,
> Alex
> 
>>
>> Common infrastructure of load store pair fusion is divided into
>> target independent and target dependent code for rs6000 and aarch64
>> target.
>>
>> Target independent code is structured in the following files.
>> gcc/pair-fusion-base.h
>> gcc/pair-fusion-common.cc
>> gcc/pair-fusion.cc
>>
>> Target independent code is the generic code, with pure virtual
>> functions to interface between target independent and target dependent
>> code.
>>
>> Thanks & Regards
>> Ajit
>>
>> Target independent code for common infrastructure of load
>> store fusion for rs6000 and aarch64 target.
>>
>> Common infrastructure of load store pair fusion is divided into
>> target independent and target dependent code for rs6000 and aarch64
>> target.
>>
>> Target independent code is structured in the following files.
>> gcc/pair-fusion-base.h
>> gcc/pair-fusion-common.cc
>> gcc/pair-fusion.cc
>>
>> Target independent code is the generic code, with pure virtual
>> functions to interface between target independent and target dependent
>> code.
>>
>> 2024-02-15  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * pair-fusion-base.h: Generic header code for load store fusion
>>  that can be shared across different architectures.
>>  * pair-fusion-common.cc: Generic source code for load store
>>  fusion that can be shared across different architectures.
>>  * pair-fusion.cc: Generic implementation of pair_fusion class
>>  defined in pair-fusion-base.h
>>  * Makefile.in: Add new executable pair-fusion.o and
>>  pair-fusion-common.o.
>> ---
>>  gcc/Makefile.in   |2 +
>>  gcc/pair-fusion-base.h|  586 ++
>>  gcc/pair-fusion-common.cc | 1202 
>>  gcc/pair-fusion.cc| 1225 +
>>  4 files changed, 3015 insertions(+)
>>  create mode 100644 gcc/pair-fusion-base.h
>>  create mode 100644 gcc/pair-fusion-common.cc
>>  create mode 100644 gcc/pair-fusion.cc
>>
>> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
>> index a74761b7ab3..df5061ddfe7 100644
>> --- a/gcc/Makefile.in
>> +++ b/gcc/Makefile.in
>> @@ -1563,6 +1563,8 @@ OBJS = \
>>  ipa-strub.o \
>>  ipa.o \
>>  ira.o \
>> +pair-fusion-common.o \
>> +pair-fusion.o \
>>  ira-build.o \
>>  ira-costs.o \
>>  ira-conflicts.o \
>> diff --git a/gcc/pair-fusion-base.h b/gcc/pair-fusion-base.h
>> new file mode 100644
>> index 000..fdaf4fd743d
>> --- /dev/null
>> +++ b/gcc/pair-fusion-base.h
>> @@ -0,0 +1,586 @@
>> +// Generic code for Pair MEM  fusion optimization pass.
>> +// Copyright (C) 2023-2024 Free Software Foundation, Inc.
>> +//
>> +// This file is part of GCC.
>> +//
>> +// GCC is free software; you can redistribute it and/or modify it
>> +// under the terms of the GNU General Public License as published by
>> +// the Free Software Foundation; either version 3, or (at your option)
>> +// any later version.
>> +//
>> +// GCC is distributed in the hope that it will be useful, but
>> +// WITHOUT ANY WARRANTY; without even the implied warranty of
>> +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +// General Public License for more details.
>> +//
>> +// You should have received a copy of the GNU General Public License
>> +// a

Re: [PATCH 0/1 V2] Target independent code for common infrastructure of load,store fusion for rs6000 and aarch64 target.

2024-02-15 Thread Ajit Agarwal



On 15/02/24 10:43 pm, Alex Coplan wrote:
> So IIUC Richard was suggesting splitting into target-independent and
> target-dependent pieces within aarch64-ldp-fusion.cc as a first step,
> i.e. you introduce the abstractions (virtual functions) needed within
> that file.  That should hopefully be a relatively small diff.
> 
> Then in a separate patch you can move the target-independent parts out of
> config/aarch64.
> 
> Does that make sense?

Thanks a lot for explaining this. Sure I will do that and send the patch as per
above.

Thanks & Regards
Ajit


[PATCH 0/2 V2] aarch64: Place target independent and dependent code in one file.

2024-02-15 Thread Ajit Agarwal
Hello Alex/Richard:

I have placed the target independent and target dependent code in
aarch64-ldp-fusion for load store fusion.

Common infrastructure of load store pair fusion is divided into
target independent and target dependent code.

Target independent code is the generic code, with pure virtual
functions to interface between target independent and target dependent
code.

Target dependent code is the implementation of the pure virtual
functions for the aarch64 target and the calls into the target
independent code.

Bootstrapped on aarch64-linux-gnu.

Thanks & Regards
Ajit


aarch64: Place target independent and dependent code in one file.

Common infrastructure of load store pair fusion is divided into
target independent and target dependent code.

Target independent code is the generic code, with pure virtual
functions to interface between target independent and target dependent
code.

Target dependent code is the implementation of the pure virtual
functions for the aarch64 target and the calls into the target
independent code.

2024-02-15  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/aarch64/aarch64-ldp-fusion.cc: Place target
independent and dependent code.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 3513 --
 1 file changed, 1842 insertions(+), 1671 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc 
b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 22ed95eb743..0ab842e2bbb 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -17,6 +17,7 @@
 // along with GCC; see the file COPYING3.  If not see
 // <http://www.gnu.org/licenses/>.
 
+
 #define INCLUDE_ALGORITHM
 #define INCLUDE_FUNCTIONAL
 #define INCLUDE_LIST
@@ -37,13 +38,12 @@
 #include "tree-hash-traits.h"
 #include "print-tree.h"
 #include "insn-attr.h"
-
 using namespace rtl_ssa;
 
-static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
-static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
-static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
-static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
+static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
+static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << (PAIR_MEM_IMM_BITS - 1));
+static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
+static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;
 
 // We pack these fields (load_p, fpsimd_p, and size) into an integer
 // (LFS) which we use as part of the key into the main hash tables.
@@ -138,8 +138,144 @@ struct alt_base
   poly_int64 offset;
 };
 
+// Class that implements a state machine for building the changes needed to form
+// a store pair instruction.  This allows us to easily build the changes in
+// program order, as required by rtl-ssa.
+struct stp_change_builder
+{
+  enum class state
+  {
+FIRST,
+INSERT,
+FIXUP_USE,
+LAST,
+DONE
+  };
+
+  enum class action
+  {
+TOMBSTONE,
+CHANGE,
+INSERT,
+FIXUP_USE
+  };
+
+  struct change
+  {
+action type;
+insn_info *insn;
+  };
+
+  bool done () const { return m_state == state::DONE; }
+
+  stp_change_builder (insn_info *insns[2],
+ insn_info *repurpose,
+ insn_info *dest)
+: m_state (state::FIRST), m_insns { insns[0], insns[1] },
+  m_repurpose (repurpose), m_dest (dest), m_use (nullptr) {}
+
+  change get_change () const
+  {
+switch (m_state)
+  {
+  case state::FIRST:
+   return {
+ m_insns[0] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
+ m_insns[0]
+   };
+  case state::LAST:
+   return {
+ m_insns[1] == m_repurpose ? action::CHANGE : action::TOMBSTONE,
+ m_insns[1]
+   };
+  case state::INSERT:
+   return { action::INSERT, m_dest };
+  case state::FIXUP_USE:
+   return { action::FIXUP_USE, m_use->insn () };
+  case state::DONE:
+   break;
+  }
+
+gcc_unreachable ();
+  }
+
+  // Transition to the next state.
+  void advance ()
+  {
+switch (m_state)
+  {
+  case state::FIRST:
+   if (m_repurpose)
+ m_state = state::LAST;
+   else
+ m_state = state::INSERT;
+   break;
+  case state::INSERT:
+  {
+   def_info *def = memory_access (m_insns[0]->defs ());
+   while (*def->next_def ()->insn () <= *m_dest)
+ def = def->next_def ();
+
+   // Now we know DEF feeds the insertion point for the new stp.
+   // Look for any uses of DEF that will consume the new stp.
+   gcc_assert (*def->insn () <= *m_dest
+   && *def->next_def ()->insn () > *m_dest);
+
+   auto set = as_a<set_info *> (def);
+   for (auto use : set->nondebug_insn_uses ())
+ if (*use->insn () > *m_dest)
+   {
+ m_use = use;
+ break;
+   }
+
+   if (m_use)
+ m_state = state::FIXUP_USE;
+   else
+ m_state = state::LAST;
+   break;
+  }
+
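
The order in which `stp_change_builder` visits its states can be summarised with a small standalone sketch. The transitions out of FIRST and INSERT follow the hunk above. The transitions out of FIXUP_USE and LAST are not visible in the truncated hunk and are assumptions here, as are the names `next_state` and `visit_order` and the use of two booleans in place of the `m_repurpose` and `m_use` pointers.

```cpp
#include <cassert>
#include <vector>

enum class state { FIRST, INSERT, FIXUP_USE, LAST, DONE };

// REPURPOSE models "one of the two stores is reused for the pair";
// HAS_LATER_USE models "a use after the insertion point needs fixing up".
inline state next_state (state s, bool repurpose, bool has_later_use)
{
  switch (s)
    {
    case state::FIRST:
      return repurpose ? state::LAST : state::INSERT;
    case state::INSERT:
      return has_later_use ? state::FIXUP_USE : state::LAST;
    case state::FIXUP_USE:   // assumed: not shown in the truncated hunk
      return state::LAST;
    case state::LAST:        // assumed: not shown in the truncated hunk
      return state::DONE;
    case state::DONE:
      break;
    }
  return state::DONE;
}

// Collects the states visited before DONE, in order.
inline std::vector<state> visit_order (bool repurpose, bool has_later_use)
{
  std::vector<state> order;
  for (state s = state::FIRST; s != state::DONE;
       s = next_state (s, repurpose, has_later_use))
    order.push_back (s);
  return order;
}
```

Under these assumptions every path starts at FIRST and ends at LAST, so the changes come out in program order, which is what the comment on the class says rtl-ssa requires.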

[PATCH 2/2 V2] rs6000: Load store fusion for rs6000 target using common infrastructure

2024-02-15 Thread Ajit Agarwal
Hello All:

This patch implements load store fusion for the rs6000 target using the
common infrastructure.

Common infrastructure using generic code for load store fusion of the rs6000
target.

Generic code is implemented and defined so that it can be used in the target
specific code for the aarch64 and rs6000 targets.

Generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

Target specific code that uses the generic code is added in
rs6000-vecload-fusion.cc.


Bootstrapped and regtested on powerpc64-linux-gnu.

Also ran spec 2017 benchmarks.

Thanks & Regards
Ajit



rs6000: Load store fusion for rs6000 target using common infrastructure

Common infrastructure using generic code for load store fusion of rs6000
target.

Generic code is implemented and defined so that it can be used in the target
specific code for the aarch64 and rs6000 targets.

Generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

Target specific code that uses the generic code is added in
rs6000-vecload-fusion.cc.

2024-02-16  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: New vecload fusion pass
before pass_early_remat.
* config/rs6000/rs6000-vecload-fusion.cc: Add new pass.
Add target specific implementation using generic code.
* config.gcc: Add new executable.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
fusion pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload fusion
pass.
* config/rs6000/t-rs6000: Add new rule.
* pair-fusion-base.h: Generic header code for load store fusion
that can be shared across different architectures.
* pair-fusion-common.cc: Generic source code for load store
fusion that can be shared across different architectures.
* pair-fusion.cc: Generic implementation of pair_fusion class
defined in pair-fusion-base.h
* Makefile.in: Add new executable pair-fusion.o and
pair-fusion-common.o.
* rtl-ssa/changes.cc: Modified assert code.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload-fusion.C: New test.
* g++.target/powerpc/vecload-fusion_1.C: New test.
* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/Makefile.in   |2 +
 gcc/config.gcc|4 +-
 gcc/config/rs6000/rs6000-passes.def   |4 +-
 gcc/config/rs6000/rs6000-protos.h |1 +
 gcc/config/rs6000/rs6000-vecload-fusion.cc|  699 ++
 gcc/config/rs6000/rs6000.cc   |1 +
 gcc/config/rs6000/t-rs6000|5 +
 gcc/pair-fusion-base.h|  586 
 gcc/pair-fusion-common.cc | 1202 
 gcc/pair-fusion.cc| 1225 +
 gcc/rtl-ssa/changes.cc|2 +-
 .../g++.target/powerpc/vecload-fusion.C   |   15 +
 .../g++.target/powerpc/vecload-fusion_1.C |   22 +
 .../gcc.target/powerpc/mma-builtin-1.c|4 +-
 14 files changed, 3766 insertions(+), 6 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-fusion.cc
 create mode 100644 gcc/pair-fusion-base.h
 create mode 100644 gcc/pair-fusion-common.cc
 create mode 100644 gcc/pair-fusion.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload-fusion.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload-fusion_1.C

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a74761b7ab3..df5061ddfe7 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1563,6 +1563,8 @@ OBJS = \
ipa-strub.o \
ipa.o \
ira.o \
+   pair-fusion-common.o \
+   pair-fusion.o \
ira-build.o \
ira-costs.o \
ira-conflicts.o \
diff --git a/gcc/config.gcc b/gcc/config.gcc
index a0f9c672308..d696d826df8 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -518,7 +518,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-fusion.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -555,7 +555,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt 
rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
rs6000-vecload-fusion.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles 
\$(srcdir)/config/rs6000/rs6000-logue.cc 
\$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles 
\

[PATCH V3 2/2] rs6000: Load store fusion for rs6000 target using common infrastructure

2024-02-19 Thread Ajit Agarwal
Hello All:

Changes in V3 since V2 patch.

The following changes are made in this patch.

a) Remove the commented-out assert code in rtl-ssa/changes.cc.
b) Handle such code in rs6000-vecload-fusion.cc.

Same as V2:

Common infrastructure using generic code for load store fusion of rs6000
target.

Generic code is implemented and defined so that it can be used in the target
specific code for the aarch64 and rs6000 targets.

Generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

Code is implemented with pure virtual functions to interface with target
code.

Target specific code that uses the generic code is added in
rs6000-vecload-fusion.cc.

Bootstrapped and regtested in powerpc64-linux-gnu.

Also ran spec cpu2017 benchmarks.

rs6000: Load store fusion for rs6000 target using common infrastructure

Common infrastructure using generic code for load store fusion of rs6000
target.

Generic code is implemented and defined so that it can be used in the target
specific code for the aarch64 and rs6000 targets.

Generic code is implemented in gcc/pair-fusion-base.h,
gcc/pair-fusion-common.cc and gcc/pair-fusion.cc.

Code is implemented with pure virtual functions to interface with target
code.

Target specific code that uses the generic code is added in
rs6000-vecload-fusion.cc.

2024-02-19  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: New vecload fusion pass
before pass_early_remat.
* config/rs6000/rs6000-vecload-fusion.cc: Add new pass.
Add target specific implementation using pure virtual
functions.
* config.gcc: Add new executable.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
fusion pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload fusion
pass.
* config/rs6000/t-rs6000: Add new rule.
* pair-fusion-base.h: Generic header code for load store fusion
that can be shared across different architectures.
* pair-fusion-common.cc: Generic source code for load store
fusion that can be shared across different architectures.
* pair-fusion.cc: Generic implementation of pair_fusion class
defined in pair-fusion-base.h
* Makefile.in: Add new executable pair-fusion.o and
pair-fusion-common.o.
* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
from private.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload-fusion.C: New test.
* g++.target/powerpc/vecload-fusion_1.C: New test.
* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/Makefile.in   |2 +
 gcc/config.gcc|4 +-
 gcc/config/rs6000/rs6000-passes.def   |4 +-
 gcc/config/rs6000/rs6000-protos.h |1 +
 gcc/config/rs6000/rs6000-vecload-fusion.cc|  701 ++
 gcc/config/rs6000/rs6000.cc   |1 +
 gcc/config/rs6000/t-rs6000|5 +
 gcc/pair-fusion-base.h|  586 
 gcc/pair-fusion-common.cc | 1202 
 gcc/pair-fusion.cc| 1225 +
 gcc/rtl-ssa/accesses.h|2 +-
 .../g++.target/powerpc/vecload-fusion.C   |   15 +
 .../g++.target/powerpc/vecload-fusion_1.C |   22 +
 .../gcc.target/powerpc/mma-builtin-1.c|4 +-
 14 files changed, 3768 insertions(+), 6 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-fusion.cc
 create mode 100644 gcc/pair-fusion-base.h
 create mode 100644 gcc/pair-fusion-common.cc
 create mode 100644 gcc/pair-fusion.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload-fusion.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload-fusion_1.C

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a74761b7ab3..df5061ddfe7 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1563,6 +1563,8 @@ OBJS = \
ipa-strub.o \
ipa.o \
ira.o \
+   pair-fusion-common.o \
+   pair-fusion.o \
ira-build.o \
ira-costs.o \
ira-conflicts.o \
diff --git a/gcc/config.gcc b/gcc/config.gcc
index a0f9c672308..d696d826df8 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -518,7 +518,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-fusion.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -555,7 +555,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-s

[PATCH V3 0/2] aarch64: Place target independent and dependent changed code in one file.

2024-02-23 Thread Ajit Agarwal
Hello Richard/Alex/Segher:

This patch contains the changed code, both target independent and
target dependent, for load/store fusion.

The common infrastructure for load/store pair fusion is
divided into target-independent and target-dependent
changed code.

The target-independent code is the generic code, with
pure virtual functions forming the interface between the
target-independent and target-dependent code.

The target-dependent code is the implementation of the pure
virtual functions for the aarch64 target, plus the calls
into the target-independent code.
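
The split described above can be pictured with a small standalone sketch.
It is not the code from the patch: the two hook names are taken from the
diff below, but the class names, the within_alias_budget routine and the
returned values are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Simplified stand-in for the target-independent driver.  The real
// pair_fusion struct in the patch has many more members.
struct pair_fusion_sketch
{
  virtual ~pair_fusion_sketch () = default;

  // Target hooks (names as in the patch, everything else simplified).
  virtual bool pair_trailing_writeback_p () = 0;
  virtual int pair_mem_alias_check_limit () = 0;

  // Hypothetical generic routine: it is written once, purely in terms
  // of the hooks, and never needs to know which target it runs on.
  bool within_alias_budget (int checks_done)
  {
    return checks_done < pair_mem_alias_check_limit ();
  }
};

// Hypothetical target implementation; the values are illustrative only.
struct example_target_fusion : pair_fusion_sketch
{
  bool pair_trailing_writeback_p () override { return false; }
  int pair_mem_alias_check_limit () override { return 8; }
};
```

A second target would add another derived struct and override the same
hooks, leaving the generic routine untouched.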

Bootstrapped for aarch64-linux-gnu.

Thanks & Regards
Ajit

aarch64: Place target independent and dependent changed code in one file.

The common infrastructure for load/store pair fusion is
divided into target-independent and target-dependent
changed code.

The target-independent code is the generic code, with
pure virtual functions forming the interface between the
target-independent and target-dependent code.

The target-dependent code is the implementation of the pure
virtual functions for the aarch64 target, plus the calls
into the target-independent code.

2024-02-23  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/aarch64/aarch64-ldp-fusion.cc: Place target
independent and dependent changed code.
---
 gcc/config/aarch64/aarch64-ldp-fusion.cc | 437 ---
 1 file changed, 305 insertions(+), 132 deletions(-)
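
Note on the renamed immediate-range constants in the diff below: the
arithmetic is unchanged by the rename. A standalone check of the values,
using int64_t in place of HOST_WIDE_INT (an assumption made to keep the
sketch self-contained) and a hypothetical helper imm_in_range_p:

```cpp
#include <cassert>
#include <cstdint>

// Same arithmetic as the PAIR_MEM_* constants, with shorter local names.
constexpr int64_t imm_bits = 7;
constexpr int64_t imm_sign_bit = 1 << (imm_bits - 1);
constexpr int64_t max_imm = imm_sign_bit - 1;
constexpr int64_t min_imm = -max_imm - 1;

// A 7-bit signed field covers [-64, 63].
static_assert (imm_sign_bit == 64, "sign bit of a 7-bit field");
static_assert (max_imm == 63, "largest encodable immediate");
static_assert (min_imm == -64, "smallest encodable immediate");

// Hypothetical helper, not part of the patch: range test for a value
// that is meant to be placed in the immediate field.
constexpr bool imm_in_range_p (int64_t imm)
{
  return imm >= min_imm && imm <= max_imm;
}
```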

diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 22ed95eb743..2ef22ff1e96 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -40,10 +40,10 @@
 
 using namespace rtl_ssa;
 
-static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
-static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
-static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
-static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
+static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
+static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << (PAIR_MEM_IMM_BITS - 1));
+static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
+static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;
 
 // We pack these fields (load_p, fpsimd_p, and size) into an integer
 // (LFS) which we use as part of the key into the main hash tables.
@@ -138,8 +138,18 @@ struct alt_base
   poly_int64 offset;
 };
 
+// Virtual base class for load/store walkers used in alias analysis.
+struct alias_walker
+{
+  virtual bool conflict_p (int &budget) const = 0;
+  virtual insn_info *insn () const = 0;
+  virtual bool valid () const  = 0;
+  virtual void advance () = 0;
+};
+
+
 // State used by the pass for a given basic block.
-struct ldp_bb_info
+struct pair_fusion
 {
   using def_hash = nofree_ptr_hash<def_info>;
   using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
@@ -161,13 +171,13 @@ struct ldp_bb_info
   static const size_t obstack_alignment = sizeof (void *);
   bb_info *m_bb;
 
-  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
+  pair_fusion (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
   {
 obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
obstack_alignment, obstack_chunk_alloc,
obstack_chunk_free);
   }
-  ~ldp_bb_info ()
+  ~pair_fusion ()
   {
 obstack_free (&m_obstack, nullptr);
 
@@ -177,10 +187,50 @@ struct ldp_bb_info
bitmap_obstack_release (&m_bitmap_obstack);
   }
   }
+  void track_access (insn_info *, bool load, rtx mem);
+  void transform ();
+  void cleanup_tombstones ();
+  virtual void set_multiword_subreg (insn_info *i1, insn_info *i2,
+bool load_p) = 0;
+  virtual rtx gen_load_store_pair (rtx *pats,  rtx writeback,
+  bool load_p) = 0;
+  void merge_pairs (insn_list_t &, insn_list_t &,
+   bool load_p, unsigned access_size);
+  virtual void transform_for_base (int load_size, access_group &group) = 0;
+
+  bool try_fuse_pair (bool load_p, unsigned access_size,
+insn_info *i1, insn_info *i2);
+
+  bool fuse_pair (bool load_p, unsigned access_size,
+ int writeback,
+ insn_info *i1, insn_info *i2,
+ base_cand &base,
+ const insn_range_info &move_range);
+
+  void do_alias_analysis (insn_info *alias_hazards[4],
+ alias_walker *walkers[4],
+ bool load_p);
+
+  void track_tombstone (int uid);
+
+  bool track_via_mem_expr (insn_info *, rtx mem, lfs_fields lfs);
+
+  virtual bool is_fpsimd_op_p (rtx reg_op, machine_mode mem_mode,
+  bool load_p) = 0;
+
+  virtual bool pair_operand_mode_ok_p (machine_mode mode) = 0;
+  virtual bool pair_trailing_writeback_p () = 0;
+  virtual bool pair_check_register_operand (bool load_p, rtx reg_op,
+   machine_mode mem_mode) = 0;
+  virtual int pair_mem_alias_check_limit () = 0;

Re: [PATCH 0/2 V2] aarch64: Place target independent and dependent code in one file.

2024-02-23 Thread Ajit Agarwal
Hello Richard:

On 23/02/24 1:19 am, Richard Sandiford wrote:
> Ajit Agarwal  writes:
>> Hello Alex/Richard:
>>
>> I have placed target independent and target dependent code in
>> aarch64-ldp-fusion for load store fusion.
>>
>> Common infrastructure of load store pair fusion is divided into
>> target independent and target dependent code.
>>
>> Target independent code is the Generic code with pure virtual
>> function to interface between target independent and dependent
>> code.
>>
>> Target dependent code is the implementation of pure virtual
>> function for aarch64 target and the call to target independent
>> code.
> 
> Thanks for the update.  This is still quite hard to review though.
> Sorry to ask for another round, but could you split it up further?
> The ideal thing would be if patches that move code do nothing other
> than move code, and if patches that change code do those changes
> in-place.
> 

As per your suggestion, I have submitted a new patch with the above changes.
Sorry for the inconvenience caused.

Thanks & Regards
Ajit


> Richard
> 
>>
>> Bootstrapped in aarch64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> aarch64: Place target independent and dependent code in one file.
>>
>> Common infrastructure of load store pair fusion is divided into
>> target independent and target dependent code.
>>
>> Target independent code is the Generic code with pure virtual
>> function to interface between target independent and dependent
>> code.
>>
>> Target dependent code is the implementation of pure virtual
>> function for aarch64 target and the call to target independent
>> code.
>>
>> 2024-02-15  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-ldp-fusion.cc: Place target
>>  independent and dependent code.
>> ---
>>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 3513 --
>>  1 file changed, 1842 insertions(+), 1671 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> index 22ed95eb743..0ab842e2bbb 100644
>> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
>> @@ -17,6 +17,7 @@
>>  // along with GCC; see the file COPYING3.  If not see
>>  // <http://www.gnu.org/licenses/>.
>>  
>> +
>>  #define INCLUDE_ALGORITHM
>>  #define INCLUDE_FUNCTIONAL
>>  #define INCLUDE_LIST
>> @@ -37,13 +38,12 @@
>>  #include "tree-hash-traits.h"
>>  #include "print-tree.h"
>>  #include "insn-attr.h"
>> -
>>  using namespace rtl_ssa;
>>  
>> -static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
>> -static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
>> -static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
>> -static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << (PAIR_MEM_IMM_BITS - 1));
>> +static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
>> +static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;
>>  
>>  // We pack these fields (load_p, fpsimd_p, and size) into an integer
>>  // (LFS) which we use as part of the key into the main hash tables.
>> @@ -138,8 +138,144 @@ struct alt_base
>>poly_int64 offset;
>>  };
>>  
>> +// Class that implements a state machine for building the changes needed to form
>> +// a store pair instruction.  This allows us to easily build the changes in
>> +// program order, as required by rtl-ssa.
>> +struct stp_change_builder
>> +{
>> +  enum class state
>> +  {
>> +FIRST,
>> +INSERT,
>> +FIXUP_USE,
>> +LAST,
>> +DONE
>> +  };
>> +
>> +  enum class action
>> +  {
>> +TOMBSTONE,
>> +CHANGE,
>> +INSERT,
>> +FIXUP_USE
>> +  };
>> +
>> +  struct change
>> +  {
>> +action type;
>> +insn_info *insn;
>> +  };
>> +
>> +  bool done () const { return m_state == state::DONE; }
>> +
>> +  stp_change_builder (insn_info *insns[2],
>> +  insn_info *repurpose,
>> +  insn_info *dest)
>> +: m_state (state::FIRST), m_insns { insns[0], insns[1] },
>> +  m_repurpos

Re: [PATCH V3 0/2] aarch64: Place target independent and dependent changed and unchanged code in one file.

2024-02-26 Thread Ajit Agarwal
Hello Richard/Alex:

This patch has a better diff, showing both changed and unchanged code.
The unchanged code and some of the changed code will be extracted
into target-independent headers and sources, while the target-dependent
code, changed and unchanged, will remain in a target-dependent
file such as aarch64-ldp-fusion.cc.

Please review.

Thanks & Regards
Ajit

On 23/02/24 4:41 pm, Ajit Agarwal wrote:
> Hello Richard/Alex/Segher:
> 
> This patch adds the changed code for target independent and
> dependent code for load store fusion.
> 
> Common infrastructure of load store pair fusion is
> divided into target independent and target dependent
> changed code.
> 
> Target independent code is the Generic code with
> pure virtual function to interface between target
> independent and dependent code.
> 
> Target dependent code is the implementation of pure
> virtual function for aarch64 target and the call
> to target independent code.
> 
> Bootstrapped for aarch64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> aarch64: Place target independent and dependent changed code in one file.
> 
> Common infrastructure of load store pair fusion is
> divided into target independent and target dependent
> changed code.
> 
> Target independent code is the Generic code with
> pure virtual function to interface between target
> independent and dependent code.
> 
> Target dependent code is the implementation of pure
> virtual function for aarch64 target and the call
> to target independent code.
> 
> 2024-02-23  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/aarch64/aarch64-ldp-fusion.cc: Place target
>   independent and dependent changed code.
> ---
>  gcc/config/aarch64/aarch64-ldp-fusion.cc | 437 ---
>  1 file changed, 305 insertions(+), 132 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> index 22ed95eb743..2ef22ff1e96 100644
> --- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
> +++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
> @@ -40,10 +40,10 @@
>  
>  using namespace rtl_ssa;
>  
> -static constexpr HOST_WIDE_INT LDP_IMM_BITS = 7;
> -static constexpr HOST_WIDE_INT LDP_IMM_SIGN_BIT = (1 << (LDP_IMM_BITS - 1));
> -static constexpr HOST_WIDE_INT LDP_MAX_IMM = LDP_IMM_SIGN_BIT - 1;
> -static constexpr HOST_WIDE_INT LDP_MIN_IMM = -LDP_MAX_IMM - 1;
> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_BITS = 7;
> +static constexpr HOST_WIDE_INT PAIR_MEM_IMM_SIGN_BIT = (1 << (PAIR_MEM_IMM_BITS - 1));
> +static constexpr HOST_WIDE_INT PAIR_MEM_MAX_IMM = PAIR_MEM_IMM_SIGN_BIT - 1;
> +static constexpr HOST_WIDE_INT PAIR_MEM_MIN_IMM = -PAIR_MEM_MAX_IMM - 1;
>  
>  // We pack these fields (load_p, fpsimd_p, and size) into an integer
>  // (LFS) which we use as part of the key into the main hash tables.
> @@ -138,8 +138,18 @@ struct alt_base
>poly_int64 offset;
>  };
>  
> +// Virtual base class for load/store walkers used in alias analysis.
> +struct alias_walker
> +{
> +  virtual bool conflict_p (int &budget) const = 0;
> +  virtual insn_info *insn () const = 0;
> +  virtual bool valid () const  = 0;
> +  virtual void advance () = 0;
> +};
> +
> +
>  // State used by the pass for a given basic block.
> -struct ldp_bb_info
> +struct pair_fusion
>  {
>using def_hash = nofree_ptr_hash<def_info>;
>using expr_key_t = pair_hash<tree_operand_hash, int_hash<int, -1, -2>>;
> @@ -161,13 +171,13 @@ struct ldp_bb_info
>static const size_t obstack_alignment = sizeof (void *);
>bb_info *m_bb;
>  
> -  ldp_bb_info (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
> +  pair_fusion (bb_info *bb) : m_bb (bb), m_emitted_tombstone (false)
>{
>  obstack_specify_allocation (&m_obstack, OBSTACK_CHUNK_SIZE,
>   obstack_alignment, obstack_chunk_alloc,
>   obstack_chunk_free);
>}
> -  ~ldp_bb_info ()
> +  ~pair_fusion ()
>{
>  obstack_free (&m_obstack, nullptr);
>  
> @@ -177,10 +187,50 @@ struct ldp_bb_info
>   bitmap_obstack_release (&m_bitmap_obstack);
>}
>}
> +  void track_access (insn_info *, bool load, rtx mem);
> +  void transform ();
> +  void cleanup_tombstones ();
> +  virtual void set_multiword_subreg (insn_info *i1, insn_info *i2,
> +  bool load_p) = 0;
> +  virtual rtx gen_load_store_pair (rtx *pats,  rtx writeback,
> +bool load_p) = 0;
> +  void merge_pairs (insn_list_t &, insn_list_t &,
> + bool load_p, unsigned access_size);
> +  virtual void transform_for_base (int load_size, access_group &group) = 0;
> +
>

[PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-14 Thread Ajit Agarwal
Hello All:

This patch adds the vecload pass, which replaces lxv instructions that
access adjacent memory with lxvp instructions. The pass is added before
the IRA pass.

The vecload pass removes one of the two adjacent lxv loads and replaces the
other with an lxvp. Because one of the defining loads is removed, its
allocno has only uses and no defs.

As a result, the IRA pass does not assign a register pair, that is,
registers in sequence. The IRA register allocator is changed to assign
sequential registers to adjacent loads.

Some of the registers are cleared and are not marked as profitable
registers, because a zero cost is greater than negative costs; checks are
added to compare positive costs.

LRA is changed not to reassign them to different registers, so that the
sequential register pairs stay intact.
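
To make "adjacent" concrete: an lxv loads 16 bytes and an lxvp loads 32,
so two lxv loads are candidates when they use the same base and their
offsets differ by exactly 16. A minimal standalone sketch of that test
follows; the vec_load struct and adjacent_loads_p are hypothetical names
for illustration and are not taken from rs6000-vecload-opt.cc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified description of one 16-byte vector load:
// a base register number plus a constant byte offset.
struct vec_load
{
  int base_regno;
  int64_t offset;
};

constexpr int64_t lxv_bytes = 16;  // One lxv loads 16 bytes.

// Return true if A and B read the two halves of one contiguous
// 32-byte region, in either order.
inline bool adjacent_loads_p (const vec_load &a, const vec_load &b)
{
  if (a.base_regno != b.base_regno)
    return false;
  int64_t diff = a.offset - b.offset;
  return diff == lxv_bytes || diff == -lxv_bytes;
}
```

The sketch ignores everything else a real pass has to check, such as
intervening stores or alignment requirements.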


contrib/check_GNU_style.sh run on the patch looks good.

Bootstrapped and regtested for powerpc64-linux-gnu.

SPEC CPU 2017 benchmarks were run, and some of the FP benchmarks show
significant gains.

Thanks & Regards
Ajit


rs6000: New pass for replacement of adjacent lxv with lxvp.

New pass to replace lxv loads from adjacent memory addresses with lxvp.
This pass is registered before the IRA RTL pass.

2024-01-14  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: Registered vecload pass.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add new executable.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.
* ira-color.cc: Form register pair with adjacent loads.
* lra-assigns.cc: Skip modifying register pair assignment.
* lra-int.h: Add pseudo_conflict field in lra_reg_p structure.
* lra.cc: Initialize pseudo_conflict field.
* ira-build.cc: Use of REG_FREQ.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
* g++.target/powerpc/vecload1.C: New test.
* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc|   4 +-
 gcc/config/rs6000/rs6000-passes.def   |   4 +
 gcc/config/rs6000/rs6000-protos.h |   5 +-
 gcc/config/rs6000/rs6000-vecload-opt.cc   | 432 ++
 gcc/config/rs6000/rs6000.cc   |   8 +-
 gcc/config/rs6000/t-rs6000|   5 +
 gcc/ira-color.cc  | 220 -
 gcc/lra-assigns.cc| 118 -
 gcc/lra-int.h |   2 +
 gcc/lra.cc|   1 +
 gcc/testsuite/g++.target/powerpc/vecload.C|  15 +
 gcc/testsuite/g++.target/powerpc/vecload1.C   |  22 +
 .../gcc.target/powerpc/mma-builtin-1.c|   4 +-
 13 files changed, 816 insertions(+), 24 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index f0676c830e8..4cf15e807de 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -518,7 +518,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -555,7 +555,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..8bd172dd779 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -29,6 +29,10 @@ along with GCC; see the file COPYING3.  If not see
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
 
+  /* Pass to replace adjacent memory addresses lxv instruction with lxvp
+ instruction.  */
+  INSERT_PASS_BEFORE (pass_ira, 1, pass_analyze_vecload);
+
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
  address as a base register.  */
diff --git a/gcc/config/rs6000/rs6000-protos.h 
b/gcc/config/rs6

Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-15 Thread Ajit Agarwal
Hello All:

The following performance gains were measured on SPEC CPU 2017 FP benchmarks:

554.roms_r  16% gain
544.nab_r   9.98% gain
521.wrf_r   6.89% gain

Thanks & Regards
Ajit


On 14/01/24 8:55 pm, Ajit Agarwal wrote:
> Hello All:
> 
> This patch add the vecload pass to replace adjacent memory accesses lxv with 
> lxvp
> instructions. This pass is added before ira pass.
> 
> vecload pass removes one of the defined adjacent lxv (load) and replace with 
> lxvp.
> Due to removal of one of the defined loads the allocno is has only uses but
> not defs.
> 
> Due to this IRA pass doesn't assign register pairs like registers in sequence.
> Changes are made in IRA register allocator to assign sequential registers to
> adjacent loads.
> 
> Some of the registers are cleared and are not set as profitable registers due 
> to zero cost is greater than negative costs and checks are added to compare
> positive costs.
> 
> LRA register is changed not to reassign them to different register and form
> the sequential register pairs intact.
> 
> 
> contrib/check_GNU_style.sh run on patch looks good.
> 
> Bootstrapped and regtested for powerpc64-linux-gnu.
> 
> Spec2017 benchmarks are run and I get impressive benefits for some of the FP
> benchmarks.
> 
> Thanks & Regards
> Ajit
> 
> 
> rs6000: New  pass for replacement of adjacent lxv with lxvp.
> 
> New pass to replace adjacent memory addresses lxv with lxvp.
> This pass is registered before ira rtl pass.
> 
> 2024-01-14  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000-passes.def: Registered vecload pass.
>   * config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>   * config.gcc: Add new executable.
>   * config/rs6000/rs6000-protos.h: Add new prototype for vecload
>   pass.
>   * config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>   * config/rs6000/t-rs6000: Add new rule.
>   * ira-color.cc: Form register pair with adjacent loads.
>   * lra-assigns.cc: Skip modifying register pair assignment.
>   * lra-int.h: Add pseudo_conflict field in lra_reg_p structure.
>   * lra.cc: Initialize pseudo_conflict field.
>   * ira-build.cc: Use of REG_FREQ.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.target/powerpc/vecload.C: New test.
>   * g++.target/powerpc/vecload1.C: New test.
>   * gcc.target/powerpc/mma-builtin-1.c: Modify test.
> ---
>  gcc/config.gcc|   4 +-
>  gcc/config/rs6000/rs6000-passes.def   |   4 +
>  gcc/config/rs6000/rs6000-protos.h |   5 +-
>  gcc/config/rs6000/rs6000-vecload-opt.cc   | 432 ++
>  gcc/config/rs6000/rs6000.cc   |   8 +-
>  gcc/config/rs6000/t-rs6000|   5 +
>  gcc/ira-color.cc  | 220 -
>  gcc/lra-assigns.cc| 118 -
>  gcc/lra-int.h |   2 +
>  gcc/lra.cc|   1 +
>  gcc/testsuite/g++.target/powerpc/vecload.C|  15 +
>  gcc/testsuite/g++.target/powerpc/vecload1.C   |  22 +
>  .../gcc.target/powerpc/mma-builtin-1.c|   4 +-
>  13 files changed, 816 insertions(+), 24 deletions(-)
>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index f0676c830e8..4cf15e807de 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -518,7 +518,7 @@ or1k*-*-*)
>   ;;
>  powerpc*-*-*)
>   cpu_type=rs6000
> - extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> + extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
> rs6000-vecload-opt.o"
>   extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>   extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>   extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
> @@ -555,7 +555,7 @@ riscv*)
>   ;;
>  rs6000*-*-*)
>   extra_options="${extra_options} g.opt fused-madd.opt 
> rs6000/rs6000-tables.opt"
> - extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> + extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o 
> rs6000-vecload-opt.o"
>   extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>   target_gtfiles="$target_gtfiles 
> \$(srcdir)/config/rs6000/rs6000-logue.cc 
> \$(srcdir)/config/rs6000/rs6000-call.cc"
>   target_gtfiles="$target_gtfiles 
> \$(srcdir)/confi

Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-15 Thread Ajit Agarwal
Hello Richard:

On 15/01/24 3:03 pm, Richard Biener wrote:
> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal  wrote:
>>
>> Hello All:
>>
>> This patch add the vecload pass to replace adjacent memory accesses lxv with 
>> lxvp
>> instructions. This pass is added before ira pass.
>>
>> vecload pass removes one of the defined adjacent lxv (load) and replace with 
>> lxvp.
>> Due to removal of one of the defined loads the allocno is has only uses but
>> not defs.
>>
>> Due to this IRA pass doesn't assign register pairs like registers in 
>> sequence.
>> Changes are made in IRA register allocator to assign sequential registers to
>> adjacent loads.
>>
>> Some of the registers are cleared and are not set as profitable registers due
>> to zero cost is greater than negative costs and checks are added to compare
>> positive costs.
>>
>> LRA register is changed not to reassign them to different register and form
>> the sequential register pairs intact.
>>
>>
>> contrib/check_GNU_style.sh run on patch looks good.
>>
>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>
>> Spec2017 benchmarks are run and I get impressive benefits for some of the FP
>> benchmarks.
>
> I want to point out the aarch64 target recently got a ld/st fusion
> pass which sounds related.  It would be nice to have at least common
> infrastructure for this (the aarch64 one also looks quite more powerful)

The load/store fusion pass on aarch64 is scheduled before the peephole2
pass and after register allocation. In our case, if we run after the
register allocator, we must keep the register assigned to the lower-offset
load, while the other load, which is adjacent to it at an offset
difference of 16, is removed.

We are then left with one load at the lower offset, and the register that
the allocator assigned to the lower-offset load should be lower than that
of the other adjacent load. If it is not, we need to change it to the
lower register and propagate the change to all uses of the variable.
Similarly, for the adjacent load that we remove, the register needs to be
propagated to all of its uses.

In that case we are doing the work of the register allocator. In most of
our example testcases the register allocator assigns the lower-offset load
a greater register than the other adjacent load, so we would always be
propagating registers and almost redoing the register allocator's work.
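
The condition described above can be written down as a small standalone
check. This is only a sketch of the constraint as stated in this mail
(the lower-offset load gets the lower register, and the two registers are
sequential); the allocated_load struct and pair_needs_no_renaming_p are
hypothetical names, and further restrictions a real target may impose on
register pairs are deliberately left out.

```cpp
#include <cassert>

// Hypothetical record of what the register allocator chose for one
// of the two adjacent loads.
struct allocated_load
{
  long offset;     // Byte offset from the common base.
  int hard_regno;  // Hard register picked by the allocator.
};

// Return true if the two loads could be fused after register
// allocation without renaming: the load at the lower offset must hold
// the lower register, and the other load the next register.
inline bool pair_needs_no_renaming_p (const allocated_load &a,
				      const allocated_load &b)
{
  const allocated_load &lo = a.offset < b.offset ? a : b;
  const allocated_load &hi = a.offset < b.offset ? b : a;
  return hi.hard_regno == lo.hard_regno + 1;
}
```

Whenever this check fails, a pass running after register allocation has
to rename registers and update every use, which is the extra work
described above.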

Is it okay to use the same load/store fusion pass as on aarch64 for our
cases, considering the above scenario?

Please let me know what you think.

Thanks & Regards
Ajit
>> Thanks & Regards
>> Ajit
>>
>>
>> rs6000: New  pass for replacement of adjacent lxv with lxvp.
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> This pass is registered before ira rtl pass.
>>
>> 2024-01-14  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000-passes.def: Registered vecload pass.
>> * config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>> * config.gcc: Add new executable.
>> * config/rs6000/rs6000-protos.h: Add new prototype for vecload
>> pass.
>> * config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>> * config/rs6000/t-rs6000: Add new rule.
>> * ira-color.cc: Form register pair with adjacent loads.
>> * lra-assigns.cc: Skip modifying register pair assignment.
>> * lra-int.h: Add pseudo_conflict field in lra_reg_p structure.
>> * lra.cc: Initialize pseudo_conflict field.
>> * ira-build.cc: Use of REG_FREQ.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * g++.target/powerpc/vecload.C: New test.
>> * g++.target/powerpc/vecload1.C: New test.
>> * gcc.target/powerpc/mma-builtin-1.c: Modify test.
>> ---
>>  gcc/config.gcc|   4 +-
>>  gcc/config/rs6000/rs6000-passes.def   |   4 +
>>  gcc/config/rs6000/rs6000-protos.h |   5 +-
>>  gcc/config/rs6000/rs6000-vecload-opt.cc   | 432 ++
>>  gcc/config/rs6000/rs6000.cc   |   8 +-
>>  gcc/config/rs6000/t-rs6000|   5 +
>>  gcc/ira-color.cc  | 220 -
>>  gcc/lra-assigns.cc| 118 -
>>  gcc/lra-int.h |   2 +
>>  gcc/lra.cc|   1 +
>>  gcc/testsuite/g++.target/powerpc/vecload.C|  15 +
>>  gcc/testsuite/g++.target/powerpc/vecload1.C   |  22 +
>>  .../gcc.target/powerpc/mma-builtin-1.c|   

Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-15 Thread Ajit Agarwal



On 15/01/24 6:14 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 15/01/24 3:03 pm, Richard Biener wrote:
>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal  wrote:
>>>
>>> Hello All:
>>>
>>> This patch add the vecload pass to replace adjacent memory accesses lxv 
>>> with lxvp
>>> instructions. This pass is added before ira pass.
>>>
>>> vecload pass removes one of the defined adjacent lxv (load) and replace 
>>> with lxvp.
>>> Due to removal of one of the defined loads the allocno is has only uses but
>>> not defs.
>>>
>>> Due to this IRA pass doesn't assign register pairs like registers in 
>>> sequence.
>>> Changes are made in IRA register allocator to assign sequential registers to
>>> adjacent loads.
>>>
>>> Some of the registers are cleared and are not set as profitable registers 
>>> due
>>> to zero cost is greater than negative costs and checks are added to compare
>>> positive costs.
>>>
>>> LRA register is changed not to reassign them to different register and form
>>> the sequential register pairs intact.
>>>
>>>
>>> contrib/check_GNU_style.sh run on patch looks good.
>>>
>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>
>>> Spec2017 benchmarks are run and I get impressive benefits for some of the FP
>>> benchmarks.
>> i
>> I want to point out the aarch64 target recently got a ld/st fusion
>> pass which sounds
>> related.  It would be nice to have at least common infrastructure for
>> this (the aarch64
>> one also looks quite more powerful)
> 
> load/store fusion pass in aarch64 is scheduled to use before peephole2 pass 
> and after register allocator pass. In our case, if we do after register 
> allocator
> then we should keep register assigned to lower offset load and other load
> that is adjacent to previous load with offset difference of 16 is removed.
> 
> Then we are left with one load with lower offset and register assigned 
> by register allocator for lower offset load should be lower than other
> adjacent load. If not, we need to change it to lower register and 
> propagate them with all the uses of the variable. Similary for other
> adjacent load that we are removing, register needs to be propagated to
> all the uses.
> 
> In that case we are doing the work of register allocator. In most of our
> example testcases the lower offset load is assigned greater register 
> than other adjacent load by register allocator and hence we are left
> with propagating them always and almost redoing the register allocator
> work.
> 
> Is it same/okay to use load/store fusion pass as on aarch64 for our cases
> considering the above scenario.
> 
> Please let me know what do you think. 
> 

Also, Mike and Kewen suggested running this pass before the IRA register
allocator. They are on the To list. They have other concerns about doing
this after the register allocator.

They have responded in another mail chain.

Mike and Kewen, please respond.

Thanks & Regards
Ajit
> Thanks & Regards
> Ajit
>>> Thanks & Regards
>>> Ajit
>>>
>>>
>>> rs6000: New  pass for replacement of adjacent lxv with lxvp.
>>>
>>> New pass to replace adjacent memory addresses lxv with lxvp.
>>> This pass is registered before ira rtl pass.
>>>
>>> 2024-01-14  Ajit Kumar Agarwal  
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/rs6000/rs6000-passes.def: Registered vecload pass.
>>> * config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>>> * config.gcc: Add new executable.
>>> * config/rs6000/rs6000-protos.h: Add new prototype for vecload
>>> pass.
>>> * config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>>> * config/rs6000/t-rs6000: Add new rule.
>>> * ira-color.cc: Form register pair with adjacent loads.
>>> * lra-assigns.cc: Skip modifying register pair assignment.
>>> * lra-int.h: Add pseudo_conflict field in lra_reg_p structure.
>>> * lra.cc: Initialize pseudo_conflict field.
>>> * ira-build.cc: Use of REG_FREQ.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> * g++.target/powerpc/vecload.C: New test.
>>> * g++.target/powerpc/vecload1.C: New test.
>>> * gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>> ---
>>>  gcc/config.gcc|   4 +-
>>>  gcc/config/rs6000/rs6000-pass

Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-15 Thread Ajit Agarwal
Hello Richard:

On 15/01/24 6:25 pm, Ajit Agarwal wrote:
> 
> 
> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>> Hello Richard:
>>
>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal  wrote:
>>>>
>>>> Hello All:
>>>>
>>>> This patch adds the vecload pass, which replaces adjacent lxv memory
>>>> accesses with lxvp instructions. The pass runs before the ira pass.
>>>>
>>>> The vecload pass removes one of the two adjacent lxv loads and replaces
>>>> the pair with lxvp. Because one of the defining loads is removed, its
>>>> allocno has only uses and no defs.
>>>>
>>>> As a result, the IRA pass does not assign register pairs, that is,
>>>> registers in sequence. IRA is changed to assign sequential registers to
>>>> adjacent loads.
>>>>
>>>> Some of the registers are cleared and are not set as profitable
>>>> registers because zero cost is greater than negative costs; checks are
>>>> added to compare positive costs.
>>>>
>>>> LRA is changed not to reassign them to different registers, so that the
>>>> sequential register pairs stay intact.
>>>>
>>>>
>>>> contrib/check_GNU_style.sh run on patch looks good.
>>>>
>>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>>
>>>> Spec2017 benchmarks are run and I get impressive benefits for some of
>>>> the FP benchmarks.
>>>
>>> I want to point out the aarch64 target recently got a ld/st fusion
>>> pass which sounds
>>> related.  It would be nice to have at least common infrastructure for
>>> this (the aarch64
>>> one also looks quite more powerful)
>>
>> The load/store fusion pass in aarch64 is scheduled before the peephole2
>> pass and after the register allocator. In our case, if we run after the
>> register allocator, we have to keep the register assigned to the
>> lower-offset load and remove the other load, the one adjacent to it at
>> an offset difference of 16.
>>
>> We are then left with a single load at the lower offset, and the register
>> the allocator assigned to it must be lower than the one assigned to the
>> adjacent load. If it is not, we need to change it to the lower register
>> and propagate that to all uses of the variable. Similarly, for the
>> adjacent load that we remove, the new register needs to be propagated to
>> all of its uses.
>>
>> In that case we are doing the register allocator's work. In most of our
>> example testcases the register allocator assigns the lower-offset load a
>> greater register than the adjacent load, so we would always end up
>> propagating registers and almost redoing register allocation.
>>
>> Is it okay to use a load/store fusion pass like the aarch64 one for our
>> case, considering the above scenario?
>>
>> Please let me know what you think.

I have gone through the implementation of ld/st fusion in aarch64.

Here is my understanding:

First of all, it was my mistake to say in my earlier mail that this pass
runs only before peephole2, after the RA pass.

The pass runs twice: early, before the RA pass (ahead of early-remat),
and again after the RA pass, before peephole2.

The pass fuses two ldr instructions with adjacent accesses into one ldp
instruction.

The assembly syntax of the ldp instruction is

ldp w3, w7, [x0]

It loads [x0] into w3 and [x0+4] into w7.

Both registers that form the pair are named in the ldp instruction, and
they need not be sequential, i.e. the second register does not have to be
the first register plus one (w4 after w3).

That is why running the pass before the RA pass works there: the fused
instruction keeps both defs, and the registers are not required to be
first_reg and first_reg+1. They can be any valid registers.


But for the lxvp instruction, given:

lxv vs32, 0(r2)
lxv vs45, 16(r2)

when we combine the above lxv instructions into lxvp, the lxvp instruction
becomes

lxvp vs32, 0(r2)

where r2+0 is loaded into vs32 and r2+16 is loaded into vs33 (sequential
registers). vs33 is implicit in the lxvp instruction. This is a mandatory
requirement of lxvp; the registers cannot be in any other sequence, and
the difference between the two register numbers must be 1.

All uses of vs45 then have to be replaced with vs33.

Also, the register allocator can allocate the two lxv instructions
to the following registers:

lxv vs33, 0(r2)
lxv vs32, 16(r2)

To generate lxvp for the above lxv instructions we need

lxvp vs32, 0(r2).
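
The register constraint discussed above can be written down as a small
check. This is only a sketch under assumptions: LoadInsn and
regs_form_lxvp_pair are hypothetical names, not part of the patch, and
the sketch follows the mapping stated in this mail (lower address in the
lower register). The even-register requirement comes from the OOmode
remark elsewhere in this thread; any endianness dependence of the real
instruction is not modeled.

```cpp
// Hypothetical model of two post-RA lxv loads off the same base register.
// Register numbers are plain VSX numbers (vs0..vs63).
struct LoadInsn
{
  int dest_vsr;  // destination VSX register number
  long offset;   // byte offset from the shared base register
};

// True if the two loads could be rewritten as one lxvp without renaming
// any register: the accesses are 16 bytes apart, the lower-offset load
// targets an even register, and the other load targets the next register.
constexpr bool
regs_form_lxvp_pair (LoadInsn lo, LoadInsn hi)
{
  return hi.offset - lo.offset == 16
	 && lo.dest_vsr % 2 == 0
	 && hi.dest_vsr == lo.dest_vsr + 1;
}
```

For the examples in this mail, the (vs32, vs45) and (vs33, vs32)
assignments both fail the check, which is why a post-RA pass would have
to rename registers and update their uses; only (vs32, vs33) passes.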

And all the registers vs33 has

Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-17 Thread Ajit Agarwal
Hello Kewen:

On 17/01/24 12:32 pm, Kewen.Lin wrote:
> on 2024/1/16 06:22, Ajit Agarwal wrote:
>> Hello Richard:
>>
>> On 15/01/24 6:25 pm, Ajit Agarwal wrote:
>>>
>>>
>>> On 15/01/24 6:14 pm, Ajit Agarwal wrote:
>>>> Hello Richard:
>>>>
>>>> On 15/01/24 3:03 pm, Richard Biener wrote:
>>>>> On Sun, Jan 14, 2024 at 4:29 PM Ajit Agarwal  
>>>>> wrote:
>>>>>>
>>>>>> Hello All:
>>>>>>
>>>>>> This patch adds the vecload pass, which replaces adjacent lxv memory
>>>>>> accesses with lxvp instructions. The pass runs before the ira pass.
>>>>>>
>>>>>> The vecload pass removes one of the two adjacent lxv loads and
>>>>>> replaces the pair with lxvp. Because one of the defining loads is
>>>>>> removed, its allocno has only uses and no defs.
>>>>>>
>>>>>> As a result, the IRA pass does not assign register pairs, that is,
>>>>>> registers in sequence. IRA is changed to assign sequential registers
>>>>>> to adjacent loads.
>>>>>>
>>>>>> Some of the registers are cleared and are not set as profitable
>>>>>> registers because zero cost is greater than negative costs; checks
>>>>>> are added to compare positive costs.
>>>>>>
>>>>>> LRA is changed not to reassign them to different registers, so that
>>>>>> the sequential register pairs stay intact.
>>>>>>
>>>>>> contrib/check_GNU_style.sh run on patch looks good.
>>>>>>
>>>>>> Bootstrapped and regtested for powerpc64-linux-gnu.
>>>>>>
>>>>>> Spec2017 benchmarks are run and I get impressive benefits for some of
>>>>>> the FP benchmarks.
>>>>>
>>>>> I want to point out the aarch64 target recently got a ld/st fusion
>>>>> pass which sounds
>>>>> related.  It would be nice to have at least common infrastructure for
>>>>> this (the aarch64
>>>>> one also looks quite more powerful)
> 
> Thanks Richi for pointing out this pass.  Yeah, it would be nice if we can
> share something common.  CC the author Alex as well in case he has more
> insightful comments.
> 
>>>>
>>>> The load/store fusion pass in aarch64 is scheduled before the peephole2
>>>> pass and after the register allocator. In our case, if we run after the
>>>> register allocator, we have to keep the register assigned to the
>>>> lower-offset load and remove the other load, the one adjacent to it at
>>>> an offset difference of 16.
>>>>
>>>> We are then left with a single load at the lower offset, and the
>>>> register the allocator assigned to it must be lower than the one
>>>> assigned to the adjacent load. If it is not, we need to change it to
>>>> the lower register and propagate that to all uses of the variable.
>>>> Similarly, for the adjacent load that we remove, the new register needs
>>>> to be propagated to all of its uses.
>>>>
>>>> In that case we are doing the register allocator's work. In most of our
>>>> example testcases the register allocator assigns the lower-offset load a
>>>> greater register than the adjacent load, so we would always end up
>>>> propagating registers and almost redoing register allocation.
>>>>
>>>> Is it okay to use a load/store fusion pass like the aarch64 one for our
>>>> case, considering the above scenario?
>>>>
>>>> Please let me know what you think.
>>
>> I have gone through the implementation of ld/st fusion in aarch64.
>>
>> Here is my understanding:
>>
>> First all its my mistake that I have mentioned in my earlier mail that 
>> this pass is done before peephole2 after RA-pass.
>>
>> This pass does it before RA-pass early before early-remat and 
>> also before peephole2 after RA-pass.
>>
>> This pass does load fusion 2 ldr instruction with adjacent accesses
>> into ldp instruction.
>>
>> The assembly syntax of ldp instruction is
>

Re: [PATCH V1] rs6000: New pass for replacement of adjacent (load) lxv with lxvp

2024-01-18 Thread Ajit Agarwal
Hello Michael:

On 17/01/24 7:58 pm, Michael Matz wrote:
> Hello,
> 
> On Wed, 17 Jan 2024, Ajit Agarwal wrote:
> 
>>> first is even, since OOmode is only ok for even vsx register and its
>>> size makes it take two consecutive vsx registers.
>>>
>>> Hi Peter, is my understanding correct?
>>>
>>
>> I tried all the combinations in the past and RA does not allocate
>> sequential registers. I don't see any code in RA that generates
>> sequential registers.
> 
> See HARD_REGNO_NREGS.  If you form a pseudo of a mode that's larger than a 
> native-sized hardreg (and the target is correctly set up) then the RA will 
> allocate the correct number of hardregs (consecutively) for this pseudo.  
> This is what Kewen was referring to by mentioning the OOmode for the new 
> hypothetical pseudo.  The individual parts of such pseudo will then need 
> to use subreg to access them.
> 
> So, when you work before RA you simply will transform this (I'm going to 
> use SImode and DImode for demonstration):
> 
>(set (reg:SI x) (mem:SI (addr)))
>(set (reg:SI y) (mem:SI (addr+4)))
>...
>( ...use1... (reg:SI x))
>( ...use2... (reg:SI y))
> 
> into this:
> 
>(set (reg:DI z) (mem:DI (addr)))
>...
>( ...use1... (subreg:SI (reg:DI z) 0))
>( ...use2... (subreg:SI (reg:DI z) 4))
> 
> For this to work the target needs to accept the (subreg...) in certain 
> operands of instruction patterns, which I assume was what Kewen also 
> referred to.  The register allocator will then assign hardregs X and X+1 
> to the pseudo-reg 'z'.  (Assuming that DImode is okay for hardreg X, and 
> HARD_REGNO_NREGS says that it needs two hardregs to hold DImode).
> 
> It will also replace the subregs by their appropriate concrete hardreg.
> 
> It seems your problems stem from trying to place your new pass somewhere 
> within the register-allocation pipeline, rather than simply completely 
> before.
> 
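
The allocation rule described above can be illustrated with a toy model.
This is a sketch under assumptions, not GCC code: nregs_for_mode and
subreg_hard_regno are hypothetical names, sizes are in bytes, and
endianness (which real subreg numbering has to take into account) is
ignored.

```cpp
// Toy model of "a pseudo wider than one hardreg gets consecutive hardregs".
// Hypothetical helpers for illustration only; they are not GCC functions.

// Number of hard registers needed to hold a value of mode_size bytes
// when each hard register holds reg_size bytes (rounded up).
constexpr int
nregs_for_mode (int mode_size, int reg_size)
{
  return (mode_size + reg_size - 1) / reg_size;
}

// Hard register that holds the piece at byte_offset of a pseudo allocated
// starting at hard register first_regno.  Endianness is ignored.
constexpr int
subreg_hard_regno (int first_regno, int byte_offset, int reg_size)
{
  return first_regno + byte_offset / reg_size;
}
```

For the SImode/DImode example, nregs_for_mode (8, 4) is 2, and the pieces
at byte offsets 0 and 4 land in hardregs X and X+1. Assuming a 32-byte
pair mode over 16-byte vector registers, the same arithmetic also yields
two consecutive registers.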

Thanks for the suggestions. That worked: with the above changes the RA
pass generates sequential registers.

I am working on common infrastructure, shared with AARCH64, for a pass
that handles register-pair loads and stores.

Thanks & Regards
Ajit

> 
> Ciao,
> Michael.


Fwd: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-01-21 Thread Ajit Agarwal


Hello All:

This is a new pass to replace lxv loads from adjacent memory addresses
with lxvp. It also adds common infrastructure for load/store fusion that
different targets can share.

The common routines are factored out into fusion-common.h.

The AARCH64 load/store fusion pass itself is not changed to use the
common infrastructure.

For AARCH64, the pass would just include "fusion-common.h" and add its
target-dependent code on top of that.


Alex/Richard:

If you would like me to make that change for AARCH64, I can do it.

If you would rather do it yourselves, that is fine with me.

Bootstrapped and regtested on powerpc64-linux-gnu.

Performance improvements are seen on the SPEC 2017 FP benchmarks.

Thanks & Regards
Ajit

rs6000: New pass for replacement of adjacent lxv with lxvp.

New pass to replace adjacent memory addresses lxv with lxvp.
Added common infrastructure for load store fusion for
different targets.

Common routines are refactored in fusion-common.h.

2024-01-21  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: New vecload pass
before pass_early_remat.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add new executable.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.
* fusion-common.h: Add common infrastructure for load store
fusion that can be shared across different architectures.
* emit-rtl.cc: Modify assert code.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
* g++.target/powerpc/vecload1.C: New test.
* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc|4 +-
 gcc/config/rs6000/rs6000-passes.def   |3 +
 gcc/config/rs6000/rs6000-protos.h |1 +
 gcc/config/rs6000/rs6000-vecload-opt.cc   | 1186 
 gcc/config/rs6000/rs6000.cc   |1 +
 gcc/config/rs6000/t-rs6000|5 +
 gcc/emit-rtl.cc   |   14 +-
 gcc/fusion-common.h   | 1195 +
 gcc/testsuite/g++.target/powerpc/vecload.C|   15 +
 gcc/testsuite/g++.target/powerpc/vecload1.C   |   22 +
 .../gcc.target/powerpc/mma-builtin-1.c|4 +-
 11 files changed, 2433 insertions(+), 17 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/fusion-common.h
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 00355509c92..9bff42cf830 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -518,7 +518,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -555,7 +555,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index 46a0d0b8c56..eb4a65ebe10 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,9 @@ along with GCC; see the file COPYING3.  If not see
  The power8 does not have instructions that automaticaly do the byte swaps
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  /* Pass to replace adjacent memory addresses lxv instruction with lxvp
+ instruction.  */
+  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index 09a57a806fa..f0a9f36602e 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -343,6 +343,7 @@ namespace gcc { class context; }
 class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern rtl_opt

Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).

2024-01-31 Thread Ajit Agarwal
Hello Alex:

Thanks for your valuable review comments.

I am incorporating the comments and will send an updated patch with the
rs6000 and AARCH64 changes.

Thanks & Regards
Ajit

On 24/01/24 10:13 pm, Alex Coplan wrote:
> Hi Ajit,
> 
> On 21/01/2024 19:57, Ajit Agarwal wrote:
>>
>> Hello All:
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
> 
> Thanks for this, it would be nice to see the load/store pair pass
> generalized to multiple targets.
> 
> I assume you are targeting GCC 15 for this, as we are in stage 4 at
> the moment?
> 
>>
>> Common routines are refactored in fusion-common.h.
>>
>> AARCH64 load/store fusion pass is not changed with the 
>> common infrastructure.
> 
> I think any patch to generalize the load/store pair fusion pass should
> update the aarch64 code at the same time to use the generic
> infrastructure, instead of duplicating the code.
> 
> As a general comment, I think we should move as much of the code as
> possible to target-independent code, with only the bits that are truly
> target-specific (e.g. deciding which modes to allow for a load/store
> pair operand) in target code.
> 
> In terms of structuring the interface between generic code and target
> code, I think it would be pragmatic to use a class with (in some cases,
> pure) virtual functions that can be overridden by targets to implement
> any target-specific behaviour.
> 
> IMO the generic class should be implemented in its own .cc instead of
> using a header-only approach.  The target code would then define a
> derived class which overrides the virtual functions (where necessary)
> declared in the generic class, and then instantiate the derived class to
> create a target-customized instance of the pass.
> 
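
The class-based interface suggested above could look roughly like the
following. This is a minimal sketch under assumptions, not the structure
of any posted patch: every class and function name is hypothetical, the
per-target size policies are illustrative guesses, and the legality
checks a real pass needs (alias analysis among them) are left out.

```cpp
#include <cstddef>

// Hypothetical generic part of a load/store pair fusion pass.
class pair_fusion
{
public:
  virtual ~pair_fusion () = default;

  // Target-specific: may an access of this many bytes be half of a pair?
  virtual bool pair_operand_size_ok (std::size_t bytes) const = 0;

  // Target-specific with a default: does the target fuse stores as well?
  virtual bool handle_stores () const { return true; }

  // Generic logic shared by all targets: two same-sized accesses off the
  // same base are candidates only if they are exactly adjacent in memory.
  bool can_fuse (long off1, long off2, std::size_t bytes, bool is_store) const
  {
    if (is_store && !handle_stores ())
      return false;
    if (!pair_operand_size_ok (bytes))
      return false;
    long lo = off1 < off2 ? off1 : off2;
    long hi = off1 < off2 ? off2 : off1;
    return hi - lo == static_cast<long> (bytes);
  }
};

// Assumed rs6000 policy for illustration: 16-byte vector loads only.
class rs6000_pair_fusion : public pair_fusion
{
public:
  bool pair_operand_size_ok (std::size_t bytes) const override
  { return bytes == 16; }
  bool handle_stores () const override { return false; }
};

// Assumed aarch64 policy for illustration: 4-, 8- and 16-byte accesses.
class aarch64_pair_fusion : public pair_fusion
{
public:
  bool pair_operand_size_ok (std::size_t bytes) const override
  { return bytes == 4 || bytes == 8 || bytes == 16; }
};
```

Each target would instantiate its own derived class, so the shared
can_fuse logic lives in one place and only the small policy functions
differ per target.
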
> A more traditional GCC approach would be to use optabs and target hooks
> to customize the behaviour of the pass to handle target-specific
> aspects, but:
>  - Target hooks are quite heavyweight, and we'd potentially have to add
>quite a few hooks just for one pass that (at least initially) will
>only be used by a couple of targets.
>  - Using classes allows both sides to easily maintain their own state
>and share that state where appropriate.
> 
> Nit on naming: I understand you want to move away from ldp_fusion, but
> how about pair_fusion or mem_pair_fusion instead of just "fusion" as a
> base name?  IMO just "fusion" isn't very clear as to what the pass is
> trying to achieve.
> 
> In general the code could do with a lot more commentary to explain the
> rationale for various things / explain the high-level intent of the
> code.
> 
> Unfortunately I'm not familiar with the DF framework (I've only really
> worked with RTL-SSA for the aarch64 pass), so I haven't commented on the
> use of that framework, but it would be nice if what you're trying to do
> could be done using RTL-SSA instead of using DF directly.
> 
> Hopefully Richard S can chime in on those aspects.
> 
> My main concerns with the patch at the moment (apart from the code
> duplication) are that it looks like:
> 
>  - The patch removes alias analysis from try_fuse_pair, which is unsafe.
>  - The patch tries to make its own RTL changes inside
>rs6000_gen_load_pair, but it should let fuse_pair make those changes
>using RTL-SSA instead.
> 
> I've left some more specific (but still mostly high-level) comments below.
> 
>>
>> For AARCH64 architectures just include "fusion-common.h"
>> and target dependent code can be added to that.
>>
>>
>> Alex/Richard:
>>
>> If you would like me to add for AARCH64 I can do that for AARCH64.
>>
>> If you would like to do that is fine with me.
>>
>> Bootstrapped and regtested with powerpc64-linux-gnu.
>>
>> Improvement in performance is seen with Spec 2017 spec FP benchmarks.
>>
>> Thanks & Regards
>> Ajit
>>
>> rs6000: New  pass for replacement of adjacent lxv with lxvp.
> 
> Are you looking to handle stores eventually, out of interest?  Looking
> at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling
> loads at the moment.
> 
>>
>> New pass to replace adjacent memory addresses lxv with lxvp.
>> Added common infrastructure for load store fusion for
>> different targets.
>>
>> Common routines are refactored in fusion-common.h.
> 
> I've just done a very quick scan through this file as it mostly just
> looks to be identical to existing code in aarch64-ldp-fusion.cc.
> 
>>
>> 202

[PATCH] rs6000: New pass for replacement of adjacent lxv with lxvp.

2024-01-09 Thread Ajit Agarwal
Hello All:

This pass is registered before the ira rtl pass.
Bootstrapped and regtested on powerpc64-linux-gnu.

No regressions on the SPEC 2017 benchmarks, and improvements for some of
the FP and INT benchmarks.

Vladimir:

I have modified the IRA and LRA register allocators. Please review.

Thanks & Regards
Ajit

rs6000: New pass for replacement of adjacent lxv with lxvp.

New pass to replace adjacent memory addresses lxv with lxvp.
This pass is registered before ira rtl pass.

2024-01-09  Ajit Kumar Agarwal  

gcc/ChangeLog:

* config/rs6000/rs6000-passes.def: Registered vecload pass.
* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
* config.gcc: Add new executable.
* config/rs6000/rs6000-protos.h: Add new prototype for vecload
pass.
* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
* config/rs6000/t-rs6000: Add new rule.
* ira-color.cc: Form register pair with adjacent loads.
* lra-assigns.cc: Skip modifying register pair assignment.
* lra-int.h: Add pseudo_conflict field in lra_reg_p structure.
* lra.cc: Initialize pseudo_conflict field.
* ira-build.cc: Use of REG_FREQ.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/vecload.C: New test.
* g++.target/powerpc/vecload1.C: New test.
* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc|   4 +-
 gcc/config/rs6000/rs6000-passes.def   |   1 +
 gcc/config/rs6000/rs6000-protos.h |   5 +-
 gcc/config/rs6000/rs6000-vecload-opt.cc   | 395 ++
 gcc/config/rs6000/rs6000.cc   |   8 +-
 gcc/config/rs6000/t-rs6000|   5 +
 gcc/ira-build.cc  |   2 +-
 gcc/ira-color.cc  | 214 +-
 gcc/lra-assigns.cc| 103 -
 gcc/lra-int.h |   1 +
 gcc/lra.cc|   1 +
 gcc/testsuite/g++.target/powerpc/vecload.C|  15 +
 gcc/testsuite/g++.target/powerpc/vecload1.C   |  22 +
 .../gcc.target/powerpc/mma-builtin-1.c|   4 +-
 14 files changed, 766 insertions(+), 14 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload1.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index f0676c830e8..4cf15e807de 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -518,7 +518,7 @@ or1k*-*-*)
;;
 powerpc*-*-*)
cpu_type=rs6000
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -555,7 +555,7 @@ riscv*)
;;
 rs6000*-*-*)
extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+   extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..e6a9810ee24 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
  The power8 does not have instructions that automaticaly do the byte swaps
  for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_BEFORE (pass_ira, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
  external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..83ee773a6f8 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -343,12 +343,15 @@ namespace gcc { class context; }
 class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
 extern rtx rs6000_gen_stvx (enum machine_mode, rtx, rtx);
-
+extern bool mode_supports_dq_form (machine_mode

Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-12-01 Thread Ajit Agarwal
Hello Kewen:

On 24/11/23 3:01 pm, Kewen.Lin wrote:
> Hi Ajit,
> 
> Don't forget to CC David (CC-ed) :), some comments are inlined below.
> 
> on 2023/10/8 03:04, Ajit Agarwal wrote:
>> Hello All:
>>
>> This patch adds a new pass to replace contiguous-address vector loads
>> (lxv) with the mma instruction lxvp.
> 
> IMHO the current binding of lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong;
> only Power10 and VSX are required, and these instructions should perform
> well without MMA support.  So a patch to separate their support from MMA
> should go first.
> 

I will make the changes for Power10 and VSX.

>> This patch addresses one regression failure on the ARM architecture.
> 
> Could you explain this?  I don't see any test case for this.

I submitted v1 of the patch and there were regression failures reported
by Linaro. I have fixed them in V2.
> 
>> Bootstrapped and regtested with powerpc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> rs6000: Add new pass for replacement of contiguous lxv with lxvp.
>>
>> New pass to replace contiguous addresses lxv with lxvp. This pass
>> is registered after ree rtl pass.
>>
>> 2023-10-07  Ajit Kumar Agarwal  
>>
>> gcc/ChangeLog:
>>
>>  * config/rs6000/rs6000-passes.def: Registered vecload pass.
>>  * config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>>  * config.gcc: Add new executable.
>>  * config/rs6000/rs6000-protos.h: Add new prototype for vecload
>>  pass.
>>  * config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>>  * config/rs6000/t-rs6000: Add new rule.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * g++.target/powerpc/vecload.C: New test.
>> ---
>>  gcc/config.gcc |   4 +-
>>  gcc/config/rs6000/rs6000-passes.def|   1 +
>>  gcc/config/rs6000/rs6000-protos.h  |   2 +
>>  gcc/config/rs6000/rs6000-vecload-opt.cc| 234 +
>>  gcc/config/rs6000/rs6000.cc|   3 +-
>>  gcc/config/rs6000/t-rs6000 |   4 +
>>  gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
>>  7 files changed, 260 insertions(+), 3 deletions(-)
>>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
>>
>> diff --git a/gcc/config.gcc b/gcc/config.gcc
>> index ee46d96bf62..482ab094b89 100644
>> --- a/gcc/config.gcc
>> +++ b/gcc/config.gcc
>> @@ -515,7 +515,7 @@ or1k*-*-*)
>>  ;;
>>  powerpc*-*-*)
>>  cpu_type=rs6000
>> -extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>>  extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
>> @@ -552,7 +552,7 @@ riscv*)
>>  ;;
>>  rs6000*-*-*)
>>  extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
>> -extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>>  target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
>> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
>> index ca899d5f7af..9ecf8ce6a9c 100644
>> --- a/gcc/config/rs6000/rs6000-passes.def
>> +++ b/gcc/config/rs6000/rs6000-passes.def
>> @@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
>>   The power8 does not have instructions that automaticaly do the byte swaps
>>   for loads and stores.  */
>>INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
>> +  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
>>  
>>/* Pass to do the PCREL_OPT optimization that combines the load of an
>>   external symbol's address along with a single load or store using that
>> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
>> index f70118ea40f..9c44bae

Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp

2023-12-01 Thread Ajit Agarwal



On 28/11/23 3:14 pm, Kewen.Lin wrote:
> on 2023/11/28 15:05, Michael Meissner wrote:
>> I tried using this patch to compare with the vector size attribute patch I
>> posted.  I could not build it as a cross compiler on my x86_64 because the
>> assembler gives the following error:
>>
>> Error: operand out of domain (11 is not a multiple of 2) for
>> std_stacktrace-elf.o.  If you look at the assembler, it has combined a lxvp 11
>> and lxvp 12 into:
>>
>> lxvp 11,0(9)
>>
>> The powerpc architecture requires that registers that are loaded with load
>> vector pair and stored with store vector pair instructions only load/store
>> even/odd register pairs, and not odd/even pairs.  Unfortunately, it will mean
>> that this optimization will match less often.
>>
> 
> Yes, the current implementation needs some refinements, as commented in [1]:
> 
>> Besides, it seems a bad idea to put this pass after reload?  Once register
>> allocation finishes, this pairing has to be restricted by the reg No. (I didn't
>> see any checking on the reg No. relationship for pairing, btw.)
>>
>> Looking forward to the comments from Segher/David/Peter/Mike etc.
> 
> I wonder if we should consider running such pass before reload instead.

Running the pass before reload deletes one of the lxv insns and replaces the
pair with lxvp. This then fails in the reload pass while it frees reg_equivs:
IRA populates them, the vecload pass deletes some of the insns, and the reload
pass segfaults when it touches an insn that has already been deleted.

Moving the vecload pass before IRA is also a problem, because IRA would then
not make the register pairs that lxvp needs.

Running the pass after reload is the only solution I see: IRA and reload make
the register pairs, which makes generating lxvp in the vecload pass easier.

Thanks & Regards
Ajit
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638070.html
> 
> BR,
> Kewen
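
For illustration only, the even/odd register constraint that Michael describes above can be sketched as a small standalone check. This is not code from the patch: the VecLoad struct and can_form_lxvp are hypothetical names, and the sketch assumes that a post-reload pass sees each lxv as a (target register, base register, byte offset) triple. It deliberately ignores which register of the pair receives the lower address (that depends on endianness) and any restriction on the displacement itself.

```cpp
#include <cstdlib>

// Hypothetical description of one 16-byte lxv load after register
// allocation.  None of these names come from the patch.
struct VecLoad {
  int vsx_reg;   // target vector register number
  int base_reg;  // register holding the base address
  long offset;   // byte displacement from the base
};

// Return true if the two loads are candidates for a single lxvp:
// same base register, adjacent 16-byte slots, and target registers
// that form an even/odd pair (lower number even, upper = lower + 1).
// Assumption: endian-dependent register ordering is not modelled.
bool can_form_lxvp (const VecLoad &a, const VecLoad &b)
{
  if (a.base_reg != b.base_reg)
    return false;
  if (std::labs (a.offset - b.offset) != 16)
    return false;
  int lo = a.vsx_reg < b.vsx_reg ? a.vsx_reg : b.vsx_reg;
  int hi = a.vsx_reg < b.vsx_reg ? b.vsx_reg : a.vsx_reg;
  return hi == lo + 1 && lo % 2 == 0;
}
```

With this check, registers 10 and 11 would be accepted, while 11 and 12 (the case the assembler rejected with "11 is not a multiple of 2") would not, which is why a pass running after register allocation will match less often.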

