[PATCH v4 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-21 Thread Andre Vieira
Hi,

This is a reworked patch series from v3.  The main differences are a further split
of the patches, where:
[1/5] is arm specific and has been approved before,
[2/5] is target agnostic, has had no substantial changes from v3.
[3/5] new arm specific patch that is split from the original last patch and
annotates across lane instructions that are safe for tail predication if their
tail predicated operands are zeroed.
[4/5] new arm specific patch that could be committed independent of the series to
fix an obvious issue and remove unused unspecs & iterators.
[5/5] reworked last patch, refactoring the implicit predication and some other
validity checks.

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
void __attribute__ ((noinline))
test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c += 8;
      a += 8;
      b += 8;
      n -= 8;
    }
}
```
... would output:

```

        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr    P0, ip  @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.

(there are also other inefficiencies with the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing LR by one, LETP will decrement it
  by FPSCR.LTPSIZE, which is the number of elements being
  processed in each iteration: 16 for 8-bit elements, 8 for 16-bit
  elements, etc. (the vectors are 128 bits wide, so 128 / element
  size lanes per iteration).
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.
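
To make the counting concrete, an illustrative trace for n = 20 16-bit
elements (8 lanes per iteration):

```
LR = 20:  iteration 1 processes lanes 0-7 (all active),       LR -> 12
LR = 12:  iteration 2 processes lanes 0-7 (all active),       LR -> 4
LR = 4:   iteration 3 processes lanes 0-3 (tail predicated);  loop exits
```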

The dlstp/letp loop now looks like:

```

        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16        q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14

```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LETP instruction here decrements LR by 8 in each iteration.
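
Conceptually, the transformed loop behaves as if it had been written with the
unpredicated MVE intrinsics below, with the hardware supplying the tail
predicate on the final iteration (an illustrative sketch only; written as
plain C this would read past the buffers on the last iteration, which is
exactly what the implicit predication prevents):

```
  while (n > 0)
    {
      int16x8_t va = vld1q_s16 (a);      /* vldrh.16, implicitly predicated */
      int16x8_t vb = vld1q_s16 (b);
      int16x8_t vc = vaddq_s16 (va, vb); /* vadd.i16 */
      vst1q_s16 (c, vc);                 /* vstrh.16, implicitly predicated */
      c += 8;
      a += 8;
      b += 8;
      n -= 8;
    }
```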

Stam Markianos-Wright (1):
  arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns

Andre Vieira (4):
  doloop: Add support for predicated vectorized loops
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


[PATCH v4 2/5] doloop: Add support for predicated vectorized loops

2024-02-21 Thread Andre Vieira

This patch adds support in the target agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.

The doloop_condition_get function is used to validate that the 'transformed'
jump instruction matches one of the condition forms that are correct for the
canonical loop format, where the loop counter is decremented by 1 each
iteration.  Arm's model of predicated vectorized hardware loops instead steps
the loop counter by the element count of each iteration.  This changes the
condition test in the jump instruction, so we had to add a new condition form
that doloop_condition_get accepts.
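
Concretely, for a loop processing eight 16-bit elements per iteration
(n = 8), the new condition form accepted by doloop_condition_get looks like
this (a sketch; the real pattern carries additional clobbers and uses):

```
(parallel
 [(set (pc) (if_then_else (gtu (plus (reg:SI LR) (const_int -8))
                               (const_int 7))
                          (label_ref L)
                          (pc)))
  (set (reg:SI LR) (plus (reg:SI LR) (const_int -8)))])
```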

gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Accept conditions generated by
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 
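
For reference, a sketch of how a caller can use the new df helper
(hypothetical caller code; the real uses are in the arm backend patch of this
series):

```
  /* Require that the loop counter has exactly one definition inside the
     loop body BODY, and fetch the defining insn.  */
  df_ref def = df_bb_regno_only_def_find (body, REGNO (counter));
  if (!def)
    return false;  /* No def, or more than one def: bail out.  */
  rtx_insn *def_insn = DF_REF_INSN (def);
```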

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+return temp;
+  else
+return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication, however we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+(const_int n-1))
+			   (label_ref)
+			   (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int 

[PATCH v4 1/5] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2024-02-21 Thread Andre Vieira

This patch adds an attribute to the mve md patterns to be able to identify
predicable MVE instructions and what their predicated and unpredicated variants
are.  This attribute is used to encode the icode of the unpredicated variant of
an instruction in its predicated variant.

This will make it possible for us to transform VPT-predicated insns in
the insn chain into their unpredicated equivalents when transforming the loop
into an MVE Tail-Predicated Low Overhead Loop.  For example:
`mve_vldrbq_z_ -> mve_vldrbq_`.
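
A generic sketch of the mechanism (hypothetical pattern and instruction
names; the real annotations are in the hunks below):

```
;; The predicated variant records the icode of its unpredicated twin in
;; the mve_unpredicated_insn attribute, so a later pass can look the
;; unpredicated pattern up and emit it instead.
(define_insn "mve_vfooq_m_<mode>"                ; predicated (hypothetical)
  [... predicated RTL elided ...]
  "TARGET_HAVE_MVE"
  "vpst\;vfoot.<V_sz_elem>\t%q0, %q1, %q2"
 [(set (attr "mve_unpredicated_insn")
       (symbol_ref "CODE_FOR_mve_vfooq_<mode>")) ; unpredicated twin
  (set_attr "predicated" "yes")])
```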

gcc/ChangeLog:

* config/arm/arm.md (mve_unpredicated_insn): New attribute.
* config/arm/arm.h (MVE_VPT_PREDICATED_INSN_P): New define.
(MVE_VPT_UNPREDICATED_INSN_P): Likewise.
(MVE_VPT_PREDICABLE_INSN_P): Likewise.
* config/arm/vec-common.md (mve_vshlq_): Add attribute.
* config/arm/mve.md (arm_vcx1q_p_v16qi): Add attribute.
(arm_vcx1qv16qi): Likewise.
(arm_vcx1qav16qi): Likewise.
(arm_vcx1qv16qi): Likewise.
(arm_vcx2q_p_v16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx2qav16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx3q_p_v16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(arm_vcx3qav16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(@mve_q_): Likewise.
(@mve_q_int_): Likewise.
(@mve_q_v4si): Likewise.
(@mve_q_n_): Likewise.
(@mve_q_r_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_n_): Likewise.
(@mve_q_m_r_): Likewise.
(@mve_q_m_f): Likewise.
(@mve_q_int_m_): Likewise.
(@mve_q_p_v4si): Likewise.
(@mve_q_p_): Likewise.
(@mve_q_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_f): Likewise.
(mve_vq_f): Likewise.
(mve_q): Likewise.
(mve_q_f): Likewise.
(mve_vadciq_v4si): Likewise.
(mve_vadciq_m_v4si): Likewise.
(mve_vadcq_v4si): Likewise.
(mve_vadcq_m_v4si): Likewise.
(mve_vandq_): Likewise.
(mve_vandq_f): Likewise.
(mve_vandq_m_): Likewise.
(mve_vandq_m_f): Likewise.
(mve_vandq_s): Likewise.
(mve_vandq_u): Likewise.
(mve_vbicq_): Likewise.
(mve_vbicq_f): Likewise.
(mve_vbicq_m_): Likewise.
(mve_vbicq_m_f): Likewise.
(mve_vbicq_m_n_): Likewise.
(mve_vbicq_n_): Likewise.
(mve_vbicq_s): Likewise.
(mve_vbicq_u): Likewise.
(@mve_vclzq_s): Likewise.
(mve_vclzq_u): Likewise.
(@mve_vcmp_q_): Likewise.
(@mve_vcmp_q_n_): Likewise.
(@mve_vcmp_q_f): Likewise.
(@mve_vcmp_q_n_f): Likewise.
(@mve_vcmp_q_m_f): Likewise.
(@mve_vcmp_q_m_n_): Likewise.
(@mve_vcmp_q_m_): Likewise.
(@mve_vcmp_q_m_n_f): Likewise.
(mve_vctpq): Likewise.
(mve_vctpq_m): Likewise.
(mve_vcvtaq_): Likewise.
(mve_vcvtaq_m_): Likewise.
(mve_vcvtbq_f16_f32v8hf): Likewise.
(mve_vcvtbq_f32_f16v4sf): Likewise.
(mve_vcvtbq_m_f16_f32v8hf): Likewise.
(mve_vcvtbq_m_f32_f16v4sf): Likewise.
(mve_vcvtmq_): Likewise.
(mve_vcvtmq_m_): Likewise.
(mve_vcvtnq_): Likewise.
(mve_vcvtnq_m_): Likewise.
(mve_vcvtpq_): Likewise.
(mve_vcvtpq_m_): Likewise.
(mve_vcvtq_from_f_): Likewise.
(mve_vcvtq_m_from_f_): Likewise.
(mve_vcvtq_m_n_from_f_): Likewise.
(mve_vcvtq_m_n_to_f_): Likewise.
(mve_vcvtq_m_to_f_): Likewise.
(mve_vcvtq_n_from_f_): Likewise.
(mve_vcvtq_n_to_f_): Likewise.
(mve_vcvtq_to_f_): Likewise.
(mve_vcvttq_f16_f32v8hf): Likewise.
(mve_vcvttq_f32_f16v4sf): Likewise.
(mve_vcvttq_m_f16_f32v8hf): Likewise.
(mve_vcvttq_m_f32_f16v4sf): Likewise.
(mve_vdwdupq_m_wb_u_insn): Likewise.
(mve_vdwdupq_wb_u_insn): Likewise.
(mve_veorq_s>): Likewise.
(mve_veorq_u>): Likewise.
(mve_veorq_f): Likewise.
(mve_vidupq_m_wb_u_insn): Likewise.
(mve_vidupq_u_insn): Likewise.
(mve_viwdupq_m_wb_u_insn): Likewise.
(mve_viwdupq_wb_u_insn): Likewise.
(mve_vldrbq_): Likewise.
(mve_vldrbq_gather_offset_): Likewise.
(mve_vldrbq_gather_offset_z_): Likewise.
(mve_vldrbq_z_): Likewise.
(mve_vldrdq_gather_base_v2di): Likewise.
(mve_vldrdq_gather_base_wb_v2di_insn): Likewise.
(mve_vldrdq_gather_base_wb_z_v2di_insn): Likewise.
(mve_vldrdq_gather_base_z_v2di): Likewise.
(mve_vldrdq_gather_offset_v2di): Likewise.
(mve_vldrdq_gather_offset_z_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_z_v2di): Likewise.
(mve_vldrhq_): Likewise.
(mve_vldrhq_fv8hf): Likewise.
(mve_vldrhq_gather_offset_): Likewi

[PATCH v4 3/5] arm: Annotate instructions with mve_safe_imp_xlane_pred

2024-02-21 Thread Andre Vieira

This patch annotates some MVE across lane instructions with a new attribute.
We use this attribute to let the compiler know that these instructions can be
safely implicitly predicated when tail predicating if their operands are
guaranteed to have zeroed tail predicated lanes.  These instructions were
selected because having the value 0 in those lanes and 'tail-predicating' those
lanes have the same effect.
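
The underlying invariant, sketched in plain C (illustrative code, not from
the patch): for these reductions 0 is an identity value, so a zeroed tail
lane cannot change the result:

```
#include <stdint.h>

/* An across-lane add reduction (the vaddvq family): a zeroed tail lane
   contributes 0, exactly as if it had been predicated out.  */
int32_t
sum_lanes (const int16_t v[8], int active)
{
  int32_t sum = 0;
  for (int i = 0; i < 8; i++)
    sum += (i < active) ? v[i] : 0;   /* inactive lane behaves as 0 */
  return sum;
}

/* Unsigned max (vmaxvq_u): no unsigned value is below 0, so a zeroed
   tail lane can never win and the annotation is safe.  Signed max
   (vmaxvq_s) is not: with all-negative data a zeroed lane would change
   the result, hence the "no" in the iterator attribute below.  */
uint16_t
umax_lanes (const uint16_t v[8], int active)
{
  uint16_t m = 0;                     /* identity for unsigned max */
  for (int i = 0; i < 8; i++)
    if (i < active && v[i] > m)
      m = v[i];
  return m;
}
```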

gcc/ChangeLog:

* config/arm/arm.md (mve_safe_imp_xlane_pred): New attribute.
* config/arm/iterators.md (mve_vmaxmin_safe_imp): New iterator
attribute.
* config/arm/mve.md (vaddvq_s, vaddvq_u, vaddlvq_s, vaddlvq_u,
vaddvaq_s, vaddvaq_u, vmaxavq_s, vmaxvq_u, vmladavq_s, vmladavq_u,
vmladavxq_s, vmlsdavq_s, vmlsdavxq_s, vaddlvaq_s, vaddlvaq_u,
vmlaldavq_u, vmlaldavq_s, vmlaldavq_u, vmlaldavxq_s, vmlsldavq_s,
vmlsldavxq_s, vrmlaldavhq_u, vrmlaldavhq_s, vrmlaldavhxq_s,
vrmlsldavhq_s, vrmlsldavhxq_s, vrmlaldavhaq_s, vrmlaldavhaq_u,
vrmlaldavhaxq_s, vrmlsldavhaq_s, vrmlsldavhaxq_s, vabavq_s, vabavq_u,
vmladavaq_u, vmladavaq_s, vmladavaxq_s, vmlsdavaq_s, vmlsdavaxq_s,
vmlaldavaq_s, vmlaldavaq_u, vmlaldavaxq_s, vmlsldavaq_s,
vmlsldavaxq_s): Added mve_safe_imp_xlane_pred.

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 3f2863adf44..58619393858 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -130,6 +130,12 @@ (define_attr "predicated" "yes,no" (const_string "no"))
 ; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
 
+; An attribute used by the loop-doloop pass when determining whether it is
+; safe to predicate a MVE instruction, that operates across lanes, and was
+; previously not predicated.  The pass will still check whether all inputs
+; are predicated by the VCTP predication mask.
+(define_attr "mve_safe_imp_xlane_pred" "yes,no" (const_string "no"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 7600bf62531..22b3ddf5637 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -869,6 +869,14 @@ (define_code_attr mve_addsubmul [
 		 (plus "vadd")
 		 ])
 
+(define_int_attr mve_vmaxmin_safe_imp [
+		 (VMAXVQ_U "yes")
+		 (VMAXVQ_S "no")
+		 (VMAXAVQ_S "yes")
+		 (VMINVQ_U "no")
+		 (VMINVQ_S "no")
+		 (VMINAVQ_S "no")])
+
 (define_int_attr mve_cmp_op1 [
 		 (VCMPCSQ_M_U "cs")
 		 (VCMPEQQ_M_S "eq") (VCMPEQQ_M_U "eq")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 8aa0bded7f0..d7bdcd862f8 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -393,6 +393,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -529,6 +530,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -802,6 +804,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1014,6 +1017,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "<mve_vmaxmin_safe_imp>")
   (set_attr "type" "mve_move")
 ])
 
@@ -1033,6 +1037,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1219,6 +1224,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1450,6 +1456,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1588,6 +1595,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1725,6 +1733,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2, %q3"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1742,6 +1751,7 @@ (

[PATCH v4 4/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators

2024-02-21 Thread Andre Vieira

This patch fixes the erroneous use of a mode attribute without a mode iterator
in the pattern and removes unused unspecs and iterators.  (A mode attribute can
only be resolved when the pattern uses the corresponding mode iterator; in a
fixed-mode pattern like mve_vstrwq_p_fv4sf the predicate operand's mode must be
spelled out directly, here as V4BI.)

gcc/ChangeLog:

* config/arm/iterators.md (supf): Remove VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U cases.
(VMLALDAVXQ): Remove iterator.
(VMLALDAVXQ_P): Likewise.
(VMLALDAVAXQ): Likewise.
* config/arm/mve.md (mve_vstrwq_p_fv4sf): Replace use of 
mode iterator attribute with V4BI mode.
* config/arm/unspecs.md (VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U): Remove unused unspecs.

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 22b3ddf5637..3206bcab4cf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2370,7 +2370,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VSUBQ_S "s") (VSUBQ_U "u") (VADDVAQ_S "s")
 		   (VADDVAQ_U "u") (VADDLVAQ_S "s") (VADDLVAQ_U "u")
 		   (VBICQ_N_S "s") (VBICQ_N_U "u") (VMLALDAVQ_U "u")
-		   (VMLALDAVQ_S "s") (VMLALDAVXQ_U "u") (VMLALDAVXQ_S "s")
+		   (VMLALDAVQ_S "s") (VMLALDAVXQ_S "s")
 		   (VMOVNBQ_U "u") (VMOVNBQ_S "s") (VMOVNTQ_U "u")
 		   (VMOVNTQ_S "s") (VORRQ_N_S "s") (VORRQ_N_U "u")
 		   (VQMOVNBQ_U "u") (VQMOVNBQ_S "s") (VQMOVNTQ_S "s")
@@ -2412,8 +2412,8 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_M_S "s") (VREV16Q_M_U "u")
 		   (VQRSHRNTQ_N_U "u") (VMOVNTQ_M_U "u") (VMOVLBQ_M_U "u")
 		   (VMLALDAVAQ_U "u") (VQSHRNBQ_N_U "u") (VSHRNBQ_N_U "u")
-		   (VRSHRNBQ_N_U "u") (VMLALDAVXQ_P_U "u")
-		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u") (VMLALDAVAXQ_U "u")
+		   (VRSHRNBQ_N_U "u")
+		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u")
 		   (VQMOVNTQ_M_U "u") (VSHRNTQ_N_U "u") (VCVTMQ_M_S "s")
 		   (VCVTMQ_M_U "u") (VCVTNQ_M_S "s") (VCVTNQ_M_U "u")
 		   (VCVTPQ_M_S "s") (VCVTPQ_M_U "u") (VADDLVAQ_P_S "s")
@@ -2762,7 +2762,6 @@ (define_int_iterator VSUBQ_N [VSUBQ_N_S VSUBQ_N_U])
 (define_int_iterator VADDLVAQ [VADDLVAQ_S VADDLVAQ_U])
 (define_int_iterator VBICQ_N [VBICQ_N_S VBICQ_N_U])
 (define_int_iterator VMLALDAVQ [VMLALDAVQ_U VMLALDAVQ_S])
-(define_int_iterator VMLALDAVXQ [VMLALDAVXQ_U VMLALDAVXQ_S])
 (define_int_iterator VMOVNBQ [VMOVNBQ_U VMOVNBQ_S])
 (define_int_iterator VMOVNTQ [VMOVNTQ_S VMOVNTQ_U])
 (define_int_iterator VORRQ_N [VORRQ_N_U VORRQ_N_S])
@@ -2817,11 +2816,9 @@ (define_int_iterator VMLALDAVAQ [VMLALDAVAQ_S VMLALDAVAQ_U])
 (define_int_iterator VQSHRNBQ_N [VQSHRNBQ_N_U VQSHRNBQ_N_S])
 (define_int_iterator VSHRNBQ_N [VSHRNBQ_N_U VSHRNBQ_N_S])
 (define_int_iterator VRSHRNBQ_N [VRSHRNBQ_N_S VRSHRNBQ_N_U])
-(define_int_iterator VMLALDAVXQ_P [VMLALDAVXQ_P_U VMLALDAVXQ_P_S])
 (define_int_iterator VQMOVNTQ_M [VQMOVNTQ_M_U VQMOVNTQ_M_S])
 (define_int_iterator VMVNQ_M_N [VMVNQ_M_N_U VMVNQ_M_N_S])
 (define_int_iterator VQSHRNTQ_N [VQSHRNTQ_N_U VQSHRNTQ_N_S])
-(define_int_iterator VMLALDAVAXQ [VMLALDAVAXQ_S VMLALDAVAXQ_U])
 (define_int_iterator VSHRNTQ_N [VSHRNTQ_N_S VSHRNTQ_N_U])
 (define_int_iterator VCVTMQ_M [VCVTMQ_M_S VCVTMQ_M_U])
 (define_int_iterator VCVTNQ_M [VCVTNQ_M_S VCVTNQ_M_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index d7bdcd862f8..9fe51298cdc 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -4605,7 +4605,7 @@ (define_insn "mve_vstrwq_p_fv4sf"
   [(set (match_operand:V4SI 0 "mve_memory_operand" "=Ux")
 	(unspec:V4SI
 	 [(match_operand:V4SF 1 "s_register_operand" "w")
-	  (match_operand: 2 "vpr_register_operand" "Up")
+	  (match_operand:V4BI 2 "vpr_register_operand" "Up")
 	  (match_dup 0)]
 	 VSTRWQ_F))]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index b9db306c067..46ac8b37157 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -717,7 +717,6 @@ (define_c_enum "unspec" [
   VCVTBQ_F16_F32
   VCVTTQ_F16_F32
   VMLALDAVQ_U
-  VMLALDAVXQ_U
   VMLALDAVXQ_S
   VMLALDAVQ_S
   VMLSLDAVQ_S
@@ -934,7 +933,6 @@ (define_c_enum "unspec" [
   VSHRNBQ_N_S
   VRSHRNBQ_N_S
   VRSHRNBQ_N_U
-  VMLALDAVXQ_P_U
   VMLALDAVXQ_P_S
   VQMOVNTQ_M_U
   VQMOVNTQ_M_S
@@ -943,7 +941,6 @@ (define_c_enum "unspec" [
   VQSHRNTQ_N_U
   VQSHRNTQ_N_S
   VMLALDAVAXQ_S
-  VMLALDAVAXQ_U
   VSHRNTQ_N_S
   VSHRNTQ_N_U
   VCVTBQ_M_F16_F32


[PATCH v4 5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-21 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.

gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_mve_get_vctp_vec_form): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DLSTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2cd560c9925..34d6be76e94 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern int arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index e5a944486d7..4fdcf5ed82a 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1279 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
-{
-  basic_block bb = BLOCK_FOR_INSN (insn);
-  /* Make sure the basic block of the target insn is a simple latch
- having as single predecessor and successor the body of the loop
- itself.  Only simple loops with a single basic block as body are
- supported for 'low over head loop' making sure that LE target is
- above LE it

[PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-22 Thread Andre Vieira
Hi,

This is a reworked patch series from v4.  The main differences are a further split
of the patches, where:
[1/5] is arm specific and has been approved before,
[2/5] is target agnostic, has had no substantial changes from v3.
[3/5] new arm specific patch that is split from the original last patch and
annotates across lane instructions that are safe for tail predication if their
tail predicated operands are zeroed.
[4/5] new arm specific patch that could be committed independent of the series to
fix an obvious issue and remove unused unspecs & iterators.
[5/5] (v3-v4) reworked the last patch, refactoring the implicit predication and
some other validity checks; (v4-v5) removed the expectation that vctp
instructions are always zero extended, after this was fixed on trunk.

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
void __attribute__ ((noinline))
test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c += 8;
      a += 8;
      b += 8;
      n -= 8;
    }
}
```
... would output:

```

        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr    P0, ip  @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.

(there are also other inefficiencies with the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing LR by one, LETP will decrement it
  by FPSCR.LTPSIZE, which is the number of elements being
  processed in each iteration: 16 for 8-bit elements, 8 for 16-bit
  elements, etc. (the vectors are 128 bits wide, so 128 / element
  size lanes per iteration).
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.

The dlstp/letp loop now looks like:

```

        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16        q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14

```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LETP instruction here decrements LR by 8 in each iteration.

Stam Markianos-Wright (1):
  arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns

Andre Vieira (4):
  doloop: Add support for predicated vectorized loops
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


[PATCH v5 2/5] doloop: Add support for predicated vectorized loops

2024-02-22 Thread Andre Vieira

This patch adds support in the target agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.

gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Add support for detecting
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/df-core.cc |  15 +
 gcc/df.h   |   1 +
 gcc/loop-doloop.cc | 164 +++--
 3 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+return temp;
+  else
+return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication, however we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+(const_int n-1))
+			   (label_ref)
+			   (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
-  || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !rtx_equal_p (XEXP (inc_src, 0), reg)
+  || !CONST_IN

[PATCH v5 3/5] arm: Annotate instructions with mve_safe_imp_xlane_pred

2024-02-22 Thread Andre Vieira

This patch annotates some MVE across lane instructions with a new attribute.
We use this attribute to let the compiler know that these instructions can be
safely implicitly predicated when tail predicating if their operands are
guaranteed to have zeroed tail predicated lanes.  These instructions were
selected because having the value 0 in those lanes and 'tail-predicating' those
lanes have the same effect.

gcc/ChangeLog:

* config/arm/arm.md (mve_safe_imp_xlane_pred): New attribute.
* config/arm/iterators.md (mve_vmaxmin_safe_imp): New iterator
attribute.
* config/arm/mve.md (vaddvq_s, vaddvq_u, vaddlvq_s, vaddlvq_u,
vaddvaq_s, vaddvaq_u, vmaxavq_s, vmaxvq_u, vmladavq_s, vmladavq_u,
vmladavxq_s, vmlsdavq_s, vmlsdavxq_s, vaddlvaq_s, vaddlvaq_u,
vmlaldavq_u, vmlaldavq_s, vmlaldavq_u, vmlaldavxq_s, vmlsldavq_s,
vmlsldavxq_s, vrmlaldavhq_u, vrmlaldavhq_s, vrmlaldavhxq_s,
vrmlsldavhq_s, vrmlsldavhxq_s, vrmlaldavhaq_s, vrmlaldavhaq_u,
vrmlaldavhaxq_s, vrmlsldavhaq_s, vrmlsldavhaxq_s, vabavq_s, vabavq_u,
vmladavaq_u, vmladavaq_s, vmladavaxq_s, vmlsdavaq_s, vmlsdavaxq_s,
vmlaldavaq_s, vmlaldavaq_u, vmlaldavaxq_s, vmlsldavaq_s,
vmlsldavaxq_s): Added mve_safe_imp_xlane_pred.
---
 gcc/config/arm/arm.md   |  6 ++
 gcc/config/arm/iterators.md |  8 
 gcc/config/arm/mve.md   | 12 
 3 files changed, 26 insertions(+)

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 81290e83818..814e871acea 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -130,6 +130,12 @@ (define_attr "predicated" "yes,no" (const_string "no"))
 ; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
 
+; An attribute used by the loop-doloop pass when determining whether it is
+; safe to predicate a MVE instruction, that operates across lanes, and was
+; previously not predicated.  The pass will still check whether all inputs
+; are predicated by the VCTP predication mask.
+(define_attr "mve_safe_imp_xlane_pred" "yes,no" (const_string "no"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 7600bf62531..22b3ddf5637 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -869,6 +869,14 @@ (define_code_attr mve_addsubmul [
 		 (plus "vadd")
 		 ])
 
+(define_int_attr mve_vmaxmin_safe_imp [
+		 (VMAXVQ_U "yes")
+		 (VMAXVQ_S "no")
+		 (VMAXAVQ_S "yes")
+		 (VMINVQ_U "no")
+		 (VMINVQ_S "no")
+		 (VMINAVQ_S "no")])
+
 (define_int_attr mve_cmp_op1 [
 		 (VCMPCSQ_M_U "cs")
 		 (VCMPEQQ_M_S "eq") (VCMPEQQ_M_U "eq")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 8aa0bded7f0..d7bdcd862f8 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -393,6 +393,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -529,6 +530,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -802,6 +804,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1014,6 +1017,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "<mve_vmaxmin_safe_imp>")
   (set_attr "type" "mve_move")
 ])
 
@@ -1033,6 +1037,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1219,6 +1224,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1450,6 +1456,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1588,6 +1595,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1725,6 +1733,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2, %q3"
  [(set (att

[PATCH v5 4/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators

2024-02-22 Thread Andre Vieira

This patch fixes the erroneous use of a mode attribute without a mode iterator
in the pattern and removes unused unspecs and iterators.

gcc/ChangeLog:

* config/arm/iterators.md (supf): Remove VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U cases.
(VMLALDAVXQ): Remove iterator.
(VMLALDAVXQ_P): Likewise.
(VMLALDAVAXQ): Likewise.
* config/arm/mve.md (mve_vstrwq_p_fv4sf): Replace use of 
mode iterator attribute with V4BI mode.
* config/arm/unspecs.md (VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U): Remove unused unspecs.
---
 gcc/config/arm/iterators.md | 9 +++--
 gcc/config/arm/mve.md   | 2 +-
 gcc/config/arm/unspecs.md   | 3 ---
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 22b3ddf5637..3206bcab4cf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2370,7 +2370,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VSUBQ_S "s") (VSUBQ_U "u") (VADDVAQ_S "s")
 		   (VADDVAQ_U "u") (VADDLVAQ_S "s") (VADDLVAQ_U "u")
 		   (VBICQ_N_S "s") (VBICQ_N_U "u") (VMLALDAVQ_U "u")
-		   (VMLALDAVQ_S "s") (VMLALDAVXQ_U "u") (VMLALDAVXQ_S "s")
+		   (VMLALDAVQ_S "s") (VMLALDAVXQ_S "s")
 		   (VMOVNBQ_U "u") (VMOVNBQ_S "s") (VMOVNTQ_U "u")
 		   (VMOVNTQ_S "s") (VORRQ_N_S "s") (VORRQ_N_U "u")
 		   (VQMOVNBQ_U "u") (VQMOVNBQ_S "s") (VQMOVNTQ_S "s")
@@ -2412,8 +2412,8 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_M_S "s") (VREV16Q_M_U "u")
 		   (VQRSHRNTQ_N_U "u") (VMOVNTQ_M_U "u") (VMOVLBQ_M_U "u")
 		   (VMLALDAVAQ_U "u") (VQSHRNBQ_N_U "u") (VSHRNBQ_N_U "u")
-		   (VRSHRNBQ_N_U "u") (VMLALDAVXQ_P_U "u")
-		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u") (VMLALDAVAXQ_U "u")
+		   (VRSHRNBQ_N_U "u")
+		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u")
 		   (VQMOVNTQ_M_U "u") (VSHRNTQ_N_U "u") (VCVTMQ_M_S "s")
 		   (VCVTMQ_M_U "u") (VCVTNQ_M_S "s") (VCVTNQ_M_U "u")
 		   (VCVTPQ_M_S "s") (VCVTPQ_M_U "u") (VADDLVAQ_P_S "s")
@@ -2762,7 +2762,6 @@ (define_int_iterator VSUBQ_N [VSUBQ_N_S VSUBQ_N_U])
 (define_int_iterator VADDLVAQ [VADDLVAQ_S VADDLVAQ_U])
 (define_int_iterator VBICQ_N [VBICQ_N_S VBICQ_N_U])
 (define_int_iterator VMLALDAVQ [VMLALDAVQ_U VMLALDAVQ_S])
-(define_int_iterator VMLALDAVXQ [VMLALDAVXQ_U VMLALDAVXQ_S])
 (define_int_iterator VMOVNBQ [VMOVNBQ_U VMOVNBQ_S])
 (define_int_iterator VMOVNTQ [VMOVNTQ_S VMOVNTQ_U])
 (define_int_iterator VORRQ_N [VORRQ_N_U VORRQ_N_S])
@@ -2817,11 +2816,9 @@ (define_int_iterator VMLALDAVAQ [VMLALDAVAQ_S VMLALDAVAQ_U])
 (define_int_iterator VQSHRNBQ_N [VQSHRNBQ_N_U VQSHRNBQ_N_S])
 (define_int_iterator VSHRNBQ_N [VSHRNBQ_N_U VSHRNBQ_N_S])
 (define_int_iterator VRSHRNBQ_N [VRSHRNBQ_N_S VRSHRNBQ_N_U])
-(define_int_iterator VMLALDAVXQ_P [VMLALDAVXQ_P_U VMLALDAVXQ_P_S])
 (define_int_iterator VQMOVNTQ_M [VQMOVNTQ_M_U VQMOVNTQ_M_S])
 (define_int_iterator VMVNQ_M_N [VMVNQ_M_N_U VMVNQ_M_N_S])
 (define_int_iterator VQSHRNTQ_N [VQSHRNTQ_N_U VQSHRNTQ_N_S])
-(define_int_iterator VMLALDAVAXQ [VMLALDAVAXQ_S VMLALDAVAXQ_U])
 (define_int_iterator VSHRNTQ_N [VSHRNTQ_N_S VSHRNTQ_N_U])
 (define_int_iterator VCVTMQ_M [VCVTMQ_M_S VCVTMQ_M_U])
 (define_int_iterator VCVTNQ_M [VCVTNQ_M_S VCVTNQ_M_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index d7bdcd862f8..9fe51298cdc 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -4605,7 +4605,7 @@ (define_insn "mve_vstrwq_p_fv4sf"
   [(set (match_operand:V4SI 0 "mve_memory_operand" "=Ux")
 	(unspec:V4SI
 	 [(match_operand:V4SF 1 "s_register_operand" "w")
-	  (match_operand: 2 "vpr_register_operand" "Up")
+	  (match_operand:V4BI 2 "vpr_register_operand" "Up")
 	  (match_dup 0)]
 	 VSTRWQ_F))]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index b9db306c067..46ac8b37157 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -717,7 +717,6 @@ (define_c_enum "unspec" [
   VCVTBQ_F16_F32
   VCVTTQ_F16_F32
   VMLALDAVQ_U
-  VMLALDAVXQ_U
   VMLALDAVXQ_S
   VMLALDAVQ_S
   VMLSLDAVQ_S
@@ -934,7 +933,6 @@ (define_c_enum "unspec" [
   VSHRNBQ_N_S
   VRSHRNBQ_N_S
   VRSHRNBQ_N_U
-  VMLALDAVXQ_P_U
   VMLALDAVXQ_P_S
   VQMOVNTQ_M_U
   VQMOVNTQ_M_S
@@ -943,7 +941,6 @@ (define_c_enum "unspec" [
   VQSHRNTQ_N_U
   VQSHRNTQ_N_S
   VMLALDAVAXQ_S
-  VMLALDAVAXQ_U
   VSHRNTQ_N_S
   VSHRNTQ_N_U
   VCVTBQ_M_F16_F32


[PATCH v5 1/5] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2024-02-22 Thread Andre Vieira

This patch adds an attribute to the mve md patterns to be able to identify
predicable MVE instructions and what their predicated and unpredicated variants
are.  This attribute is used to encode the icode of the unpredicated variant of
an instruction in its predicated variant.

This will make it possible for us to transform VPT-predicated insns in
the insn chain into their unpredicated equivalents when transforming the loop
into an MVE Tail-Predicated Low Overhead Loop.  For example:
`mve_vldrbq_z_ -> mve_vldrbq_`.

gcc/ChangeLog:

* config/arm/arm.md (mve_unpredicated_insn): New attribute.
* config/arm/arm.h (MVE_VPT_PREDICATED_INSN_P): New define.
(MVE_VPT_UNPREDICATED_INSN_P): Likewise.
(MVE_VPT_PREDICABLE_INSN_P): Likewise.
* config/arm/vec-common.md (mve_vshlq_): Add attribute.
* config/arm/mve.md (arm_vcx1q_p_v16qi): Add attribute.
(arm_vcx1qv16qi): Likewise.
(arm_vcx1qav16qi): Likewise.
(arm_vcx1qv16qi): Likewise.
(arm_vcx2q_p_v16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx2qav16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx3q_p_v16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(arm_vcx3qav16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(@mve_q_): Likewise.
(@mve_q_int_): Likewise.
(@mve_q_v4si): Likewise.
(@mve_q_n_): Likewise.
(@mve_q_r_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_n_): Likewise.
(@mve_q_m_r_): Likewise.
(@mve_q_m_f): Likewise.
(@mve_q_int_m_): Likewise.
(@mve_q_p_v4si): Likewise.
(@mve_q_p_): Likewise.
(@mve_q_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_f): Likewise.
(mve_vq_f): Likewise.
(mve_q): Likewise.
(mve_q_f): Likewise.
(mve_vadciq_v4si): Likewise.
(mve_vadciq_m_v4si): Likewise.
(mve_vadcq_v4si): Likewise.
(mve_vadcq_m_v4si): Likewise.
(mve_vandq_): Likewise.
(mve_vandq_f): Likewise.
(mve_vandq_m_): Likewise.
(mve_vandq_m_f): Likewise.
(mve_vandq_s): Likewise.
(mve_vandq_u): Likewise.
(mve_vbicq_): Likewise.
(mve_vbicq_f): Likewise.
(mve_vbicq_m_): Likewise.
(mve_vbicq_m_f): Likewise.
(mve_vbicq_m_n_): Likewise.
(mve_vbicq_n_): Likewise.
(mve_vbicq_s): Likewise.
(mve_vbicq_u): Likewise.
(@mve_vclzq_s): Likewise.
(mve_vclzq_u): Likewise.
(@mve_vcmp_q_): Likewise.
(@mve_vcmp_q_n_): Likewise.
(@mve_vcmp_q_f): Likewise.
(@mve_vcmp_q_n_f): Likewise.
(@mve_vcmp_q_m_f): Likewise.
(@mve_vcmp_q_m_n_): Likewise.
(@mve_vcmp_q_m_): Likewise.
(@mve_vcmp_q_m_n_f): Likewise.
(mve_vctpq): Likewise.
(mve_vctpq_m): Likewise.
(mve_vcvtaq_): Likewise.
(mve_vcvtaq_m_): Likewise.
(mve_vcvtbq_f16_f32v8hf): Likewise.
(mve_vcvtbq_f32_f16v4sf): Likewise.
(mve_vcvtbq_m_f16_f32v8hf): Likewise.
(mve_vcvtbq_m_f32_f16v4sf): Likewise.
(mve_vcvtmq_): Likewise.
(mve_vcvtmq_m_): Likewise.
(mve_vcvtnq_): Likewise.
(mve_vcvtnq_m_): Likewise.
(mve_vcvtpq_): Likewise.
(mve_vcvtpq_m_): Likewise.
(mve_vcvtq_from_f_): Likewise.
(mve_vcvtq_m_from_f_): Likewise.
(mve_vcvtq_m_n_from_f_): Likewise.
(mve_vcvtq_m_n_to_f_): Likewise.
(mve_vcvtq_m_to_f_): Likewise.
(mve_vcvtq_n_from_f_): Likewise.
(mve_vcvtq_n_to_f_): Likewise.
(mve_vcvtq_to_f_): Likewise.
(mve_vcvttq_f16_f32v8hf): Likewise.
(mve_vcvttq_f32_f16v4sf): Likewise.
(mve_vcvttq_m_f16_f32v8hf): Likewise.
(mve_vcvttq_m_f32_f16v4sf): Likewise.
(mve_vdwdupq_m_wb_u_insn): Likewise.
(mve_vdwdupq_wb_u_insn): Likewise.
(mve_veorq_s>): Likewise.
(mve_veorq_u>): Likewise.
(mve_veorq_f): Likewise.
(mve_vidupq_m_wb_u_insn): Likewise.
(mve_vidupq_u_insn): Likewise.
(mve_viwdupq_m_wb_u_insn): Likewise.
(mve_viwdupq_wb_u_insn): Likewise.
(mve_vldrbq_): Likewise.
(mve_vldrbq_gather_offset_): Likewise.
(mve_vldrbq_gather_offset_z_): Likewise.
(mve_vldrbq_z_): Likewise.
(mve_vldrdq_gather_base_v2di): Likewise.
(mve_vldrdq_gather_base_wb_v2di_insn): Likewise.
(mve_vldrdq_gather_base_wb_z_v2di_insn): Likewise.
(mve_vldrdq_gather_base_z_v2di): Likewise.
(mve_vldrdq_gather_offset_v2di): Likewise.
(mve_vldrdq_gather_offset_z_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_z_v2di): Likewise.
(mve_vldrhq_): Likewise.
(mve_vldrhq_fv8hf): Likewise.
(mve_vldrhq_gather_offset_): Likewi

[PATCH v5 5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-22 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.

gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DLSTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 23 files changed, 3321 insertions(+), 82 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 gcc/testsuite/gcc.target

[PATCH v5 0/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-27 Thread Andre Vieira
Hi,

Re-ordered the patches; our latest plan is to only commit patches 1-3 and leave
4-5 for GCC 15, as we believe it is too late in Stage 4 to be making changes to
the target agnostic parts, especially since these affect so many ports that we
cannot easily test.

[1/5] arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns
[2/5] arm: Annotate instructions with mve_safe_imp_xlane_pred
[3/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators
[4/5] doloop: Add support for predicated vectorized loops
[5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

Original cover letter:
This patch adds support for Arm's MVE Tail Predicated Low Overhead Loop
feature.

The M-class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
Section B5.5.1 "Loop tail predication" describes the feature
we are adding support for with this patch (although
we only add codegen for DLSTP/LETP instruction loops).

Previously with commit d2ed233cb94 we'd added support for
non-MVE DLS/LE loops through the loop-doloop pass, which, given
a standard MVE loop like:

```
void __attribute__ ((noinline))
test (int16_t *a, int16_t *b, int16_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      int16x8_t vb = vldrhq_z_s16 (b, p);
      int16x8_t vc = vaddq_x_s16 (va, vb, p);
      vstrhq_p_s16 (c, vc, p);
      c += 8;
      a += 8;
      b += 8;
      n -= 8;
    }
}
```
... would output:

```

        dls     lr, lr
.L3:
        vctp.16 r3
        vmrs    ip, P0  @ movhi
        sxth    ip, ip
        vmsr    P0, ip  @ movhi
        mov     r4, r0
        vpst
        vldrht.16       q2, [r4]
        mov     r4, r1
        vmov    q3, q0
        vpst
        vldrht.16       q1, [r4]
        mov     r4, r2
        vpst
        vaddt.i16       q3, q2, q1
        subs    r3, r3, #8
        vpst
        vstrht.16       q3, [r4]
        adds    r0, r0, #16
        adds    r1, r1, #16
        adds    r2, r2, #16
        le      lr, .L3
```

where the LE instruction will decrement LR by 1, compare and
branch if needed.

(there are also other inefficiencies with the above code, like the
pointless vmrs/sxth/vmsr on the VPR, the adds not being merged
into the vldrht/vstrht as #16 offsets, and some random movs!
But those are different problems...)

The MVE version is similar, except that:
* Instead of DLS/LE the instructions are DLSTP/LETP.
* Instead of pre-calculating the number of iterations of the
  loop, we place the number of elements to be processed by the
  loop into LR.
* Instead of decrementing LR by one, LETP will decrement it
  by FPSCR.LTPSIZE, which is the number of elements being
  processed in each iteration: 16 for 8-bit elements, 8 for 16-bit
  elements, etc. (the vectors are 128 bits wide, so 128 / element
  size lanes per iteration).
* On the final iteration, automatic Loop Tail Predication is
  performed, as if the instructions within the loop had been VPT
  predicated with a VCTP generating the VPR predicate in every
  loop iteration.

The dlstp/letp loop now looks like:

```

        dlstp.16        lr, r3
.L14:
        mov     r3, r0
        vldrh.16        q3, [r3]
        mov     r3, r1
        vldrh.16        q2, [r3]
        mov     r3, r2
        vadd.i16        q3, q3, q2
        adds    r0, r0, #16
        vstrh.16        q3, [r3]
        adds    r1, r1, #16
        adds    r2, r2, #16
        letp    lr, .L14

```

Since the loop tail predication is automatic, we have eliminated
the VCTP that had been specified by the user in the intrinsic
and converted the VPT-predicated instructions into their
unpredicated equivalents (which also saves us from VPST insns).

The LETP instruction here decrements LR by 8 in each iteration.

Stam Markianos-Wright (1):
  arm: Add define_attr to create a mapping between MVE predicated and
    unpredicated insns

Andre Vieira (4):
  arm: Annotate instructions with mve_safe_imp_xlane_pred
  arm: Fix a wrong attribute use and remove unused unspecs and iterators
  doloop: Add support for predicated vectorized loops
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


[PATCH v6 2/5] arm: Annotate instructions with mve_safe_imp_xlane_pred

2024-02-27 Thread Andre Vieira

This patch annotates some MVE across lane instructions with a new attribute.
We use this attribute to let the compiler know that these instructions can be
safely implicitly predicated when tail predicating if their operands are
guaranteed to have zeroed tail predicated lanes.  These instructions were
selected because having the value 0 in those lanes has the same effect as
'tail-predicating' those lanes.
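
As an illustration of why this is safe, consider a reduction like VADDV: a
lane that contains 0 contributes nothing to the sum, so an unpredicated VADDV
over a vector whose tail lanes were zeroed by a predicated _z load computes
the same value a tail-predicated VADDV would.  A hand-written sketch of such
a loop (my own example, not a test from this series):

```
#include <arm_mve.h>

int32_t sum (int16_t *a, int n)
{
  int32_t acc = 0;
  while (n > 0)
    {
      /* The _z load zeroes the lanes that vctp16q (n) predicates away...  */
      mve_pred16_t p = vctp16q (n);
      int16x8_t va = vldrhq_z_s16 (a, p);
      /* ...so this across-lane add is safe to leave unpredicated when the
	 loop is converted to dlstp/letp form.  */
      acc += vaddvq_s16 (va);
      a += 8;
      n -= 8;
    }
  return acc;
}
```

The same reasoning explains the new iterator below: an unsigned VMAXV is
annotated as safe, since a zeroed lane can never exceed any unsigned element,
whereas a signed VMAXV is not, since a zero lane would win against
all-negative data.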

gcc/ChangeLog:

* config/arm/arm.md (mve_safe_imp_xlane_pred): New attribute.
* config/arm/iterators.md (mve_vmaxmin_safe_imp): New iterator
attribute.
* config/arm/mve.md (vaddvq_s, vaddvq_u, vaddlvq_s, vaddlvq_u,
vaddvaq_s, vaddvaq_u, vmaxavq_s, vmaxvq_u, vmladavq_s, vmladavq_u,
vmladavxq_s, vmlsdavq_s, vmlsdavxq_s, vaddlvaq_s, vaddlvaq_u,
vmlaldavq_u, vmlaldavq_s, vmlaldavq_u, vmlaldavxq_s, vmlsldavq_s,
vmlsldavxq_s, vrmlaldavhq_u, vrmlaldavhq_s, vrmlaldavhxq_s,
vrmlsldavhq_s, vrmlsldavhxq_s, vrmlaldavhaq_s, vrmlaldavhaq_u,
vrmlaldavhaxq_s, vrmlsldavhaq_s, vrmlsldavhaxq_s, vabavq_s, vabavq_u,
vmladavaq_u, vmladavaq_s, vmladavaxq_s, vmlsdavaq_s, vmlsdavaxq_s,
vmlaldavaq_s, vmlaldavaq_u, vmlaldavaxq_s, vmlsldavaq_s,
vmlsldavaxq_s): Added mve_safe_imp_xlane_pred.
---
 gcc/config/arm/arm.md       |  6 ++++++
 gcc/config/arm/iterators.md |  8 ++++++++
 gcc/config/arm/mve.md       | 12 ++++++++++++
 3 files changed, 26 insertions(+)

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 81290e83818..814e871acea 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -130,6 +130,12 @@ (define_attr "predicated" "yes,no" (const_string "no"))
 ; encode that it is a predicable instruction.
 (define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
 
+; An attribute used by the loop-doloop pass when determining whether it is
+; safe to predicate a MVE instruction, that operates across lanes, and was
+; previously not predicated.  The pass will still check whether all inputs
+; are predicated by the VCTP predication mask.
+(define_attr "mve_safe_imp_xlane_pred" "yes,no" (const_string "no"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 7600bf62531..22b3ddf5637 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -869,6 +869,14 @@ (define_code_attr mve_addsubmul [
 		 (plus "vadd")
 		 ])
 
+(define_int_attr mve_vmaxmin_safe_imp [
+		 (VMAXVQ_U "yes")
+		 (VMAXVQ_S "no")
+		 (VMAXAVQ_S "yes")
+		 (VMINVQ_U "no")
+		 (VMINVQ_S "no")
+		 (VMINAVQ_S "no")])
+
 (define_int_attr mve_cmp_op1 [
 		 (VCMPCSQ_M_U "cs")
 		 (VCMPEQQ_M_S "eq") (VCMPEQQ_M_U "eq")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 8aa0bded7f0..d7bdcd862f8 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -393,6 +393,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -529,6 +530,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -802,6 +804,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1014,6 +1017,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "")
   (set_attr "type" "mve_move")
 ])
 
@@ -1033,6 +1037,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1219,6 +1224,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1450,6 +1456,7 @@ (define_insn "@mve_q_"
   "TARGET_HAVE_MVE"
   ".%#\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1588,6 +1595,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q1, %q2"
  [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_v4si"))
+  (set_attr "mve_safe_imp_xlane_pred" "yes")
   (set_attr "type" "mve_move")
 ])
 
@@ -1725,6 +1733,7 @@ (define_insn "@mve_q_v4si"
   "TARGET_HAVE_MVE"
   ".32\t%Q0, %R0, %q2, %q3"
  [(set (att

[PATCH v6 3/5] arm: Fix a wrong attribute use and remove unused unspecs and iterators

2024-02-27 Thread Andre Vieira

This patch fixes the erroneous use of a mode attribute without a mode iterator
in the pattern and removes unused unspecs and iterators.

gcc/ChangeLog:

* config/arm/iterators.md (supf): Remove VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U cases.
(VMLALDAVXQ): Remove iterator.
(VMLALDAVXQ_P): Likewise.
(VMLALDAVAXQ): Likewise.
* config/arm/mve.md (mve_vstrwq_p_fv4sf): Replace use of <MVE_VPRED>
mode iterator attribute with V4BI mode.
* config/arm/unspecs.md (VMLALDAVXQ_U, VMLALDAVXQ_P_U,
VMLALDAVAXQ_U): Remove unused unspecs.
---
 gcc/config/arm/iterators.md | 9 +++------
 gcc/config/arm/mve.md   | 2 +-
 gcc/config/arm/unspecs.md   | 3 ---
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 22b3ddf5637..3206bcab4cf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2370,7 +2370,7 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VSUBQ_S "s") (VSUBQ_U "u") (VADDVAQ_S "s")
 		   (VADDVAQ_U "u") (VADDLVAQ_S "s") (VADDLVAQ_U "u")
 		   (VBICQ_N_S "s") (VBICQ_N_U "u") (VMLALDAVQ_U "u")
-		   (VMLALDAVQ_S "s") (VMLALDAVXQ_U "u") (VMLALDAVXQ_S "s")
+		   (VMLALDAVQ_S "s") (VMLALDAVXQ_S "s")
 		   (VMOVNBQ_U "u") (VMOVNBQ_S "s") (VMOVNTQ_U "u")
 		   (VMOVNTQ_S "s") (VORRQ_N_S "s") (VORRQ_N_U "u")
 		   (VQMOVNBQ_U "u") (VQMOVNBQ_S "s") (VQMOVNTQ_S "s")
@@ -2412,8 +2412,8 @@ (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_M_S "s") (VREV16Q_M_U "u")
 		   (VQRSHRNTQ_N_U "u") (VMOVNTQ_M_U "u") (VMOVLBQ_M_U "u")
 		   (VMLALDAVAQ_U "u") (VQSHRNBQ_N_U "u") (VSHRNBQ_N_U "u")
-		   (VRSHRNBQ_N_U "u") (VMLALDAVXQ_P_U "u")
-		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u") (VMLALDAVAXQ_U "u")
+		   (VRSHRNBQ_N_U "u")
+		   (VMVNQ_M_N_U "u") (VQSHRNTQ_N_U "u")
 		   (VQMOVNTQ_M_U "u") (VSHRNTQ_N_U "u") (VCVTMQ_M_S "s")
 		   (VCVTMQ_M_U "u") (VCVTNQ_M_S "s") (VCVTNQ_M_U "u")
 		   (VCVTPQ_M_S "s") (VCVTPQ_M_U "u") (VADDLVAQ_P_S "s")
@@ -2762,7 +2762,6 @@ (define_int_iterator VSUBQ_N [VSUBQ_N_S VSUBQ_N_U])
 (define_int_iterator VADDLVAQ [VADDLVAQ_S VADDLVAQ_U])
 (define_int_iterator VBICQ_N [VBICQ_N_S VBICQ_N_U])
 (define_int_iterator VMLALDAVQ [VMLALDAVQ_U VMLALDAVQ_S])
-(define_int_iterator VMLALDAVXQ [VMLALDAVXQ_U VMLALDAVXQ_S])
 (define_int_iterator VMOVNBQ [VMOVNBQ_U VMOVNBQ_S])
 (define_int_iterator VMOVNTQ [VMOVNTQ_S VMOVNTQ_U])
 (define_int_iterator VORRQ_N [VORRQ_N_U VORRQ_N_S])
@@ -2817,11 +2816,9 @@ (define_int_iterator VMLALDAVAQ [VMLALDAVAQ_S VMLALDAVAQ_U])
 (define_int_iterator VQSHRNBQ_N [VQSHRNBQ_N_U VQSHRNBQ_N_S])
 (define_int_iterator VSHRNBQ_N [VSHRNBQ_N_U VSHRNBQ_N_S])
 (define_int_iterator VRSHRNBQ_N [VRSHRNBQ_N_S VRSHRNBQ_N_U])
-(define_int_iterator VMLALDAVXQ_P [VMLALDAVXQ_P_U VMLALDAVXQ_P_S])
 (define_int_iterator VQMOVNTQ_M [VQMOVNTQ_M_U VQMOVNTQ_M_S])
 (define_int_iterator VMVNQ_M_N [VMVNQ_M_N_U VMVNQ_M_N_S])
 (define_int_iterator VQSHRNTQ_N [VQSHRNTQ_N_U VQSHRNTQ_N_S])
-(define_int_iterator VMLALDAVAXQ [VMLALDAVAXQ_S VMLALDAVAXQ_U])
 (define_int_iterator VSHRNTQ_N [VSHRNTQ_N_S VSHRNTQ_N_U])
 (define_int_iterator VCVTMQ_M [VCVTMQ_M_S VCVTMQ_M_U])
 (define_int_iterator VCVTNQ_M [VCVTNQ_M_S VCVTNQ_M_U])
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index d7bdcd862f8..9fe51298cdc 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -4605,7 +4605,7 @@ (define_insn "mve_vstrwq_p_fv4sf"
   [(set (match_operand:V4SI 0 "mve_memory_operand" "=Ux")
 	(unspec:V4SI
 	 [(match_operand:V4SF 1 "s_register_operand" "w")
-	  (match_operand: 2 "vpr_register_operand" "Up")
+	  (match_operand:V4BI 2 "vpr_register_operand" "Up")
 	  (match_dup 0)]
 	 VSTRWQ_F))]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md
index b9db306c067..46ac8b37157 100644
--- a/gcc/config/arm/unspecs.md
+++ b/gcc/config/arm/unspecs.md
@@ -717,7 +717,6 @@ (define_c_enum "unspec" [
   VCVTBQ_F16_F32
   VCVTTQ_F16_F32
   VMLALDAVQ_U
-  VMLALDAVXQ_U
   VMLALDAVXQ_S
   VMLALDAVQ_S
   VMLSLDAVQ_S
@@ -934,7 +933,6 @@ (define_c_enum "unspec" [
   VSHRNBQ_N_S
   VRSHRNBQ_N_S
   VRSHRNBQ_N_U
-  VMLALDAVXQ_P_U
   VMLALDAVXQ_P_S
   VQMOVNTQ_M_U
   VQMOVNTQ_M_S
@@ -943,7 +941,6 @@ (define_c_enum "unspec" [
   VQSHRNTQ_N_U
   VQSHRNTQ_N_S
   VMLALDAVAXQ_S
-  VMLALDAVAXQ_U
   VSHRNTQ_N_S
   VSHRNTQ_N_U
   VCVTBQ_M_F16_F32


[PATCH v6 1/5] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2024-02-27 Thread Andre Vieira

This patch adds an attribute to the mve md patterns to be able to identify
predicable MVE instructions and what their predicated and unpredicated variants
are.  This attribute is used to encode the icode of the unpredicated variant of
an instruction in its predicated variant.

This will make it possible for us to transform VPT-predicated insns in
the insn chain into their unpredicated equivalents when transforming the loop
into an MVE Tail-Predicated Low Overhead Loop.  For example:
`mve_vldrbq_z_ -> mve_vldrbq_`.
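
In assembly terms the mapping enables rewrites like the following (an
illustrative fragment, mirroring the cover letter's listings rather than
compiler output):

```
	vpst
	vldrht.16	q2, [r4]	@ VPT-predicated form
```
becomes
```
	vldrh.16	q2, [r4]	@ its unpredicated equivalent
```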

gcc/ChangeLog:

* config/arm/arm.md (mve_unpredicated_insn): New attribute.
* config/arm/arm.h (MVE_VPT_PREDICATED_INSN_P): New define.
(MVE_VPT_UNPREDICATED_INSN_P): Likewise.
(MVE_VPT_PREDICABLE_INSN_P): Likewise.
* config/arm/vec-common.md (mve_vshlq_): Add attribute.
* config/arm/mve.md (arm_vcx1q_p_v16qi): Add attribute.
(arm_vcx1qv16qi): Likewise.
(arm_vcx1qav16qi): Likewise.
(arm_vcx1qv16qi): Likewise.
(arm_vcx2q_p_v16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx2qav16qi): Likewise.
(arm_vcx2qv16qi): Likewise.
(arm_vcx3q_p_v16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(arm_vcx3qav16qi): Likewise.
(arm_vcx3qv16qi): Likewise.
(@mve_q_): Likewise.
(@mve_q_int_): Likewise.
(@mve_q_v4si): Likewise.
(@mve_q_n_): Likewise.
(@mve_q_r_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_n_): Likewise.
(@mve_q_m_r_): Likewise.
(@mve_q_m_f): Likewise.
(@mve_q_int_m_): Likewise.
(@mve_q_p_v4si): Likewise.
(@mve_q_p_): Likewise.
(@mve_q_): Likewise.
(@mve_q_f): Likewise.
(@mve_q_m_): Likewise.
(@mve_q_m_f): Likewise.
(mve_vq_f): Likewise.
(mve_q): Likewise.
(mve_q_f): Likewise.
(mve_vadciq_v4si): Likewise.
(mve_vadciq_m_v4si): Likewise.
(mve_vadcq_v4si): Likewise.
(mve_vadcq_m_v4si): Likewise.
(mve_vandq_): Likewise.
(mve_vandq_f): Likewise.
(mve_vandq_m_): Likewise.
(mve_vandq_m_f): Likewise.
(mve_vandq_s): Likewise.
(mve_vandq_u): Likewise.
(mve_vbicq_): Likewise.
(mve_vbicq_f): Likewise.
(mve_vbicq_m_): Likewise.
(mve_vbicq_m_f): Likewise.
(mve_vbicq_m_n_): Likewise.
(mve_vbicq_n_): Likewise.
(mve_vbicq_s): Likewise.
(mve_vbicq_u): Likewise.
(@mve_vclzq_s): Likewise.
(mve_vclzq_u): Likewise.
(@mve_vcmp_q_): Likewise.
(@mve_vcmp_q_n_): Likewise.
(@mve_vcmp_q_f): Likewise.
(@mve_vcmp_q_n_f): Likewise.
(@mve_vcmp_q_m_f): Likewise.
(@mve_vcmp_q_m_n_): Likewise.
(@mve_vcmp_q_m_): Likewise.
(@mve_vcmp_q_m_n_f): Likewise.
(mve_vctpq): Likewise.
(mve_vctpq_m): Likewise.
(mve_vcvtaq_): Likewise.
(mve_vcvtaq_m_): Likewise.
(mve_vcvtbq_f16_f32v8hf): Likewise.
(mve_vcvtbq_f32_f16v4sf): Likewise.
(mve_vcvtbq_m_f16_f32v8hf): Likewise.
(mve_vcvtbq_m_f32_f16v4sf): Likewise.
(mve_vcvtmq_): Likewise.
(mve_vcvtmq_m_): Likewise.
(mve_vcvtnq_): Likewise.
(mve_vcvtnq_m_): Likewise.
(mve_vcvtpq_): Likewise.
(mve_vcvtpq_m_): Likewise.
(mve_vcvtq_from_f_): Likewise.
(mve_vcvtq_m_from_f_): Likewise.
(mve_vcvtq_m_n_from_f_): Likewise.
(mve_vcvtq_m_n_to_f_): Likewise.
(mve_vcvtq_m_to_f_): Likewise.
(mve_vcvtq_n_from_f_): Likewise.
(mve_vcvtq_n_to_f_): Likewise.
(mve_vcvtq_to_f_): Likewise.
(mve_vcvttq_f16_f32v8hf): Likewise.
(mve_vcvttq_f32_f16v4sf): Likewise.
(mve_vcvttq_m_f16_f32v8hf): Likewise.
(mve_vcvttq_m_f32_f16v4sf): Likewise.
(mve_vdwdupq_m_wb_u_insn): Likewise.
(mve_vdwdupq_wb_u_insn): Likewise.
(mve_veorq_s>): Likewise.
(mve_veorq_u>): Likewise.
(mve_veorq_f): Likewise.
(mve_vidupq_m_wb_u_insn): Likewise.
(mve_vidupq_u_insn): Likewise.
(mve_viwdupq_m_wb_u_insn): Likewise.
(mve_viwdupq_wb_u_insn): Likewise.
(mve_vldrbq_): Likewise.
(mve_vldrbq_gather_offset_): Likewise.
(mve_vldrbq_gather_offset_z_): Likewise.
(mve_vldrbq_z_): Likewise.
(mve_vldrdq_gather_base_v2di): Likewise.
(mve_vldrdq_gather_base_wb_v2di_insn): Likewise.
(mve_vldrdq_gather_base_wb_z_v2di_insn): Likewise.
(mve_vldrdq_gather_base_z_v2di): Likewise.
(mve_vldrdq_gather_offset_v2di): Likewise.
(mve_vldrdq_gather_offset_z_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_v2di): Likewise.
(mve_vldrdq_gather_shifted_offset_z_v2di): Likewise.
(mve_vldrhq_): Likewise.
(mve_vldrhq_fv8hf): Likewise.
(mve_vldrhq_gather_offset_): Likewi

[PATCH v6 4/5] doloop: Add support for predicated vectorized loops

2024-02-27 Thread Andre Vieira

This patch adds support in the target agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.

gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Add support for detecting
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/df-core.cc |  15 +
 gcc/df.h   |   1 +
 gcc/loop-doloop.cc | 164 +++--
 3 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+    return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+    return temp;
+  else
+    return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication, however we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+			    (const_int n-1))
+			   (label_ref)
+			   (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
-  || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !rtx_equal_p (XEXP (inc_src, 0), reg)
+  || !CONST_IN
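
To make form 4 above concrete (an illustrative instance with n = 8, i.e. a
16-bit MVE loop; hand-written, not compiler output):

```
(parallel
 [(set (pc)
       (if_then_else (gtu (plus (reg:SI LR) (const_int -8))
			  (const_int 7))
		     (label_ref (label))
		     (pc)))
  (set (reg:SI LR) (plus (reg:SI LR) (const_int -8)))])
```

Note that, read as plain arithmetic, this condition would not account for the
final, partially populated iteration; the arm patterns therefore wrap the
decrement in an unspec (see patch 5/5), and, as the comment above says,
doloop_condition_get does not need to care about those predication details.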

[PATCH v6 5/5] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-02-27 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.

gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DLSTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 23 files changed, 3321 insertions(+), 82 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 gcc/testsuite/gcc.target

[PATCH v3 0/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-19 Thread Andre Vieira
Hi,

Reworked the patches according to Kyrill's comments, made some other
non-functional changes and rebased.

Reposting as v3 so patchworks picks them up and runs the necessary testing.

Andre Vieira (2):
arm: Add define_attr to create a mapping between MVE predicated and
  unpredicated insns
arm: Add support for MVE Tail-Predicated Low Overhead Loops

-- 
2.17.1


[PATCH v3 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-19 Thread Andre Vieira

Respin after comments from Kyrill and a rebase.  I also removed an
if-then-else construct in arm_mve_check_reg_origin_is_num_elems, similar to
the other functions Kyrill pointed out.

After an earlier comment from Richard Sandiford I also added comments to the
two tail predication patterns added to explain the need for the unspecs.
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2cd560c9925..76c6ee95c16 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index e5a944486d7..75432f3f73a 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1135 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
  having as single predecessor and successor the body of the loop
  itself.  Only simple loops with a single basic block as body are
  supported for 'low over head loop' making sure that LE target is
  above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-    && single_pred_p (bb)
-    && single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-    && contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+      && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+      && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+    {
+      machine_mode mode = GET_MODE (SET_SRC (insn_set));
+      return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	     ? GET_MODE_NUNITS (mode) : 0;
+    }
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg, if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+    return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+    return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+     this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+    {
+      requires_vpr = true;
+      if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+      else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+      /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+      for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	    = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	     VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	    requires_vpr = false;
+	}
+      /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is a VPR-generating instruction, like a
+	 vctp, vcmp, etc., or it is a VPT-predicated instruction.  Return the
+	 subrtx of the VPR reg operand.  */
+      if (requires_vpr)
+	return recog_data.operand[op];
+    }
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg 

[PATCH v3 1/2] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2024-01-19 Thread Andre Vieira

Reposting for testing purposes, no changes from v2 (other than rebase).
diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 2a2207c0ba1..449e6935b32 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -2375,6 +2375,21 @@ extern int making_const_table;
   else if (TARGET_THUMB1)\
 thumb1_final_prescan_insn (INSN)
 
+/* These defines are useful to refer to the value of the mve_unpredicated_insn
+   insn attribute.  Note that, because these use the get_attr_* function, these
+   will change recog_data if (INSN) isn't current_insn.  */
+#define MVE_VPT_PREDICABLE_INSN_P(INSN)	\
+  (recog_memoized (INSN) >= 0		\
+   && get_attr_mve_unpredicated_insn (INSN) != CODE_FOR_nothing)
+
+#define MVE_VPT_PREDICATED_INSN_P(INSN)	\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) != get_attr_mve_unpredicated_insn (INSN))
+
+#define MVE_VPT_UNPREDICATED_INSN_P(INSN)\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) == get_attr_mve_unpredicated_insn (INSN))
+
 #define ARM_SIGN_EXTEND(x)  ((HOST_WIDE_INT)			\
   (HOST_BITS_PER_WIDE_INT <= 32 ? (unsigned HOST_WIDE_INT) (x)	\
   : ((((unsigned HOST_WIDE_INT)(x)) & (unsigned HOST_WIDE_INT) 0xffffffff) |\
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 4a98f2d7b62..3f2863adf44 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,12 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+; An attribute that encodes the CODE_FOR_ of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_ as a way to
+; encode that it is a predicable instruction.
+(define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 547d87f3bc8..7600bf62531 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2311,6 +2311,7 @@ (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") (UNSPEC_QSUB8 "qsub8")
 
 (define_int_attr mmla_sfx [(UNSPEC_MATMUL_S "s8") (UNSPEC_MATMUL_U "u8")
 			   (UNSPEC_MATMUL_US "s8")])
+
 ;;MVE int attribute.
 (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_U "u") (VMVNQ_N_S "s") (VMVNQ_N_U "u")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 0fabbaa931b..8aa0bded7f0 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,7 +17,7 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; .
 
-(define_insn "*mve_mov"
+(define_insn "mve_mov"
   [(set (match_operand:MVE_types 0 "nonimmediate_operand" "=w,w,r,w   , w,   r,Ux,w")
 	(match_operand:MVE_types 1 "general_operand"  " w,r,w,DnDm,UxUi,r,w, Ul"))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
@@ -81,18 +81,27 @@ (define_insn "*mve_mov"
   return "";
 }
 }
-  [(set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
+   [(set_attr_alternative "mve_unpredicated_insn" [(symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")])
+   (set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
(set_attr "length" "4,8,8,4,4,8,4,8")
(set_attr "thumb2_pool_range" "*,*,*,*,1018,*,*,*")
(set_attr "neg_pool_range" "*,*,*,*,996,*,*,*")])
 
-(define_insn "*mve_vdup"
+(define_insn "mve_vdup"
   [(set (match_operand:MVE_vecs 0 "s_register_operand" "=w")
 	(vec_duplicate:MVE_vecs
 	  (match_operand: 1 "s_register_operand" "r")))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
   "vdup.\t%q0, %1"
-  [(set_attr "length" "4")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vdup"))
+  (set_attr "length" "4")
(set_attr "type" "mve_move")])
 
 ;;
@@ -145,7 +154,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -159,7 +169,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -173,7 +184,8 @@ (define_insn "mve_vq_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "v.f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vq_f"

[PATCH 0/2] aarch64, bitint: Add support for _BitInt for AArch64 Little Endian

2024-01-25 Thread Andre Vieira
Hi,

This patch series adds support for _BitInt for AArch64 when compiling for
Little Endian. The first patch in the series fixes an issue that arises with
support for AArch64, the second patch adds the backend support for it.

Andre Vieira (2):
bitint: Use TARGET_ARRAY_MODE for large bitints where target supports it
aarch64: Add support for _BitInt

Patch series bootstrapped and regression tested on aarch64-unknown-linux-gnu
and x86_64-pc-linux-gnu.

Ok for trunk?

-- 
2.17.1


[PATCH 1/2] bitint: Use TARGET_ARRAY_MODE for large bitints where target supports it

2024-01-25 Thread Andre Vieira

This patch ensures we use TARGET_ARRAY_MODE to determine the storage mode of
large bitints that are represented as arrays in memory.  This is required to
support such bitints for aarch64 and potential other targets with similar
bitint specifications.  Existing tests like gcc.dg/torture/bitint-25.c are
affected by this for aarch64 targets.
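
For instance (illustrative arithmetic, mirroring the hunk below): with a
64-bit limb_mode, a _BitInt(256) has cnt = CEIL (256, 64) = 4 limbs, and its
type mode becomes whatever targetm.array_mode (DImode, 4) returns, rather
than unconditionally BLKmode.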

gcc/ChangeLog:
* stor-layout.cc (layout_type): Use TARGET_ARRAY_MODE for large bitints
for targets that implement it.
---
 gcc/stor-layout.cc | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/gcc/stor-layout.cc b/gcc/stor-layout.cc
index 4cf249133e9..31da2c123ab 100644
--- a/gcc/stor-layout.cc
+++ b/gcc/stor-layout.cc
@@ -2427,8 +2427,16 @@ layout_type (tree type)
 	  }
 	else
 	  {
-	SET_TYPE_MODE (type, BLKmode);
 	cnt = CEIL (TYPE_PRECISION (type), GET_MODE_PRECISION (limb_mode));
+	machine_mode mode;
+	/* Some targets use TARGET_ARRAY_MODE to select the mode they use
+	   for arrays with a specific element mode and a specific element
+	   count and we should use this mode for large bitints that are
+	   stored as such arrays.  */
+	if (!targetm.array_mode (limb_mode, cnt).exists (&mode)
+		|| !targetm.array_mode_supported_p (limb_mode, cnt))
+	  mode = BLKmode;
+	SET_TYPE_MODE (type, mode);
 	gcc_assert (info.abi_limb_mode == info.limb_mode
 			|| !info.big_endian == !WORDS_BIG_ENDIAN);
 	  }


[PATCH 2/2] aarch64: Add support for _BitInt

2024-01-25 Thread Andre Vieira

This patch adds support for C23's _BitInt for the AArch64 port when compiling
for little endianness.  Big endianness requires further target-agnostic
support, and we therefore disable it for now.
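
As a user-level illustration (my own example, not a test from the patch),
code like the following becomes compilable for aarch64 little-endian; under
the hook below, a _BitInt(256) uses DImode limbs (four 64-bit limbs) and is
returned in memory:

```
/* C23 _BitInt example; build with -std=c23 (or -std=gnu23).  */
_BitInt(256)
add256 (_BitInt(256) a, _BitInt(256) b)
{
  return a + b;
}
```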

gcc/ChangeLog:

* config/aarch64/aarch64.cc (TARGET_C_BITINT_TYPE_INFO): Declare MACRO.
(aarch64_bitint_type_info): New function.
(aarch64_return_in_memory_1): Return large _BitInt's in memory.
(aarch64_function_arg_alignment): Adapt to correctly return the ABI
mandated alignment of _BitInt(N) where N > 128 as the alignment of
TImode.
(aarch64_composite_type_p): Return true for _BitInt(N), where N > 128.

libgcc/ChangeLog:

* config/aarch64/t-softfp: Add fixtfbitint, floatbitinttf and
floatbitinthf to the softfp_extras variable to ensure the
runtime support is available for _BitInt.
---
 gcc/config/aarch64/aarch64.cc  | 44 +-
 libgcc/config/aarch64/t-softfp |  3 ++-
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6bd3fd0bb4..48bac51bc7c 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -6534,7 +6534,7 @@ aarch64_return_in_memory_1 (const_tree type)
   machine_mode ag_mode;
   int count;
 
-  if (!AGGREGATE_TYPE_P (type)
+  if (!(AGGREGATE_TYPE_P (type) || TREE_CODE (type) == BITINT_TYPE)
   && TREE_CODE (type) != COMPLEX_TYPE
   && TREE_CODE (type) != VECTOR_TYPE)
 /* Simple scalar types always returned in registers.  */
@@ -6618,6 +6618,10 @@ aarch64_function_arg_alignment (machine_mode mode, const_tree type,
 
   gcc_assert (TYPE_MODE (type) == mode);
 
+  if (TREE_CODE (type) == BITINT_TYPE
+  && int_size_in_bytes (type) > 16)
+return GET_MODE_ALIGNMENT (TImode);
+
   if (!AGGREGATE_TYPE_P (type))
 {
   /* The ABI alignment is the natural alignment of the type, without
@@ -21793,6 +21797,11 @@ aarch64_composite_type_p (const_tree type,
   if (type && (AGGREGATE_TYPE_P (type) || TREE_CODE (type) == COMPLEX_TYPE))
 return true;
 
+  if (type
+  && TREE_CODE (type) == BITINT_TYPE
+  && int_size_in_bytes (type) > 16)
+return true;
+
   if (mode == BLKmode
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
   || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT)
@@ -28330,6 +28339,36 @@ aarch64_excess_precision (enum excess_precision_type type)
   return FLT_EVAL_METHOD_UNPREDICTABLE;
 }
 
+/* Implement TARGET_C_BITINT_TYPE_INFO.
+   Return true if _BitInt(N) is supported and fill its details into *INFO.  */
+bool
+aarch64_bitint_type_info (int n, struct bitint_info *info)
+{
+  if (TARGET_BIG_END)
+    return false;
+
+  if (n <= 8)
+    info->limb_mode = QImode;
+  else if (n <= 16)
+    info->limb_mode = HImode;
+  else if (n <= 32)
+    info->limb_mode = SImode;
+  else if (n <= 64)
+    info->limb_mode = DImode;
+  else if (n <= 128)
+    info->limb_mode = TImode;
+  else
+    info->limb_mode = DImode;
+
+  if (n > 128)
+    info->abi_limb_mode = TImode;
+  else
+    info->abi_limb_mode = info->limb_mode;
+  info->big_endian = TARGET_BIG_END;
+  info->extended = false;
+  return true;
+}
+
 /* Implement TARGET_SCHED_CAN_SPECULATE_INSN.  Return true if INSN can be
scheduled for speculative execution.  Reject the long-running division
and square-root instructions.  */
@@ -30439,6 +30478,9 @@ aarch64_run_selftests (void)
 #undef TARGET_C_EXCESS_PRECISION
 #define TARGET_C_EXCESS_PRECISION aarch64_excess_precision
 
+#undef TARGET_C_BITINT_TYPE_INFO
+#define TARGET_C_BITINT_TYPE_INFO aarch64_bitint_type_info
+
 #undef  TARGET_EXPAND_BUILTIN
 #define TARGET_EXPAND_BUILTIN aarch64_expand_builtin
 
diff --git a/libgcc/config/aarch64/t-softfp b/libgcc/config/aarch64/t-softfp
index 2e32366f891..a335a34c243 100644
--- a/libgcc/config/aarch64/t-softfp
+++ b/libgcc/config/aarch64/t-softfp
@@ -4,7 +4,8 @@ softfp_extensions := sftf dftf hftf bfsf
 softfp_truncations := tfsf tfdf tfhf tfbf dfbf sfbf hfbf
 softfp_exclude_libgcc2 := n
 softfp_extras += fixhfti fixunshfti floattihf floatuntihf \
-		 floatdibf floatundibf floattibf floatuntibf
+		 floatdibf floatundibf floattibf floatuntibf \
+		 fixtfbitint floatbitinttf floatbitinthf
 
 TARGET_LIBGCC2_CFLAGS += -Wno-missing-prototypes
 


[PATCH 0/3] vect, aarch64: Add SVE support for simdclones

2024-01-30 Thread Andre Vieira
Hi,

This patch series is a set of patches that I have sent up for review before,
and it enables initial support for SVE simd clones, with some caveats.
Caveat 1: we do not support SVE simd clones with function bodies.
To enable support for this we need to change the way we 'simdify' a function
body: for each argument that maps to a vector, an array of 'simdlen' elements
is created, which does not work for a VLA simdlen.  We will need to come up
with a way to support this such that the generated code is performant; there's
little reason to 'simdify' a function by generating really slow code.  I have
some ideas on how we might be able to do this, though I'm not convinced it's
even worth trying, but I think that's a bigger discussion.  For now I've
disabled generating SVE simdclones for functions with function bodies.  This
still fits our libmvec use case, as the simd clones are handwritten using
intrinsics in glibc.

Caveat 2: we cannot generate ncopies calls to an SVE simd clone.
When I first sent the second patch of this series upstream, Richi asked me to
look at supporting ncopies calls of VLA-simdlen simd clones.  I have
vectorizer code to do this; however, I found that we didn't yet have enough
backend support to index VLA vectors, which this requires.  I think that will
need to wait until GCC 15, so for now I simply reject vectorization where
that is required.

Caveat 3: we don't yet support SVE simdclones for VLS codegen.
We've disabled the use of SVE simdclones when the -msve-vector-bits option is
used to request VLS codegen.  We need this because the mangling is determined
by the 'simdlen' of a simd clone, which will not be VLA when -msve-vector-bits
is passed.  We would like to support using VLA simd clones when generating
VLS, but for that to work right now we'd need to set the simdlen of the simd
clone to the VLS value, and that messes up the mangling.  In the future we
will need to add a target hook to specify the mangling.

Given that the target-agnostic changes are minimal, have been suggested before,
and have no impact on other targets, and the target-specific parts have been
reviewed before, would this still be acceptable for Stage 4?  I would really
like to make use of the work that was done to support this and of the SVE
simdclones added to glibc.
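
To illustrate the libmvec-style use case this targets (a hand-written sketch;
the function name is made up): the simd clone is only declared, so caveat 1
does not apply, and calls in a vectorized loop can resolve to an SVE clone
implemented elsewhere with intrinsics:

```
/* Bodyless declaration: GCC records the simd clones of expf_like without
   having to 'simdify' a body, and the library provides the actual clones.  */
#pragma omp declare simd notinbranch
float expf_like (float x);

void
apply (float *restrict out, const float *restrict in, int n)
{
  #pragma omp simd
  for (int i = 0; i < n; i++)
    out[i] = expf_like (in[i]);
}
```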

Kind regards,
Andre

Andre Vieira (3):
vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE
vect: disable multiple calls of poly simdclones
aarch64: Add SVE support for simd clones [PR 96342]

-- 
2.17.1


[PATCH 1/3] vect: Pass stmt_vec_info to TARGET_SIMD_CLONE_USABLE

2024-01-30 Thread Andre Vieira

This patch adds stmt_vec_info to TARGET_SIMD_CLONE_USABLE to make sure the
target can reject a simd_clone based on the vector mode it is using.
This is needed because, for VLS SVE vectorization, the vectorizer accepts
Advanced SIMD simd clones when vectorizing using SVE types, as the simdlens
might match; this would cause type errors later on.

Other targets do not currently need to use this argument.
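
A sketch of the failure mode being guarded against (assumed flags, made-up
names):

```
/* With e.g. -march=armv8.2-a+sve -msve-vector-bits=128, the loop below
   vectorizes using 128-bit VLS SVE types with 4 int lanes; the Advanced
   SIMD clone of f also has simdlen 4, so without the new stmt_vec_info
   argument the target could not tell that the clone's Neon vector types
   do not match the SVE vector mode actually in use.  */
#pragma omp declare simd simdlen(4) notinbranch
int f (int x);

void
g (int *restrict a, const int *restrict b)
{
  for (int i = 0; i < 1024; i++)
    a[i] = f (b[i]);
}
```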

gcc/ChangeLog:

* target.def (TARGET_SIMD_CLONE_USABLE): Add argument.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Pass stmt_info to
call TARGET_SIMD_CLONE_USABLE.
* config/aarch64/aarch64.cc (aarch64_simd_clone_usable): Add argument
and use it to reject the use of SVE simd clones with Advanced SIMD
modes.
* config/gcn/gcn.cc (gcn_simd_clone_usable): Add unused argument.
* config/i386/i386.cc (ix86_simd_clone_usable): Likewise.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index a37d47b243e..31617510160 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -28694,13 +28694,16 @@ aarch64_simd_clone_adjust (struct cgraph_node *node)
 /* Implement TARGET_SIMD_CLONE_USABLE.  */
 
 static int
-aarch64_simd_clone_usable (struct cgraph_node *node)
+aarch64_simd_clone_usable (struct cgraph_node *node, stmt_vec_info stmt_vinfo)
 {
   switch (node->simdclone->vecsize_mangle)
 {
 case 'n':
   if (!TARGET_SIMD)
 	return -1;
+  if (STMT_VINFO_VECTYPE (stmt_vinfo)
+	  && aarch64_sve_mode_p (TYPE_MODE (STMT_VINFO_VECTYPE (stmt_vinfo
+	return -1;
   return 0;
 default:
   gcc_unreachable ();
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index e80de2ce056..c48b212d9e6 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -5658,7 +5658,8 @@ gcn_simd_clone_adjust (struct cgraph_node *ARG_UNUSED (node))
 /* Implement TARGET_SIMD_CLONE_USABLE.  */
 
 static int
-gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node))
+gcn_simd_clone_usable (struct cgraph_node *ARG_UNUSED (node),
+		   stmt_vec_info ARG_UNUSED (stmt_vinfo))
 {
   /* We don't need to do anything here because
  gcn_simd_clone_compute_vecsize_and_simdlen currently only returns one
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b3e7c74846e..63e6b9d2643 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25193,7 +25193,8 @@ ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
slightly less desirable, etc.).  */
 
 static int
-ix86_simd_clone_usable (struct cgraph_node *node)
+ix86_simd_clone_usable (struct cgraph_node *node,
+			stmt_vec_info ARG_UNUSED (stmt_vinfo))
 {
   switch (node->simdclone->vecsize_mangle)
 {
diff --git a/gcc/target.def b/gcc/target.def
index fdad7bbc93e..4fade9c4eec 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1648,7 +1648,7 @@ DEFHOOK
 in vectorized loops in current function, or non-negative number if it is\n\
 usable.  In that case, the smaller the number is, the more desirable it is\n\
 to use it.",
-int, (struct cgraph_node *), NULL)
+int, (struct cgraph_node *, _stmt_vec_info *), NULL)
 
 HOOK_VECTOR_END (simd_clone)
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 1dbe1115da4..da02082c034 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4074,7 +4074,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
 	  this_badness += floor_log2 (num_calls) * 4096;
 	if (n->simdclone->inbranch)
 	  this_badness += 8192;
-	int target_badness = targetm.simd_clone.usable (n);
+	int target_badness = targetm.simd_clone.usable (n, stmt_info);
 	if (target_badness < 0)
 	  continue;
 	this_badness += target_badness * 512;


[PATCH 2/3] vect: disable multiple calls of poly simdclones

2024-01-30 Thread Andre Vieira

The current codegen to support VFs that are multiples of a simdclone's
simdlen relies on BIT_FIELD_REF to create multiple input vectors.  This does
not work for non-constant simdclones, so we should disable using such clones
when the VF is a multiple of the non-constant simdlen, until we change the
codegen to support those.
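
For illustration, the splitting that BIT_FIELD_REF enables in the
constant-simdlen case looks roughly like this (hand-written, gimple-like
pseudocode, not compiler output):

```
  /* VF = 8, simdlen = 4: carve two 128-bit halves out of the input.  */
  vec_lo = BIT_FIELD_REF <vec_in, 128, 0>;
  vec_hi = BIT_FIELD_REF <vec_in, 128, 128>;
  res_lo = foo.simdclone (vec_lo);
  res_hi = foo.simdclone (vec_hi);
```

For a poly (VLA) simdlen there is no constant bit size or bit position to use
here, hence the new check.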

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_simd_clone_call): Reject simdclones
with non-constant simdlen when VF is not exactly the same.

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index da02082c034..9bfb898683d 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4068,7 +4068,10 @@ vectorizable_simd_clone_call (vec_info *vinfo, stmt_vec_info stmt_info,
 	if (!constant_multiple_p (vf * group_size, n->simdclone->simdlen,
   &num_calls)
 	|| (!n->simdclone->inbranch && (masked_call_offset > 0))
-	|| (nargs != simd_nargs))
+	|| (nargs != simd_nargs)
+	/* Currently we do not support multiple calls of non-constant
+	   simdlen as poly vectors can not be accessed by BIT_FIELD_REF.  */
+	|| (!n->simdclone->simdlen.is_constant () && num_calls != 1))
 	  continue;
 	if (num_calls != 1)
 	  this_badness += floor_log2 (num_calls) * 4096;


[PATCH 3/3] aarch64: Add SVE support for simd clones [PR 96342]

2024-01-30 Thread Andre Vieira

This patch finalizes adding support for the generation of SVE simd clones when
no simdlen is provided, following the ABI rules where the widest data type
determines the minimum number of elements in a length-agnostic vector.

gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (add_sve_type_attribute): Declare.
* config/aarch64/aarch64-sve-builtins.cc (add_sve_type_attribute): Make
visibility global and support use for non_acle types.
* config/aarch64/aarch64.cc
(aarch64_simd_clone_compute_vecsize_and_simdlen): Create VLA simd clone
when no simdlen is provided, according to ABI rules.
(simd_clone_adjust_sve_vector_type): New helper function.
(aarch64_simd_clone_adjust): Add '+sve' attribute to SVE simd clones
and modify types to use SVE types.
* omp-simd-clone.cc (simd_clone_mangle): Print 'x' for VLA simdlen.
(simd_clone_adjust): Adapt safelen check to be compatible with VLA
simdlen.

gcc/testsuite/ChangeLog:

* c-c++-common/gomp/declare-variant-14.c: Make i?86 and x86_64 target
only test.
* gfortran.dg/gomp/declare-variant-14.f90: Likewise.
* gcc.target/aarch64/declare-simd-2.c: Add SVE clone scan.
* gcc.target/aarch64/vect-simd-clone-1.c: New test.

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index a0b142e0b94..207396de0ff 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1031,6 +1031,8 @@ namespace aarch64_sve {
 #ifdef GCC_TARGET_H
   bool verify_type_context (location_t, type_context_kind, const_tree, bool);
 #endif
+ void add_sve_type_attribute (tree, unsigned int, unsigned int,
+			  const char *, const char *);
 }
 
 extern void aarch64_split_combinev16qi (rtx operands[3]);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 11f5c5c500c..747131e684e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -953,14 +953,16 @@ static bool reported_missing_registers_p;
 /* Record that TYPE is an ABI-defined SVE type that contains NUM_ZR SVE vectors
and NUM_PR SVE predicates.  MANGLED_NAME, if nonnull, is the ABI-defined
mangling of the type.  ACLE_NAME is the  name of the type.  */
-static void
+void
 add_sve_type_attribute (tree type, unsigned int num_zr, unsigned int num_pr,
 			const char *mangled_name, const char *acle_name)
 {
   tree mangled_name_tree
 = (mangled_name ? get_identifier (mangled_name) : NULL_TREE);
+  tree acle_name_tree
+= (acle_name ? get_identifier (acle_name) : NULL_TREE);
 
-  tree value = tree_cons (NULL_TREE, get_identifier (acle_name), NULL_TREE);
+  tree value = tree_cons (NULL_TREE, acle_name_tree, NULL_TREE);
   value = tree_cons (NULL_TREE, mangled_name_tree, value);
   value = tree_cons (NULL_TREE, size_int (num_pr), value);
   value = tree_cons (NULL_TREE, size_int (num_zr), value);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 31617510160..cba8879ab33 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -28527,7 +28527,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 	int num, bool explicit_p)
 {
   tree t, ret_type;
-  unsigned int nds_elt_bits;
+  unsigned int nds_elt_bits, wds_elt_bits;
   unsigned HOST_WIDE_INT const_simdlen;
 
   if (!TARGET_SIMD)
@@ -28572,10 +28572,14 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
   if (TREE_CODE (ret_type) != VOID_TYPE)
 {
   nds_elt_bits = lane_size (SIMD_CLONE_ARG_TYPE_VECTOR, ret_type);
+  wds_elt_bits = nds_elt_bits;
   vec_elts.safe_push (std::make_pair (ret_type, nds_elt_bits));
 }
   else
-nds_elt_bits = POINTER_SIZE;
+{
+  nds_elt_bits = POINTER_SIZE;
+  wds_elt_bits = 0;
+}
 
   int i;
   tree type_arg_types = TYPE_ARG_TYPES (TREE_TYPE (node->decl));
@@ -28583,44 +28587,72 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
   for (t = (decl_arg_p ? DECL_ARGUMENTS (node->decl) : type_arg_types), i = 0;
t && t != void_list_node; t = TREE_CHAIN (t), i++)
 {
-  tree arg_type = decl_arg_p ? TREE_TYPE (t) : TREE_VALUE (t);
+  tree type = decl_arg_p ? TREE_TYPE (t) : TREE_VALUE (t);
   if (clonei->args[i].arg_type != SIMD_CLONE_ARG_TYPE_UNIFORM
-	  && !supported_simd_type (arg_type))
+	  && !supported_simd_type (type))
 	{
 	  if (!explicit_p)
 	;
-	  else if (COMPLEX_FLOAT_TYPE_P (ret_type))
+	  else if (COMPLEX_FLOAT_TYPE_P (type))
 	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
 			"GCC does not currently support argument type %qT "
-			"for simd", arg_type);
+			"for simd", type);
 	  else
 	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
 			"unsupported argument type %qT for simd",
-			arg_type);
+			type);
 	  return 0;
 	}
-  unsigned 

[PATCH 0/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-18 Thread Andre Vieira
Resending series to make use of the Linaro pre-commit CI in patchworks.

Andre Vieira (2):
  arm: Add define_attr to create a mapping between MVE predicated and
unpredicated insns
  arm: Add support for MVE Tail-Predicated Low Overhead Loops

-- 
2.17.1


[PATCH 1/2] arm: Add define_attr to create a mapping between MVE predicated and unpredicated insns

2023-12-18 Thread Andre Vieira

Re-sending Stam's first patch, same as:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635301.html

Hopefully patchworks can pick this up :)

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index a9c2752c0ea..0b0e8620717 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -2375,6 +2375,21 @@ extern int making_const_table;
   else if (TARGET_THUMB1)\
 thumb1_final_prescan_insn (INSN)
 
+/* These defines are useful to refer to the value of the mve_unpredicated_insn
+   insn attribute.  Note that, because these use the get_attr_* function, these
+   will change recog_data if (INSN) isn't current_insn.  */
+#define MVE_VPT_PREDICABLE_INSN_P(INSN)	\
+  (recog_memoized (INSN) >= 0		\
+  && get_attr_mve_unpredicated_insn (INSN) != 0)			\
+
+#define MVE_VPT_PREDICATED_INSN_P(INSN)	\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) != get_attr_mve_unpredicated_insn (INSN))	\
+
+#define MVE_VPT_UNPREDICATED_INSN_P(INSN)\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) == get_attr_mve_unpredicated_insn (INSN))	\
+
 #define ARM_SIGN_EXTEND(x)  ((HOST_WIDE_INT)			\
   (HOST_BITS_PER_WIDE_INT <= 32 ? (unsigned HOST_WIDE_INT) (x)	\
   : ((((unsigned HOST_WIDE_INT)(x)) & (unsigned HOST_WIDE_INT) 0xffffffff) |\
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 07eaf06cdea..8efdebecc3c 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,8 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+(define_attr "mve_unpredicated_insn" "" (const_int 0))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index a9803538101..5ea2d9e8668 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2305,6 +2305,7 @@ (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") (UNSPEC_QSUB8 "qsub8")
 
 (define_int_attr mmla_sfx [(UNSPEC_MATMUL_S "s8") (UNSPEC_MATMUL_U "u8")
 			   (UNSPEC_MATMUL_US "s8")])
+
 ;;MVE int attribute.
 (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_U "u") (VMVNQ_N_S "s") (VMVNQ_N_U "u")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b0d3443da9c..62df022ef19 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,7 +17,7 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; .
 
-(define_insn "*mve_mov"
+(define_insn "mve_mov"
   [(set (match_operand:MVE_types 0 "nonimmediate_operand" "=w,w,r,w   , w,   r,Ux,w")
 	(match_operand:MVE_types 1 "general_operand"  " w,r,w,DnDm,UxUi,r,w, Ul"))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
@@ -81,18 +81,27 @@ (define_insn "*mve_mov"
   return "";
 }
 }
-  [(set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
+   [(set_attr_alternative "mve_unpredicated_insn" [(symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov")
+		   (symbol_ref "CODE_FOR_nothing")])
+   (set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
(set_attr "length" "4,8,8,4,4,8,4,8")
(set_attr "thumb2_pool_range" "*,*,*,*,1018,*,*,*")
(set_attr "neg_pool_range" "*,*,*,*,996,*,*,*")])
 
-(define_insn "*mve_vdup"
+(define_insn "mve_vdup"
   [(set (match_operand:MVE_vecs 0 "s_register_operand" "=w")
 	(vec_duplicate:MVE_vecs
 	  (match_operand: 1 "s_register_operand" "r")))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
   "vdup.\t%q0, %1"
-  [(set_attr "length" "4")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vdup"))
+  (set_attr "length" "4")
(set_attr "type" "mve_move")])
 
 ;;
@@ -145,7 +154,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -159,7 +169,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -173,7 +184,8 @@ (define_insn "mve_vq_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "v.f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vq_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -187,7 +199,8 @@ (define_insn "@mve_q_n_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_u

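As an aside, the classification that the mve_unpredicated_insn attribute and the
MVE_VPT_* macros above implement can be modelled in a few lines of standalone C.
This is only an illustration: the insn codes and the mapping table below are
invented stand-ins for recog_memoized () and the generated
get_attr_mve_unpredicated_insn (), not GCC code.

```
#include <stdio.h>

/* Invented insn codes standing in for GCC's generated CODE_FOR_* values.  */
enum insn_code { CODE_FOR_nothing, CODE_FOR_vaddq, CODE_FOR_vaddq_p, CODE_FOR_other };

/* Stand-in for the mve_unpredicated_insn attribute: the code of the
   unpredicated twin, or CODE_FOR_nothing if the insn is not predicable.  */
static const enum insn_code unpredicated_twin[] = {
  [CODE_FOR_vaddq]   = CODE_FOR_vaddq,	 /* unpredicated form maps to itself */
  [CODE_FOR_vaddq_p] = CODE_FOR_vaddq,	 /* predicated form maps to its twin */
  [CODE_FOR_other]   = CODE_FOR_nothing, /* not predicable at all */
};

static int predicable_p (enum insn_code c)   { return unpredicated_twin[c] != CODE_FOR_nothing; }
static int predicated_p (enum insn_code c)   { return predicable_p (c) && unpredicated_twin[c] != c; }
static int unpredicated_p (enum insn_code c) { return predicable_p (c) && unpredicated_twin[c] == c; }

int main (void)
{
  printf ("vaddq:   predicable=%d predicated=%d unpredicated=%d\n",
	  predicable_p (CODE_FOR_vaddq), predicated_p (CODE_FOR_vaddq),
	  unpredicated_p (CODE_FOR_vaddq));
  printf ("vaddq_p: predicable=%d predicated=%d unpredicated=%d\n",
	  predicable_p (CODE_FOR_vaddq_p), predicated_p (CODE_FOR_vaddq_p),
	  unpredicated_p (CODE_FOR_vaddq_p));
  printf ("other:   predicable=%d\n", predicable_p (CODE_FOR_other));
  return 0;
}
```
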
[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-12-18 Thread Andre Vieira

Reworked Stam's patch after comments in:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640362.html

The original gcc ChangeLog remains unchanged, but I did split up some tests so
here is the testsuite ChangeLog.


gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Update framework.
* gcc.target/arm/lob1.c: Likewise.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/mve/dlstp-compile-asm.c: New test.
* gcc.target/arm/mve/dlstp-int16x8.c: New test.
* gcc.target/arm/mve/dlstp-int16x8-run.c: New test.
* gcc.target/arm/mve/dlstp-int32x4.c: New test.
* gcc.target/arm/mve/dlstp-int32x4-run.c: New test.
* gcc.target/arm/mve/dlstp-int64x2.c: New test.
* gcc.target/arm/mve/dlstp-int64x2-run.c: New test.
* gcc.target/arm/mve/dlstp-int8x16.c: New test.
* gcc.target/arm/mve/dlstp-int8x16-run.c: New test.
* gcc.target/arm/mve/dlstp-invalid-asm.c: New test.

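To give a feel for what these new dlstp tests exercise, here is a minimal
sketch of the shape of such a test.  It is illustrative only and written from
the description above: the dejagnu directives, intrinsics and scan patterns
are my assumptions about the test layout, not a copy of the committed files.

```
/* { dg-do compile } */
/* { dg-require-effective-target arm_v8_1m_mve_ok } */
/* { dg-options "-O2" } */
/* { dg-add-options arm_v8_1m_mve } */

#include <arm_mve.h>

/* A predicated loop the compiler should turn into a DLSTP/LETP loop.  */
void
sub_loop (int32_t *a, int32_t *b, int32_t *c, int n)
{
  while (n > 0)
    {
      mve_pred16_t p = vctp32q (n);
      int32x4_t va = vldrwq_z_s32 (a, p);
      int32x4_t vb = vldrwq_z_s32 (b, p);
      vstrwq_p_s32 (c, vsubq_x_s32 (va, vb, p), p);
      a += 4;
      b += 4;
      c += 4;
      n -= 4;
    }
}

/* The compile-only tests then scan the output, e.g.:
   { dg-final { scan-assembler "dlstp.32" } }
   { dg-final { scan-assembler "letp" } }  */
```
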
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 0c0cb14a8a4..1ee72bcb7ec 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1147 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
  having as single predecessor and successor the body of the loop
  itself.  Only simple loops with a single basic block as body are
  supported for 'low over head loop' making sure that LE target is
  above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-&& single_pred_p (bb)
-&& single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-&& contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+  && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+  && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+{
+  machine_mode mode = GET_MODE (SET_SRC (insn_set));
+  return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	 ? GET_MODE_NUNITS (mode) : 0;
+}
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg; if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands.)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+ this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+{
+  requires_vpr = true;
+  if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+  else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+  /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+  for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operan

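The TYPE argument described in the comment above is just a direction filter
over the insn's operands.  A standalone sketch of that filter (not GCC code;
the operand directions are the only inputs that matter here):

```
#include <stdio.h>

enum op_dir { OP_IN, OP_OUT, OP_INOUT };

/* Mirror of the TYPE filter: 0 checks every operand, 1 only inputs,
   2 only outputs.  An INOUT operand passes both filters.  */
static int
considered_p (unsigned int type, enum op_dir dir)
{
  if (type == 1 && dir == OP_OUT)
    return 0;
  if (type == 2 && dir == OP_IN)
    return 0;
  return 1;
}

int main (void)
{
  const char *name[] = { "IN", "OUT", "INOUT" };
  for (unsigned int type = 0; type < 3; type++)
    for (int d = OP_IN; d <= OP_INOUT; d++)
      printf ("TYPE == %u, %-5s operand -> %s\n", type, name[d],
	      considered_p (type, (enum op_dir) d) ? "checked" : "skipped");
  return 0;
}
```
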
[PATCH v2 0/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-05 Thread Andre Vieira
Hi,

Resending series version 2, addressing comments on the first version; I also
moved parts of the first patch to the second so the first patch can be built
without the second.

Andre Vieira (2):
  arm: Add define_attr to to create a mapping between MVE predicated and
unpredicated insns
  arm: Add support for MVE Tail-Predicated Low Overhead Loops


-- 
2.17.1


[PATCH v2 1/2] arm: Add define_attr to to create a mapping between MVE predicated and unpredicated insns

2024-01-05 Thread Andre Vieira
Respin of first version to address comments and make it buildable on its own.

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index a9c2752c0ea..f0b01b7461f 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -2375,6 +2375,21 @@ extern int making_const_table;
   else if (TARGET_THUMB1)\
 thumb1_final_prescan_insn (INSN)
 
+/* These defines are useful to refer to the value of the mve_unpredicated_insn
+   insn attribute.  Note that, because these use the get_attr_* function, these
+   will change recog_data if (INSN) isn't current_insn.  */
+#define MVE_VPT_PREDICABLE_INSN_P(INSN)	\
+  (recog_memoized (INSN) >= 0		\
+   && get_attr_mve_unpredicated_insn (INSN) != CODE_FOR_nothing)
+
+#define MVE_VPT_PREDICATED_INSN_P(INSN)	\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) != get_attr_mve_unpredicated_insn (INSN))
+
+#define MVE_VPT_UNPREDICATED_INSN_P(INSN)\
+  (MVE_VPT_PREDICABLE_INSN_P (INSN)	\
+   && recog_memoized (INSN) == get_attr_mve_unpredicated_insn (INSN))
+
 #define ARM_SIGN_EXTEND(x)  ((HOST_WIDE_INT)			\
   (HOST_BITS_PER_WIDE_INT <= 32 ? (unsigned HOST_WIDE_INT) (x)	\
   : ((((unsigned HOST_WIDE_INT)(x)) & (unsigned HOST_WIDE_INT) 0xffffffff) |\
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 07eaf06cdea..296212be33f 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -124,6 +124,12 @@ (define_attr "fpu" "none,vfp"
 ; and not all ARM insns do.
 (define_attr "predicated" "yes,no" (const_string "no"))
 
+; An attribute that encodes the CODE_FOR_<insn> of the MVE VPT unpredicated
+; version of a VPT-predicated instruction.  For unpredicated instructions
+; that are predicable, encode the same pattern's CODE_FOR_<insn> as a way to
+; encode that it is a predicable instruction.
+(define_attr "mve_unpredicated_insn" "" (symbol_ref "CODE_FOR_nothing"))
+
 ; LENGTH of an instruction (in bytes)
 (define_attr "length" ""
   (const_int 4))
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index a9803538101..5ea2d9e8668 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -2305,6 +2305,7 @@ (define_int_attr simd32_op [(UNSPEC_QADD8 "qadd8") (UNSPEC_QSUB8 "qsub8")
 
 (define_int_attr mmla_sfx [(UNSPEC_MATMUL_S "s8") (UNSPEC_MATMUL_U "u8")
 			   (UNSPEC_MATMUL_US "s8")])
+
 ;;MVE int attribute.
 (define_int_attr supf [(VCVTQ_TO_F_S "s") (VCVTQ_TO_F_U "u") (VREV16Q_S "s")
 		   (VREV16Q_U "u") (VMVNQ_N_S "s") (VMVNQ_N_U "u")
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b0d3443da9c..b1862d7977e 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -17,7 +17,7 @@
 ;; along with GCC; see the file COPYING3.  If not see
;; <http://www.gnu.org/licenses/>.
 
-(define_insn "*mve_mov"
+(define_insn "mve_mov"
   [(set (match_operand:MVE_types 0 "nonimmediate_operand" "=w,w,r,w   , w,   r,Ux,w")
 	(match_operand:MVE_types 1 "general_operand"  " w,r,w,DnDm,UxUi,r,w, Ul"))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
@@ -81,18 +81,27 @@ (define_insn "*mve_mov<mode>"
   return "";
 }
 }
-  [(set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
+   [(set_attr_alternative "mve_unpredicated_insn" [(symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")
+		   (symbol_ref "CODE_FOR_mve_mov<mode>")
+		   (symbol_ref "CODE_FOR_nothing")])
+   (set_attr "type" "mve_move,mve_move,mve_move,mve_move,mve_load,multiple,mve_store,mve_load")
(set_attr "length" "4,8,8,4,4,8,4,8")
(set_attr "thumb2_pool_range" "*,*,*,*,1018,*,*,*")
(set_attr "neg_pool_range" "*,*,*,*,996,*,*,*")])
 
-(define_insn "*mve_vdup"
+(define_insn "mve_vdup"
   [(set (match_operand:MVE_vecs 0 "s_register_operand" "=w")
 	(vec_duplicate:MVE_vecs
 	  (match_operand: 1 "s_register_operand" "r")))]
   "TARGET_HAVE_MVE || TARGET_HAVE_MVE_FLOAT"
   "vdup.\t%q0, %1"
-  [(set_attr "length" "4")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_vdup"))
+  (set_attr "length" "4")
(set_attr "type" "mve_move")])
 
 ;;
@@ -145,7 +154,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -159,7 +169,8 @@ (define_insn "@mve_q_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   ".%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve_q_f"))
+  (set_attr "type" "mve_move")
 ])
 
 ;;
@@ -173,7 +184,8 @@ (define_insn "mve_vq_f"
   ]
   "TARGET_HAVE_MVE && TARGET_HAVE_MVE_FLOAT"
   "v.f%#\t%q0, %q1"
-  [(set_attr "type" "mve_move")
+ [(set (attr "mve_unpredicated_insn") (symbol_ref "CODE_FOR_mve

[PATCH v2 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-01-05 Thread Andre Vieira
Respin after comments on first version.
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2f5ca79ed8d..4f164c54740 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -65,8 +65,8 @@ extern void arm_emit_speculation_barrier_function (void);
 extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
-extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern bool arm_target_bb_ok_for_lob (basic_block);
+extern rtx arm_attempt_dlstp_transform (rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 0c0cb14a8a4..1ee72bcb7ec 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -668,6 +668,12 @@ static const scoped_attribute_specs *const arm_attribute_table[] =
 #undef TARGET_HAVE_CONDITIONAL_EXECUTION
 #define TARGET_HAVE_CONDITIONAL_EXECUTION arm_have_conditional_execution
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST arm_loop_unroll_adjust
+
+#undef TARGET_PREDICT_DOLOOP_P
+#define TARGET_PREDICT_DOLOOP_P arm_predict_doloop_p
+
 #undef TARGET_LEGITIMATE_CONSTANT_P
 #define TARGET_LEGITIMATE_CONSTANT_P arm_legitimate_constant_p
 
@@ -34483,19 +34489,1147 @@ arm_invalid_within_doloop (const rtx_insn *insn)
 }
 
 bool
-arm_target_insn_ok_for_lob (rtx insn)
+arm_target_bb_ok_for_lob (basic_block bb)
 {
-  basic_block bb = BLOCK_FOR_INSN (insn);
   /* Make sure the basic block of the target insn is a simple latch
  having as single predecessor and successor the body of the loop
  itself.  Only simple loops with a single basic block as body are
  supported for 'low over head loop' making sure that LE target is
  above LE itself in the generated code.  */
-
   return single_succ_p (bb)
-&& single_pred_p (bb)
-&& single_succ_edge (bb)->dest == single_pred_edge (bb)->src
-&& contains_no_active_insn_p (bb);
+	 && single_pred_p (bb)
+	 && single_succ_edge (bb)->dest == single_pred_edge (bb)->src;
+}
+
+/* Utility function: Given a VCTP or a VCTP_M insn, return the number of MVE
+   lanes based on the machine mode being used.  */
+
+static int
+arm_mve_get_vctp_lanes (rtx_insn *insn)
+{
+  rtx insn_set = single_set (insn);
+  if (insn_set
+  && GET_CODE (SET_SRC (insn_set)) == UNSPEC
+  && (XINT (SET_SRC (insn_set), 1) == VCTP
+	  || XINT (SET_SRC (insn_set), 1) == VCTP_M))
+{
+  machine_mode mode = GET_MODE (SET_SRC (insn_set));
+  return (VECTOR_MODE_P (mode) && VALID_MVE_PRED_MODE (mode))
+	 ? GET_MODE_NUNITS (mode) : 0;
+}
+  return 0;
+}
+
+/* Check if INSN requires the use of the VPR reg; if it does, return the
+   sub-rtx of the VPR reg.  The TYPE argument controls whether
+   this function should:
+   * For TYPE == 0, check all operands, including the OUT operands,
+     and return the first occurrence of the VPR reg.
+   * For TYPE == 1, only check the input operands.
+   * For TYPE == 2, only check the output operands.
+   (INOUT operands are considered both as input and output operands.)
+*/
+static rtx
+arm_get_required_vpr_reg (rtx_insn *insn, unsigned int type = 0)
+{
+  gcc_assert (type < 3);
+  if (!NONJUMP_INSN_P (insn))
+return NULL_RTX;
+
+  bool requires_vpr;
+  extract_constrain_insn (insn);
+  int n_operands = recog_data.n_operands;
+  if (recog_data.n_alternatives == 0)
+return NULL_RTX;
+
+  /* Fill in recog_op_alt with information about the constraints of
+ this insn.  */
+  preprocess_constraints (insn);
+
+  for (int op = 0; op < n_operands; op++)
+{
+  requires_vpr = true;
+  if (type == 1 && recog_data.operand_type[op] == OP_OUT)
+	continue;
+  else if (type == 2 && recog_data.operand_type[op] == OP_IN)
+	continue;
+
+  /* Iterate through alternatives of operand "op" in recog_op_alt and
+	 identify if the operand is required to be the VPR.  */
+  for (int alt = 0; alt < recog_data.n_alternatives; alt++)
+	{
+	  const operand_alternative *op_alt
+	  = &recog_op_alt[alt * n_operands];
+	  /* Fetch the reg_class for each entry and check it against the
+	 VPR_REG reg_class.  */
+	  if (alternative_class (op_alt, op) != VPR_REG)
+	requires_vpr = false;
+	}
+  /* If all alternatives of the insn require the VPR reg for this operand,
+	 it means that either this is a VPR-generating instruction, like a vctp
+	 or vcmp, or it is a VPT-predicated instruction.  Return the sub-rtx
+	 of the VPR reg operand.  */
+  if (requires_vpr)
+	return recog_data.operand[op];
+}
+  return NULL_RTX;
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 1, so return
+   something only if the VPR reg is an input operand to the insn.  */
+
+static rtx
+arm_get_required_vpr_reg_param (rtx_insn *insn)
+{
+  return arm_get_required_vpr_reg (insn, 1);
+}
+
+/* Wrapper function of arm_get_required_vpr_reg with TYPE == 2, so return
+  

[PATCH 0/2] arm, doloop: Add support for MVE Tail-Predicated Low Overhead Loops

2024-05-23 Thread Andre Vieira

Hi,

  We held these two patches back in stage 4 because they touched
target-agnostic code, though I am quite confident they will not affect other
targets.  Given stage one has reopened, I am reposting them; I rebased them and
they apply cleanly on trunk.  No changes from the previously reviewed patches.

  OK for trunk?

Andre Vieira (2):
  doloop: Add support for predicated vectorized loops
  arm: Add support for MVE Tail-Predicated Low Overhead Loops

 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/df-core.cc|   15 +
 gcc/df.h  |1 +
 gcc/loop-doloop.cc|  164 ++-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 26 files changed, 3434 insertions(+), 149 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int32x4.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int64x2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int8x16.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c

-- 
2.17.1


[PATCH 1/2] doloop: Add support for predicated vectorized loops

2024-05-23 Thread Andre Vieira

This patch adds support in the target agnostic doloop pass for the detection of
predicated vectorized hardware loops.  Arm is currently the only target that
will make use of this feature.

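A way to see why doloop_modify needs the GTU support mentioned below: once the
counter counts elements and is decremented by the full vector width, it
generally never passes through zero, so the usual decrement-and-branch-while-
not-zero test cannot terminate the loop.  A small standalone illustration (the
numbers are made up):

```
#include <stdio.h>

int main (void)
{
  unsigned int counter = 21;	/* element count, not an iteration count */
  const unsigned int step = 8;	/* elements consumed per vector iteration */

  /* With a step of 8, a (ne reg 0) loop-end test never fires:
     21 -> 13 -> 5 -> 4294967293 -> ...  The exit test therefore has to
     be an unsigned magnitude comparison rather than a compare with 0.  */
  for (int i = 0; i < 5; i++)
    {
      printf ("counter = %u\n", counter);
      counter -= step;
    }
  return 0;
}
```
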
gcc/ChangeLog:

* df-core.cc (df_bb_regno_only_def_find): New helper function.
* df.h (df_bb_regno_only_def_find): Declare new function.
* loop-doloop.cc (doloop_condition_get): Add support for detecting
predicated vectorized hardware loops.
(doloop_modify): Add support for GTU condition checks.
(doloop_optimize): Update costing computation to support alterations to
desc->niter_expr by the backend.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/df-core.cc |  15 +
 gcc/df.h   |   1 +
 gcc/loop-doloop.cc | 164 +++--
 3 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index f0eb4c93957..b0e8a88d433 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -1964,6 +1964,21 @@ df_bb_regno_last_def_find (basic_block bb, unsigned int regno)
   return NULL;
 }
 
+/* Return the one and only def of REGNO within BB.  If there is no def or
+   there are multiple defs, return NULL.  */
+
+df_ref
+df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
+{
+  df_ref temp = df_bb_regno_first_def_find (bb, regno);
+  if (!temp)
+return NULL;
+  else if (temp == df_bb_regno_last_def_find (bb, regno))
+return temp;
+  else
+return NULL;
+}
+
 /* Finds the reference corresponding to the definition of REG in INSN.
DF is the dataflow object.  */
 
diff --git a/gcc/df.h b/gcc/df.h
index 84e5aa8b524..c4e690b40cf 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -987,6 +987,7 @@ extern void df_check_cfg_clean (void);
 #endif
 extern df_ref df_bb_regno_first_def_find (basic_block, unsigned int);
 extern df_ref df_bb_regno_last_def_find (basic_block, unsigned int);
+extern df_ref df_bb_regno_only_def_find (basic_block, unsigned int);
 extern df_ref df_find_def (rtx_insn *, rtx);
 extern bool df_reg_defined (rtx_insn *, rtx);
 extern df_ref df_find_use (rtx_insn *, rtx);
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 529e810e530..8953e1de960 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -85,10 +85,10 @@ doloop_condition_get (rtx_insn *doloop_pat)
  forms:
 
  1)  (parallel [(set (pc) (if_then_else (condition)
-	  			(label_ref (label))
-(pc)))
-	 (set (reg) (plus (reg) (const_int -1)))
-	 (additional clobbers and uses)])
+	(label_ref (label))
+	(pc)))
+		 (set (reg) (plus (reg) (const_int -1)))
+		 (additional clobbers and uses)])
 
  The branch must be the first entry of the parallel (also required
  by jump.cc), and the second entry of the parallel must be a set of
@@ -96,19 +96,33 @@ doloop_condition_get (rtx_insn *doloop_pat)
  the loop counter in an if_then_else too.
 
  2)  (set (reg) (plus (reg) (const_int -1))
- (set (pc) (if_then_else (reg != 0)
-	 (label_ref (label))
-			 (pc))).  
+	 (set (pc) (if_then_else (reg != 0)
+ (label_ref (label))
+ (pc))).
 
- Some targets (ARM) do the comparison before the branch, as in the
+ 3) Some targets (Arm) do the comparison before the branch, as in the
  following form:
 
- 3) (parallel [(set (cc) (compare ((plus (reg) (const_int -1), 0)))
-   (set (reg) (plus (reg) (const_int -1)))])
-(set (pc) (if_then_else (cc == NE)
-(label_ref (label))
-(pc))) */
-
+ (parallel [(set (cc) (compare (plus (reg) (const_int -1)) 0))
+		(set (reg) (plus (reg) (const_int -1)))])
+ (set (pc) (if_then_else (cc == NE)
+			 (label_ref (label))
+			 (pc)))
+
+  4) This form supports a construct that is used to represent a vectorized
+  do loop with predication; however, we do not need to care about the
+  details of the predication here.
+  Arm uses this construct to support MVE tail predication.
+
+  (parallel
+   [(set (pc)
+	 (if_then_else (gtu (plus (reg) (const_int -n))
+			    (const_int n-1))
+		       (label_ref)
+		       (pc)))
+	(set (reg) (plus (reg) (const_int -n)))
+	(additional clobbers and uses)])
+ */
   pattern = PATTERN (doloop_pat);
 
   if (GET_CODE (pattern) != PARALLEL)
@@ -173,15 +187,17 @@ doloop_condition_get (rtx_insn *doloop_pat)
   if (! REG_P (reg))
 return 0;
 
-  /* Check if something = (plus (reg) (const_int -1)).
+  /* Check if something = (plus (reg) (const_int -n)).
  On IA-64, this decrement is wrapped in an if_then_else.  */
   inc_src = SET_SRC (inc);
   if (GET_CODE (inc_src) == IF_THEN_ELSE)
 inc_src = XEXP (inc_src, 1);
   if (GET_CODE (inc_src) != PLUS
-  || XEXP (inc_src, 0) != reg
-  || XEXP (inc_src, 1) != constm1_rtx)
+  || !rtx_equal_p (XEXP (inc_src, 0), reg)
+  || !CONST_IN

[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2024-05-23 Thread Andre Vieira

This patch adds support for MVE Tail-Predicated Low Overhead Loops by using the
doloop functionality added to support predicated vectorized hardware loops.

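The element accounting this relies on, sketched in plain C (an illustration of
the counting only, not a definition of the DLSTP/LETP hardware behaviour):

```
#include <stdio.h>

int main (void)
{
  int n = 21;		/* total elements; DLSTP.16 would put this in LR */
  const int lanes = 8;	/* a 128-bit vector of 16-bit elements */

  while (n > 0)
    {
      /* On the final iteration the loop tail is predicated: only the
	 remaining lanes are active, the others are masked off.  */
      int active = n < lanes ? n : lanes;
      printf ("iteration processes %d active lane(s)\n", active);
      n -= lanes;	/* LETP decrements LR by the number of lanes */
    }
  return 0;
}
```
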
gcc/ChangeLog:

* config/arm/arm-protos.h (arm_target_bb_ok_for_lob): Change
declaration to pass basic_block.
(arm_attempt_dlstp_transform): New declaration.
* config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): Define targethook.
(TARGET_PREDICT_DOLOOP_P): Likewise.
(arm_target_bb_ok_for_lob): Adapt condition.
(arm_mve_get_vctp_lanes): New function.
(arm_dl_usage_type): New internal enum.
(arm_get_required_vpr_reg): New function.
(arm_get_required_vpr_reg_param): New function.
(arm_get_required_vpr_reg_ret_val): New function.
(arm_mve_get_loop_vctp): New function.
(arm_mve_insn_predicated_by): New function.
(arm_mve_across_lane_insn_p): New function.
(arm_mve_load_store_insn_p): New function.
(arm_mve_impl_pred_on_outputs_p): New function.
(arm_mve_impl_pred_on_inputs_p): New function.
(arm_last_vect_def_insn): New function.
(arm_mve_impl_predicated_p): New function.
(arm_mve_check_reg_origin_is_num_elems): New function.
(arm_mve_dlstp_check_inc_counter): New function.
(arm_mve_dlstp_check_dec_counter): New function.
(arm_mve_loop_valid_for_dlstp): New function.
(arm_predict_doloop_p): New function.
(arm_loop_unroll_adjust): New function.
(arm_emit_mve_unpredicated_insn_to_seq): New function.
(arm_attempt_dlstp_transform): New function.
* config/arm/arm.opt (mdlstp): New option.
* config/arm/iterators.md (dlstp_elemsize, letp_num_lanes,
letp_num_lanes_neg, letp_num_lanes_minus_1): New attributes.
(DLSTP, LETP): New iterators.
(predicated_doloop_end_internal): New pattern.
(dlstp_insn): New pattern.
* config/arm/thumb2.md (doloop_end): Adapt to support tail-predicated
loops.
(doloop_begin): Likewise.
* config/arm/types.md (mve_misc): New mve type to represent
predicated_loop_end insn sequences.
* config/arm/unspecs.md:
(DLSTP8, DLSTP16, DLSTP32, DLSTP64,
LETP8, LETP16, LETP32, LETP64): New unspecs for DLSTP and LETP.

gcc/testsuite/ChangeLog:

* gcc.target/arm/lob.h: Add new helpers.
* gcc.target/arm/lob1.c: Use new helpers.
* gcc.target/arm/lob6.c: Likewise.
* gcc.target/arm/dlstp-compile-asm-1.c: New test.
* gcc.target/arm/dlstp-compile-asm-2.c: New test.
* gcc.target/arm/dlstp-compile-asm-3.c: New test.
* gcc.target/arm/dlstp-int8x16.c: New test.
* gcc.target/arm/dlstp-int8x16-run.c: New test.
* gcc.target/arm/dlstp-int16x8.c: New test.
* gcc.target/arm/dlstp-int16x8-run.c: New test.
* gcc.target/arm/dlstp-int32x4.c: New test.
* gcc.target/arm/dlstp-int32x4-run.c: New test.
* gcc.target/arm/dlstp-int64x2.c: New test.
* gcc.target/arm/dlstp-int64x2-run.c: New test.
* gcc.target/arm/dlstp-invalid-asm.c: New test.

Co-authored-by: Stam Markianos-Wright 
---
 gcc/config/arm/arm-protos.h   |4 +-
 gcc/config/arm/arm.cc | 1249 -
 gcc/config/arm/arm.opt|3 +
 gcc/config/arm/iterators.md   |   15 +
 gcc/config/arm/mve.md |   50 +
 gcc/config/arm/thumb2.md  |  138 +-
 gcc/config/arm/types.md   |6 +-
 gcc/config/arm/unspecs.md |   14 +-
 gcc/testsuite/gcc.target/arm/lob.h|  128 +-
 gcc/testsuite/gcc.target/arm/lob1.c   |   23 +-
 gcc/testsuite/gcc.target/arm/lob6.c   |8 +-
 .../gcc.target/arm/mve/dlstp-compile-asm-1.c  |  146 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  749 ++
 .../gcc.target/arm/mve/dlstp-compile-asm-3.c  |   46 +
 .../gcc.target/arm/mve/dlstp-int16x8-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int16x8.c|   31 +
 .../gcc.target/arm/mve/dlstp-int32x4-run.c|   45 +
 .../gcc.target/arm/mve/dlstp-int32x4.c|   31 +
 .../gcc.target/arm/mve/dlstp-int64x2-run.c|   48 +
 .../gcc.target/arm/mve/dlstp-int64x2.c|   28 +
 .../gcc.target/arm/mve/dlstp-int8x16-run.c|   44 +
 .../gcc.target/arm/mve/dlstp-int8x16.c|   32 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c|  521 +++
 23 files changed, 3321 insertions(+), 82 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-3.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8-run.c
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/dlstp-int16x8.c
 create mode 100644 gcc/testsuite/gcc.target

Re: [PATCH v2][GCC] Algorithmic optimization in match and simplify

2015-09-16 Thread Andre Vieira

On 03/09/15 12:11, Andre Vieira wrote:

On 01/09/15 15:01, Richard Biener wrote:

On Tue, Sep 1, 2015 at 3:40 PM, Andre Vieira
 wrote:

Hi Marc,

On 28/08/15 19:07, Marc Glisse wrote:


(not a review, I haven't even read the whole patch)

On Fri, 28 Aug 2015, Andre Vieira wrote:


2015-08-03  Andre Vieira  

* match.pd: Added new patterns:
  ((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
  (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)



+(for op0 (rshift rshift lshift lshift bit_and bit_and)
+ op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
+ op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)

You can nest for-loops, it seems clearer as:
(for op0 (rshift lshift bit_and)
 (for op1 (bit_ior bit_xor)
  op2 (bit_xor bit_ior)



Will do, thank you for pointing it out.



+(simplify
+ (op2:c
+  (op1:c
+   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)

I suspect you will want more :s (single_use) and less :c (canonicalization
should put constants in second position).


I can't find the definition of :s (single_use).


Sorry for that - didn't get along updating it yet :/  It restricts matching to
sub-expressions that have a single-use.  So

+  a &= 0xd123;
+  unsigned short tem = a ^ 0x6040;
+  a = tem | 0xc031; /* Simplify _not_ to ((a & 0xd123) | 0xe071).  */
... use of tem ...

we shouldn't do the simplification here because the expression
(a & 0x123) ^ 0x6040 is kept live by 'tem'.


GCC internals do point out
that canonicalization does put constants in the second position, didn't see
that first. Thank you for pointing it out.


+   C1 = wi::bit_and_not (C1,C2);

Space after ','.


Will do.


Having wide_int_storage in many places is surprising, I can't find similar
code anywhere else in gcc.




I tried looking for examples of something similar, I think I ended up using
wide_int because I was able to convert easily to and from it and it has the
"mask" and "wide_int_to_tree" functions. I welcome any suggestions on what I
should be using here for integer constant transformations and comparisons.


Using wide-ints is fine, but you shouldn't need 'wide_int_storage'
constructors - those
are indeed odd.  Is it just for the initializers of wide-ints?

+wide_int zero_mask = wi::zero (prec);
+wide_int C0 = wide_int_storage (@1);
+wide_int C1 = wide_int_storage (@2);
+wide_int C2 = wide_int_storage (@3);
...
+   zero_mask = wide_int_storage (wi::mask (C0.to_uhwi (), false, prec));

tree_to_uhwi (@1) should do the trick as well

+   C1 = wi::bit_and_not (C1,C2);
+   cst_emit = wi::bit_or (C1, C2);

the ops should be replacable with @2 and @3, the store to C1 obviously not
(but you can use a tree temporary and use wide_int_to_tree here to avoid
the back-and-forth for the case where C1 is not assigned to).

Note that transforms only doing association are prone to endless recursion
in case some other pattern does the reverse op...

Richard.



BR,
Andre




Thank you for all the comments, see reworked version:

Two new algorithmic optimisations:
1.((X op0 C0) op1 C1) op2 C2)
  with op0 = {&, >>, <<}, op1 = {|,^}, op2 = {|,^} and op1 != op2
  zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
  and 0's otherwise.
  if (op1 == '^') C1 &= ~C2 (Only changed if actually emitted)
  if ((C1 & ~zero_mask) == 0) then emit (X op0 C0) op2 (C1 op2 C2)
  if ((C2 & ~zero_mask) == 0) then emit (X op0 C0) op1 (C1 op2 C2)
2. (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)


This patch does two algorithmic optimisations that target patterns like:
(((x << 24) | 0x00FFFFFF) ^ 0xFF000000) and ((x ^ 0x40000002) >> 1) |
0x80000000.

The transformation uses the knowledge of which bits are zero after
operations like (X {&,<<,(unsigned)>>}) to combine constants, reducing
run-time operations.
The two examples above would be transformed into (X << 24) ^ 0xffffffff
and (X >> 1) ^ 0xa0000001 respectively.

The second transformation enables more applications of the first. Also
some targets may benefit from delaying shift operations. I am aware that
such an optimization, in combination with one or more optimizations that
cause the reverse transformation, may lead to an infinite loop, though
such behavior has not been detected during regression testing and
bootstrapping on aarch64.

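Both equivalences are easy to sanity-check in plain C.  The checker below is
mine, added purely for illustration (it is not part of the patch); it also
restates the zero_mask reasoning in comment form:

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  /* Sampled sweep over the 32-bit range.  */
  for (uint64_t i = 0; i <= UINT32_MAX; i += 0x10001)
    {
      uint32_t x = (uint32_t) i;
      /* zero_mask: x << 24 has its low 24 bits known to be zero, and
	 (x ^ c) >> 1 has its top bit known to be zero, which is what
	 lets the constants be folded together.  */
      assert ((((x << 24) | 0x00FFFFFFu) ^ 0xFF000000u)
	      == ((x << 24) ^ 0xFFFFFFFFu));
      assert ((((x ^ 0x40000002u) >> 1) | 0x80000000u)
	      == ((x >> 1) ^ 0xA0000001u));
    }
  puts ("ok");
  return 0;
}
```
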
gcc/ChangeLog:

2015-08-03  Andre Vieira  

* match.pd: Added new patterns:
  ((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
  (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)

gcc/testsuite/ChangeLog:

2015-08-03  Andre Vieira  
  Hale Wang  

* gcc.dg/tree-ssa/forwprop-33.c: New test.



Ping.



Re: [PATCH v2][GCC] Algorithmic optimization in match and simplify

2015-09-25 Thread Andre Vieira

On 17/09/15 10:46, Richard Biener wrote:

On Thu, Sep 3, 2015 at 1:11 PM, Andre Vieira
 wrote:

On 01/09/15 15:01, Richard Biener wrote:


On Tue, Sep 1, 2015 at 3:40 PM, Andre Vieira
 wrote:


Hi Marc,

On 28/08/15 19:07, Marc Glisse wrote:



(not a review, I haven't even read the whole patch)

On Fri, 28 Aug 2015, Andre Vieira wrote:


2015-08-03  Andre Vieira  

* match.pd: Added new patterns:
  ((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
  (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>}
C1)




+(for op0 (rshift rshift lshift lshift bit_and bit_and)
+ op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
+ op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)

You can nest for-loops, it seems clearer as:
(for op0 (rshift lshift bit_and)
 (for op1 (bit_ior bit_xor)
  op2 (bit_xor bit_ior)




Will do, thank you for pointing it out.




+(simplify
+ (op2:c
+  (op1:c
+   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)

I suspect you will want more :s (single_use) and less :c
(canonicalization
should put constants in second position).


I can't find the definition of :s (single_use).



Sorry for that - didn't get along updating it yet :/  It restricts
matching to
sub-expressions that have a single-use.  So

+  a &= 0xd123;
+  unsigned short tem = a ^ 0x6040;
+  a = tem | 0xc031; /* Simplify _not_ to ((a & 0xd123) | 0xe071).  */
... use of tem ...

we shouldn't do the simplification here because the expression
(a & 0x123) ^ 0x6040 is kept live by 'tem'.


GCC internals do point out
that canonicalization does put constants in the second position, didn't
see
that first. Thank you for pointing it out.


+   C1 = wi::bit_and_not (C1,C2);

Space after ','.


Will do.


Having wide_int_storage in many places is surprising, I can't find
similar
code anywhere else in gcc.




I tried looking for examples of something similar, I think I ended up
using
wide_int because I was able to convert easily to and from it and it has
the
"mask" and "wide_int_to_tree" functions. I welcome any suggestions on
what I
should be using here for integer constant transformations and
comparisons.



Using wide-ints is fine, but you shouldn't need 'wide_int_storage'
constructors - those
are indeed odd.  Is it just for the initializers of wide-ints?

+wide_int zero_mask = wi::zero (prec);
+wide_int C0 = wide_int_storage (@1);
+wide_int C1 = wide_int_storage (@2);
+wide_int C2 = wide_int_storage (@3);
...
+   zero_mask = wide_int_storage (wi::mask (C0.to_uhwi (), false,
prec));

tree_to_uhwi (@1) should do the trick as well

+   C1 = wi::bit_and_not (C1,C2);
+   cst_emit = wi::bit_or (C1, C2);

the ops should be replacable with @2 and @3, the store to C1 obviously not
(but you can use a tree temporary and use wide_int_to_tree here to avoid
the back-and-forth for the case where C1 is not assigned to).

Note that transforms only doing association are prone to endless recursion
in case some other pattern does the reverse op...

Richard.



BR,
Andre




Thank you for all the comments, see reworked version:

Two new algorithmic optimisations:
   1.((X op0 C0) op1 C1) op2 C2)
 with op0 = {&, >>, <<}, op1 = {|,^}, op2 = {|,^} and op1 != op2
 zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
 and 0's otherwise.
 if (op1 == '^') C1 &= ~C2 (Only changed if actually emitted)
 if ((C1 & ~zero_mask) == 0) then emit (X op0 C0) op2 (C1 op2 C2)
 if ((C2 & ~zero_mask) == 0) then emit (X op0 C0) op1 (C1 op2 C2)
   2. (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)


This patch does two algorithmic optimisations that target patterns like:
(((x << 24) | 0x00FFFFFF) ^ 0xFF000000) and ((x ^ 0x40000002) >> 1) |
0x80000000.

The transformation uses the knowledge of which bits are zero after
operations like (X {&,<<,(unsigned)>>}) to combine constants, reducing
run-time operations.
The two examples above would be transformed into (X << 24) ^ 0xffffffff and
(X >> 1) ^ 0xa0000001 respectively.

The second transformation enables more applications of the first. Also some
targets may benefit from delaying shift operations. I am aware that such an
optimization, in combination with one or more optimizations that cause the
reverse transformation, may lead to an infinite loop. Though such behavior
has not been detected during regression testing and bootstrapping on
aarch64.


+/* (X bit_op C0) rshift C1 -> (X rshift C1) bit_op (C0 rshift C1) */
+(for bit_op (bit_ior bit_xor bit_and)
+(simplify
+ (rshift (bit_op:s @0 INTEGER_CST@1) INTEGER_CST@2)
+ (bit_op
+  (rshift @0 @2)
+  { wide_int_to_tree (type, wi::rshift (@1, @2, TYPE_SIGN (type))); })))
+
+

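The pattern above is the second transformation: distributing a (logical) right
shift over a bitwise operation with a constant.  A quick standalone spot-check,
again only an illustration and not part of the patch:

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  const uint32_t c0 = 0x40000002u;
  for (uint64_t i = 0; i <= UINT32_MAX; i += 0xFFF1)	/* sampled sweep */
    for (unsigned int s = 0; s < 32; s++)
      {
	uint32_t x = (uint32_t) i;
	assert (((x | c0) >> s) == ((x >> s) | (c0 >> s)));
	assert (((x ^ c0) >> s) == ((x >> s) ^ (c0 >> s)));
	assert (((x & c0) >> s) == ((x >> s) & (c0 >> s)));
      }
  puts ("ok");
  return 0;
}
```
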
Re: [gcc-5-branch][PATCH][AARCH64]Fix for branch offsets over 1 MiB

2015-09-25 Thread Andre Vieira

Ping.

On 11/09/15 18:15, Andre Vieira wrote:

Conditional branches have a maximum range of [-1048576, 1048572]; any
destination further away cannot be reached by them.
To be able to have conditional branches in very large functions, we
invert the condition and change the destination to jump over an
unconditional branch to the original, far away, destination.

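In C-like terms the transformation is just branch inversion plus a branch
island; a sketch of the idea only (the real change is in the aarch64 .md
patterns and aarch64_gen_far_branch, not in C):

```
extern void far_away_work (void);	/* stands in for the distant target */

/* What the compiler wants to emit, but cannot when the target is more
   than 1 MiB away (out of range for a conditional branch):  */
void before (int x)
{
  if (x == 0)
    goto far_label;
  return;
far_label:
  far_away_work ();
}

/* What is emitted instead: invert the condition and hop over an
   unconditional branch, which has a much larger range.  */
void after (int x)
{
  if (x != 0)
    goto skip;
  goto far_label;	/* unconditional B reaches +/-128 MiB on AArch64 */
far_label:
  far_away_work ();
skip:
  return;
}
```
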
This patch backports the fix from trunk to the gcc-5-branch.
The original patch is at:
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01493.html

gcc/ChangeLog:
2015-09-09  Andre Vieira  

Backport from mainline:
2015-08-27  Ramana Radhakrishnan  
Andre Vieira  

* config/aarch64/aarch64.md (*condjump): Handle functions > 1 MiB.
(*cb1): Likewise.
(*tb1): Likewise.
(*cb1): Likewise.
* config/aarch64/iterators.md (inv_cb): New code attribute.
(inv_tb): Likewise.
* config/aarch64/aarch64.c (aarch64_gen_far_branch): New.
* config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New.


gcc/testsuite/ChangeLog:
2015-09-09  Andre Vieira  

Backport from mainline:
2015-08-27  Andre Vieira  

* gcc.target/aarch64/long_branch_1.c: New test.





[PATCH V3][GCC] Algorithmic optimization in match and simplify

2015-10-07 Thread Andre Vieira

On 25/09/15 12:42, Richard Biener wrote:

On Fri, Sep 25, 2015 at 1:30 PM, Andre Vieira
 wrote:

On 17/09/15 10:46, Richard Biener wrote:


On Thu, Sep 3, 2015 at 1:11 PM, Andre Vieira
 wrote:


On 01/09/15 15:01, Richard Biener wrote:



On Tue, Sep 1, 2015 at 3:40 PM, Andre Vieira
 wrote:



Hi Marc,

On 28/08/15 19:07, Marc Glisse wrote:




(not a review, I haven't even read the whole patch)

On Fri, 28 Aug 2015, Andre Vieira wrote:


2015-08-03  Andre Vieira  

 * match.pd: Added new patterns:
   ((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
   (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>}
C1)





+(for op0 (rshift rshift lshift lshift bit_and bit_and)
+ op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
+ op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)

You can nest for-loops, it seems clearer as:
(for op0 (rshift lshift bit_and)
  (for op1 (bit_ior bit_xor)
   op2 (bit_xor bit_ior)





Will do, thank you for pointing it out.





+(simplify
+ (op2:c
+  (op1:c
+   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)

I suspect you will want more :s (single_use) and less :c
(canonicalization
should put constants in second position).


I can't find the definition of :s (single_use).




Sorry for that - didn't get along updating it yet :/  It restricts
matching to
sub-expressions that have a single-use.  So

+  a &= 0xd123;
+  unsigned short tem = a ^ 0x6040;
+  a = tem | 0xc031; /* Simplify _not_ to ((a & 0xd123) | 0xe071).  */
... use of tem ...

we shouldn't do the simplification here because the expression
(a & 0x123) ^ 0x6040 is kept live by 'tem'.


GCC internals do point out
that canonicalization does put constants in the second position, didn't
see
that first. Thank you for pointing it out.


+   C1 = wi::bit_and_not (C1,C2);

Space after ','.


Will do.


Having wide_int_storage in many places is surprising, I can't find
similar
code anywhere else in gcc.




I tried looking for examples of something similar, I think I ended up
using
wide_int because I was able to convert easily to and from it and it has
the
"mask" and "wide_int_to_tree" functions. I welcome any suggestions on
what I
should be using here for integer constant transformations and
comparisons.




Using wide-ints is fine, but you shouldn't need 'wide_int_storage'
constructors - those
are indeed odd.  Is it just for the initializers of wide-ints?

+wide_int zero_mask = wi::zero (prec);
+wide_int C0 = wide_int_storage (@1);
+wide_int C1 = wide_int_storage (@2);
+wide_int C2 = wide_int_storage (@3);
...
+   zero_mask = wide_int_storage (wi::mask (C0.to_uhwi (), false,
prec));

tree_to_uhwi (@1) should do the trick as well

+   C1 = wi::bit_and_not (C1,C2);
+   cst_emit = wi::bit_or (C1, C2);

the ops should be replacable with @2 and @3, the store to C1 obviously
not
(but you can use a tree temporary and use wide_int_to_tree here to avoid
the back-and-forth for the case where C1 is not assigned to).

Note that transforms only doing association are prone to endless
recursion
in case some other pattern does the reverse op...

Richard.



BR,
Andre




Thank you for all the comments, see reworked version:

Two new algorithmic optimisations:
1.((X op0 C0) op1 C1) op2 C2)
  with op0 = {&, >>, <<}, op1 = {|,^}, op2 = {|,^} and op1 != op2
  zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
  and 0's otherwise.
  if (op1 == '^') C1 &= ~C2 (Only changed if actually emitted)
  if ((C1 & ~zero_mask) == 0) then emit (X op0 C0) op2 (C1 op2 C2)
  if ((C2 & ~zero_mask) == 0) then emit (X op0 C0) op1 (C1 op2 C2)
2. (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)


This patch does two algorithmic optimisations that target patterns like:
(((x << 24) | 0x00FFFFFF) ^ 0xFF000000) and ((x ^ 0x40000002) >> 1) |
0x80000000.

The transformation uses the knowledge of which bits are zero after
operations like (X {&,<<,(unsigned)>>}) to combine constants, reducing
run-time operations.
The two examples above would be transformed into (X << 24) ^ 0xffffffff and
(X >> 1) ^ 0xa0000001 respectively.

The second transformation enables more applications of the first. Also
some
targets may benefit from delaying shift operations. I am aware that such
an
optimization, in combination with one or more optimizations that cause
the
reverse transformation, may lead to an infinite loop. Though such
behavior
has not been detected during regression testing and bootstrapping on
aarch64.



+/* (X bit_op C0) rshift C1 -> (X rshift C0) bit_op (C0 rshift C1) */
+(for bit_op (bit_ior bit_xor bit_and)
+(simplify
+ (rshift (bit_op:s @0 INTEGER_CST@

[PATCHv2, ARM, libgcc] New aeabi_idiv function for armv6-m

2015-10-13 Thread Andre Vieira
This patch ports the aeabi_idiv routine from Linaro Cortex-Strings 
(https://git.linaro.org/toolchain/cortex-strings.git), which was 
contributed by ARM under the Free BSD license.


The new aeabi_idiv routine is used to replace the one in 
libgcc/config/arm/lib1funcs.S. This replacement happens within the 
Thumb1 wrapper. The new routine is under the LGPLv3 license.


The main advantage of this version is that it can improve the 
performance of the aeabi_idiv function for Thumb1. This solution also
increases code size, so it is only used when __OPTIMIZE_SIZE__ is not
defined.


Make check passed for armv6-m.

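The algorithm behind the new routine is restoring division, with the trial
comparisons done on a shifted-down copy of the dividend (the lsr/cmp pairs in
the macros below) and the main loop unrolled four bits at a time.  A hedged
standalone C model of the idea, not a line-by-line transcription of the
assembly:

```
#include <assert.h>
#include <stdio.h>

static unsigned int
udiv_model (unsigned int dividend, unsigned int divisor)
{
  unsigned int result = 0;
  int n = 0;

  assert (divisor != 0);
  /* BranchToDiv: find how far the divisor can be shifted up while still
     fitting under the dividend (compare against dividend >> n).  */
  while (n < 31 && (dividend >> (n + 1)) >= divisor)
    n++;
  /* DoDiv steps, one bit per iteration here (the assembly unrolls these
     in blocks of four): subtract divisor << n when it fits and shift the
     matching quotient bit into the result.  */
  for (; n >= 0; n--)
    {
      result <<= 1;
      if ((dividend >> n) >= divisor)
	{
	  dividend -= divisor << n;
	  result |= 1;
	}
    }
  return result;	/* the remainder is left in 'dividend' */
}

int main (void)
{
  for (unsigned int a = 0; a < 5000; a += 3)
    for (unsigned int b = 1; b < 80; b++)
      assert (udiv_model (a, b) == a / b);
  puts ("ok");
  return 0;
}
```
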
libgcc/ChangeLog:
2015-08-10  Hale Wang  
Andre Vieira  

  * config/arm/lib1funcs.S: Add new wrapper.
From 832a3d6af6f06399f70b5a4ac3727d55960c93b7 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 21 Aug 2015 14:23:28 +0100
Subject: [PATCH] new wrapper idivmod

---
 libgcc/config/arm/lib1funcs.S | 250 --
 1 file changed, 217 insertions(+), 33 deletions(-)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 252efcbd5385cc58a5ce1e48c6816d36a6f4c797..c9e544114590da8cde88382bea0f67206e593816 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -306,34 +306,12 @@ LSYM(Lend_fde):
 #ifdef __ARM_EABI__
 .macro THUMB_LDIV0 name signed
 #if defined(__ARM_ARCH_6M__)
-	.ifc \signed, unsigned
-	cmp	r0, #0
-	beq	1f
-	mov	r0, #0
-	mvn	r0, r0		@ 0xffffffff
-1:
-	.else
-	cmp	r0, #0
-	beq	2f
-	blt	3f
+
+	push	{r0, lr}
 	mov	r0, #0
-	mvn	r0, r0
-	lsr	r0, r0, #1	@ 0x7fffffff
-	b	2f
-3:	mov	r0, #0x80
-	lsl	r0, r0, #24	@ 0x80000000
-2:
-	.endif
-	push	{r0, r1, r2}
-	ldr	r0, 4f
-	adr	r1, 4f
-	add	r0, r1
-	str	r0, [sp, #8]
-	@ We know we are not on armv4t, so pop pc is safe.
-	pop	{r0, r1, pc}
-	.align	2
-4:
-	.word	__aeabi_idiv0 - 4b
+	bl	SYM(__aeabi_idiv0)
+	pop	{r1, pc}
+
 #elif defined(__thumb2__)
 	.syntax unified
 	.ifc \signed, unsigned
@@ -945,7 +923,170 @@ LSYM(Lover7):
 	add	dividend, work
   .endif
 LSYM(Lgot_result):
-.endm	
+.endm
+
+#if defined(__prefer_thumb__) && !defined(__OPTIMIZE_SIZE__)
+/* If performance is preferred, the following functions are provided.  */
+
+/* Branch to div(n); jump to the label if curbit is lower than the divisor.  */
+.macro BranchToDiv n, label
+	lsr	curbit, dividend, \n
+	cmp	curbit, divisor
+	blo	\label
+.endm
+
+/* Body of div(n).  Shift the divisor left by n bits and compare it with
+   the dividend.  Update the dividend with the subtraction result.  */
+.macro DoDiv n
+	lsr	curbit, dividend, \n
+	cmp	curbit, divisor
+	bcc	1f
+	lsl	curbit, divisor, \n
+	sub	dividend, dividend, curbit
+
+1:	adc	result, result
+.endm
+
+/* The body of division with positive divisor.  Unless the divisor is very
+   big, shift it up in multiples of four bits, since this is the amount of
+   unwinding in the main division loop.  Continue shifting until the divisor
+   is larger than the dividend.  */
+.macro THUMB1_Div_Positive
+	mov	result, #0
+	BranchToDiv #1, LSYM(Lthumb1_div1)
+	BranchToDiv #4, LSYM(Lthumb1_div4)
+	BranchToDiv #8, LSYM(Lthumb1_div8)
+	BranchToDiv #12, LSYM(Lthumb1_div12)
+	BranchToDiv #16, LSYM(Lthumb1_div16)
+LSYM(Lthumb1_div_large_positive):
+	mov	result, #0xff
+	lsl	divisor, divisor, #8
+	rev	result, result
+	lsr	curbit, dividend, #16
+	cmp	curbit, divisor
+	blo	1f
+	asr	result, #8
+	lsl	divisor, divisor, #8
+	beq	LSYM(Ldivbyzero_waypoint)
+
+1:	lsr	curbit, dividend, #12
+	cmp	curbit, divisor
+	blo	LSYM(Lthumb1_div12)
+	b	LSYM(Lthumb1_div16)
+LSYM(Lthumb1_div_loop):
+	lsr	divisor, divisor, #8
+LSYM(Lthumb1_div16):
+	DoDiv	#15
+	DoDiv	#14
+	DoDiv	#13
+	DoDiv	#12
+LSYM(Lthumb1_div12):
+	DoDiv	#11
+	DoDiv	#10
+	DoDiv	#9
+	DoDiv	#8
+	bcs	LSYM(Lthumb1_div_loop)
+LSYM(Lthumb1_div8):
+	DoDiv	#7
+	DoDiv	#6
+	DoDiv	#5
+LSYM(Lthumb1_div5):
+	DoDiv	#4
+LSYM(Lthumb1_div4):
+	DoDiv	#3
+LSYM(Lthumb1_div3):
+	DoDiv	#2
+LSYM(Lthumb1_div2):
+	DoDiv	#1
+LSYM(Lthumb1_div1):
+	sub	divisor, dividend, divisor
+	bcs	1f
+	cpy	divisor, dividend
+
+1:	adc	result, result
+	cpy	dividend, result
+	RET
+
+LSYM(Ldivbyzero_waypoint):
+	b	LSYM(Ldiv0)
+.endm
+
+/* The body of division with negative divisor.  Similar to
+   THUMB1_Div_Positive except that the shift steps are in multiples
+   of six bits.  */
+.macro THUMB1_Div_Negative
+	lsr	result, divisor, #31
+	beq	1f
+	neg	divisor, divisor
+
+1:	asr	curbit, dividend, #32
+	bcc	2f
+	neg	dividend, dividend
+
+2:	eor	curbit, result
+	mov	result, #0
+	cpy	ip, curbit
+	BranchToDiv #4, LSYM(Lthumb1_div_negative4)
+	BranchToDiv #8, LSYM(Lthumb1_div_negative8)
+LSYM(Lthumb1_div_large):
+	mov	result, #0xfc
+	lsl	divisor, divisor, #6
+	rev	result, result
+	lsr	curbit, dividend, #8
+	cmp	curbit, divisor
+	blo	LSYM(Lthumb1_div_negative8)
+
+	lsl	divisor, divisor, #6
+	asr	result, result, #6
+	cmp	curbit, divisor
+	blo	LSYM(Lthumb1_div_negative8)
+
+	lsl	divisor, divisor, #6
+	asr	result, result, #6
+	cmp	curbit, divisor
+	
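The macros above unroll a classic restoring (shift-and-subtract)
division.  A compact C model of the scheme (a sketch, not the committed
code; divide-by-zero handling omitted):

```c
static unsigned int result;   /* the quotient register in the asm  */

/* One DoDiv step: if (divisor << n) still fits in the remaining
   dividend, subtract it; then shift the new quotient bit in, as
   'adc result, result' does with the carry flag.  */
static unsigned int
do_div_step (unsigned int dividend, unsigned int divisor, int n)
{
  unsigned int fits = (dividend >> n) >= divisor;
  if (fits)
    dividend -= divisor << n;
  result = (result << 1) | fits;
  return dividend;
}

/* THUMB1_Div_Negative reduced to C: note whether the signs differ,
   divide the magnitudes with the unsigned core, then fix the sign.  */
int
idiv_sketch (int dividend, int divisor)
{
  unsigned int negate = (dividend ^ divisor) < 0;  /* eor of sign bits  */
  unsigned int n = dividend < 0 ? -(unsigned int) dividend
                                : (unsigned int) dividend;
  unsigned int d = divisor < 0 ? -(unsigned int) divisor
                               : (unsigned int) divisor;

  result = 0;
  for (int shift = 31; shift >= 0; shift--)  /* the unrolled DoDiv ladder  */
    n = do_div_step (n, d, shift);

  return negate ? -(int) result : (int) result;
}
```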

[PATCH][ARM] Fix for testcase after r228661

2015-10-20 Thread Andre Vieira

Hi,

This patch addresses PR-67948 by changing the xor-and.c test, initially 
written for a simplify-rtx pattern, to make it pass post r228661 (see 
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00676.html). This test no 
longer triggered the simplify-rtx pattern it was written for prior to 
r228661, though other optimizations did lead to the same assembly the 
test checked for. The optimization added with r228661 matches the 
pattern used in the test and optimizes it to a better and still valid 
sequence. Being unable to easily change the test to trigger the original 
simplify-rtx pattern, I chose to change it to pass with the new produced 
assembly sequence.


This is correct because the transformation is valid and it yields a more 
efficient pattern. However, as I pointed out before, this test no longer 
tests the optimization it was originally intended for.


Tested by running regression tests for armv6.

Is this OK to commit?

Thanks,
Andre
From 89922547118e716b41ddf6edefb274322193f25c Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Thu, 15 Oct 2015 12:48:26 +0100
Subject: [PATCH] Fix for xor-and.c test

---
 gcc/testsuite/gcc.target/arm/xor-and.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/arm/xor-and.c b/gcc/testsuite/gcc.target/arm/xor-and.c
index 53dff85f8f780fb99a93bbcc24180a3d0d5d3be9..3715530cd7bf9ad8abb24cb21cd51ae3802079e8 100644
--- a/gcc/testsuite/gcc.target/arm/xor-and.c
+++ b/gcc/testsuite/gcc.target/arm/xor-and.c
@@ -10,6 +10,6 @@ unsigned short foo (unsigned short x)
   return x;
 }
 
-/* { dg-final { scan-assembler "orr" } } */
+/* { dg-final { scan-assembler "eor" } } */
 /* { dg-final { scan-assembler-not "mvn" } } */
 /* { dg-final { scan-assembler-not "uxth" } } */
-- 
1.9.1



Re: [PATCH][ARM] Fix for testcase after r228661

2015-10-20 Thread Andre Vieira



On 20/10/15 17:25, Ramana Radhakrishnan wrote:

On Tue, Oct 20, 2015 at 4:52 PM, Andre Vieira
 wrote:

Hi,

This patch addresses PR-67948 by changing the xor-and.c test, initially
written for a simplify-rtx pattern, to make it pass post r228661 (see
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg00676.html). This test no
longer triggered the simplify-rtx pattern it was written for prior to
r228661, though other optimizations did lead to the same assembly the test
checked for. The optimization added with r228661 matches the pattern used in
the test and optimizes it to a better and still valid sequence. Being unable
to easily change the test to trigger the original simplify-rtx pattern, I
chose to change it to pass with the new produced assembly sequence.

This is correct because the transformation is valid and it yields a more
efficient pattern. However, as I pointed out before this test doesn't test
the optimization it originally was intended for.

Tested by running regression tests for armv6.

Is this OK to commit?



Missing Changelog - please remember to put the PR number in the
Changelog in the correct format i.e PR testsuite/67948. Ok with that.

I suspect that the simplify-rtx.c is much less likely to trigger given
your match.pd change, but it will be fun proving that.
Ideally, I'd like to prove the simplify-rtx.c couldn't be triggered 
anymore. Though I wouldn't know where to begin with that. I did spend a 
tiny effort trying to trigger it with a version before the match.pd 
change, but with no success.


regards
ramana



Thanks,
Andre



Here is the ChangeLog (with the PR), sorry for that!

gcc/testsuite/ChangeLog
2015-10-15  Andre Vieira  

PR testsuite/67948
* gcc.target/arm/xor-and.c: Check for eor instead of orr.
From 89922547118e716b41ddf6edefb274322193f25c Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Thu, 15 Oct 2015 12:48:26 +0100
Subject: [PATCH] Fix for xor-and.c test

---
 gcc/testsuite/gcc.target/arm/xor-and.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/arm/xor-and.c b/gcc/testsuite/gcc.target/arm/xor-and.c
index 53dff85f8f780fb99a93bbcc24180a3d0d5d3be9..3715530cd7bf9ad8abb24cb21cd51ae3802079e8 100644
--- a/gcc/testsuite/gcc.target/arm/xor-and.c
+++ b/gcc/testsuite/gcc.target/arm/xor-and.c
@@ -10,6 +10,6 @@ unsigned short foo (unsigned short x)
   return x;
 }
 
-/* { dg-final { scan-assembler "orr" } } */
+/* { dg-final { scan-assembler "eor" } } */
 /* { dg-final { scan-assembler-not "mvn" } } */
 /* { dg-final { scan-assembler-not "uxth" } } */
-- 
1.9.1



Re: [PING][PATCHv2, ARM, libgcc] New aeabi_idiv function for armv6-m

2015-10-27 Thread Andre Vieira

Ping.

BR,
Andre

On 13/10/15 18:01, Andre Vieira wrote:

This patch ports the aeabi_idiv routine from Linaro Cortex-Strings
(https://git.linaro.org/toolchain/cortex-strings.git), which was
contributed by ARM under Free BSD license.

The new aeabi_idiv routine is used to replace the one in
libgcc/config/arm/lib1funcs.S. This replacement happens within the
Thumb1 wrapper. The new routine is under LGPLv3 license.

The main advantage of this version is that it can improve the
performance of the aeabi_idiv function for Thumb1. This solution will
also increase the code size. So it will only be used if
__OPTIMIZE_SIZE__ is not defined.

Make check passed for armv6-m.

libgcc/ChangeLog:
2015-08-10  Hale Wang  
  Andre Vieira  

* config/arm/lib1funcs.S: Add new wrapper.





[PATCH][GCC] Make stackalign test LTO proof

2015-11-12 Thread Andre Vieira

Hi,

  This patch changes this testcase to make sure LTO will not optimize 
away the assignment of the local array to a global variable, an 
assignment which was introduced to make sure stack space was made 
available for the test to work.


  This is correct because LTO is supposed to optimize this global away 
as at link time it knows this global will never be read. By adding a 
read of the global, LTO will no longer optimize it away.
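In other words (a sketch of the idea, with abort standing in for the
test's failure path):

```c
char *g;

int
main (void)
{
  char dummy[64];

  g = dummy;         /* Write-only global: whole-program LTO deletes
                        the store, and the stack slot with it.  */

  /* ... test body that needs the 64 bytes of stack ... */

  if (g != dummy)    /* The added read makes the store observable,
                        so LTO has to keep it.  */
    __builtin_abort ();
  return 0;
}
```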


  Tested by running regressions for this testcase for various ARM targets.

  Is this OK to commit?

  Thanks,
  Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

* gcc.dg/torture/stackalign/builtin-return-1.c: Added read
  to global such that a write is not optimized away by LTO.
From 6fbac447475c3b669baee84aa9d6291c3d09f1ab Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 6 Nov 2015 13:13:47 +0000
Subject: [PATCH] keep the stack testsuite fix

---
 gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c b/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c
index af017532aeb3878ef7ad717a2743661a87a56b7d..1ccd109256de72419a3c71c2c1be6d07c423c005 100644
--- a/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c
+++ b/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c
@@ -39,5 +39,10 @@ int main(void)
   if (bar(1) != 2)
 abort();
 
+  /* Make sure there is a read of the global after the function call to bar
+   * such that LTO does not optimize away the assignment above.  */
+  if (g != dummy)
+abort();
+
   return 0;
 }
-- 
1.9.1



[PATCH][GCC][ARM] testcase memset-inline-10.c uses -mfloat-abi=hard but does not check whether target supports it

2015-11-12 Thread Andre Vieira

Hi,

  This patch changes the memset-inline-10.c testcase to make sure that 
it is only compiled for ARM targets that support -mfloat-abi=hard, using 
the fact that all non-thumb1 targets do.


  This is correct because all targets for which -mthumb causes the 
compiler to use thumb2 will support the generation of FP instructions.


  Tested by running regressions for this testcase for various ARM targets.

  Is this OK to commit?

  Thanks,
  Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

* gcc.target/arm/memset-inline-10.c: Added
dg-require-effective-target arm_thumb2_ok.



Re: [PATCH][GCC][ARM] testcase memset-inline-10.c uses -mfloat-abi=hard but does not check whether target supports it

2015-11-12 Thread Andre Vieira

On 12/11/15 15:08, Andre Vieira wrote:

Hi,

   This patch changes the memset-inline-10.c testcase to make sure that
it is only compiled for ARM targets that support -mfloat-abi=hard using
the fact that all non-thumb1 targets do.

   This is correct because all targets for which -mthumb causes the
compiler to use thumb2 will support the generation of FP instructions.

   Tested by running regressions for this testcase for various ARM targets.

   Is this OK to commit?

   Thanks,
   Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

 * gcc.target/arm/memset-inline-10.c: Added
 dg-require-effective-target arm_thumb2_ok.


Now with attachment, sorry about that.

Cheers,
Andre
From f6515d9cceacf99d213aea1236b7027c7839ab4b Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 6 Nov 2015 14:48:27 +0000
Subject: [PATCH] added check for thumb2_ok

---
 gcc/testsuite/gcc.target/arm/memset-inline-10.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.target/arm/memset-inline-10.c b/gcc/testsuite/gcc.target/arm/memset-inline-10.c
index c1087c8e693fb723ca9396108f5fe872ede167e9..ce51c1d9eeb800cf67790fe06817ae23215399e9 100644
--- a/gcc/testsuite/gcc.target/arm/memset-inline-10.c
+++ b/gcc/testsuite/gcc.target/arm/memset-inline-10.c
@@ -1,4 +1,5 @@
 /* { dg-do compile } */
+/* { dg-require-effective-target arm_thumb2_ok } */
 /* { dg-options "-march=armv7-a -mfloat-abi=hard -mfpu=neon -O2" } */
 /* { dg-skip-if "need SIMD instructions" { *-*-* } { "-mfloat-abi=soft" } { "" } } */
 /* { dg-skip-if "need SIMD instructions" { *-*-* } { "-mfpu=vfp*" } { "" } } */
-- 
1.9.1



[PATCH][GCC][ARM] Disable neon testing for armv7-m

2015-11-16 Thread Andre Vieira

Hi,

  This patch changes the target support mechanism to make it recognize 
any ARM 'M' profile as a non-neon supporting target. The current check 
only tests for armv6 architectures and earlier, and does not account for 
armv7-m.


  This is correct because there is no 'M' profile that supports neon 
and the current test is not sufficient to exclude armv7-m.


  Tested by running regressions for this testcase for various ARM targets.

  Is this OK to commit?

  Thanks,
  Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

* gcc/testsuite/lib/target-supports.exp
  (check_effective_target_arm_neon_ok_nocache): Added check
  for M profile.
From 2c53bb9ba3236919ecf137a4887abf26d4f7fda2 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 13 Nov 2015 11:16:34 +0000
Subject: [PATCH] Disable neon testing for armv7-m

---
 gcc/testsuite/lib/target-supports.exp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 75d506829221e3d02d454631c4bd2acd1a8cedf2..8097a4621b088a93d58d09571cf7aa27b8d5fba6 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -2854,7 +2854,7 @@ proc check_effective_target_arm_neon_ok_nocache { } {
 		int dummy;
 		/* Avoid the case where a test adds -mfpu=neon, but the toolchain is
 		   configured for -mcpu=arm926ej-s, for example.  */
-		#if __ARM_ARCH < 7
+		#if __ARM_ARCH < 7 || __ARM_ARCH_PROFILE == 'M'
 		#error Architecture too old for NEON.
 		#endif
 	} "$flags"] } {
-- 
1.9.1



Re: [PATCH][GCC] Make stackalign test LTO proof

2015-11-16 Thread Andre Vieira

On 13/11/15 10:34, Richard Biener wrote:

On Thu, Nov 12, 2015 at 4:07 PM, Andre Vieira
 wrote:

Hi,

   This patch changes this testcase to make sure LTO will not optimize away
the assignment of the local array to a global variable which was introduced
to make sure stack space was made available for the test to work.

   This is correct because LTO is supposed to optimize this global away as at
link time it knows this global will never be read. By adding a read of the
global, LTO will no longer optimize it away.


But that's only because we can't see that bar doesn't clobber it, else
we would optimize away the check and get here again.  Much better
to mark 'dummy' with __attribute__((used)) and go away with 'g' entirely.

Richard.


   Tested by running regressions for this testcase for various ARM targets.

   Is this OK to commit?

   Thanks,
   Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

 * gcc.dg/torture/stackalign/builtin-return-1.c: Added read
   to global such that a write is not optimized away by LTO.



Hi Richard,

That would be great but __attribute__ ((used)) can only be used for 
static variables and making dummy static would defeat the purpose here.


Cheers,
Andre



Re: [PATCH][GCC][ARM] Disable neon testing for armv7-m

2015-11-16 Thread Andre Vieira

On 16/11/15 12:07, James Greenhalgh wrote:

On Mon, Nov 16, 2015 at 10:49:11AM +, Andre Vieira wrote:

Hi,

   This patch changes the target support mechanism to make it
recognize any ARM 'M' profile as a non-neon supporting target. The
current check only tests for armv6 architectures and earlier, and
does not account for armv7-m.

   This is correct because there is no 'M' profile that supports neon
and the current test is not sufficient to exclude armv7-m.

   Tested by running regressions for this testcase for various ARM targets.

   Is this OK to commit?

   Thanks,
   Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

 * gcc/testsuite/lib/target-supports.exp
   (check_effective_target_arm_neon_ok_nocache): Added check
   for M profile.



 From 2c53bb9ba3236919ecf137a4887abf26d4f7fda2 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 13 Nov 2015 11:16:34 +0000
Subject: [PATCH] Disable neon testing for armv7-m

---
  gcc/testsuite/lib/target-supports.exp | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 
75d506829221e3d02d454631c4bd2acd1a8cedf2..8097a4621b088a93d58d09571cf7aa27b8d5fba6
 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -2854,7 +2854,7 @@ proc check_effective_target_arm_neon_ok_nocache { } {
int dummy;
/* Avoid the case where a test adds -mfpu=neon, but the 
toolchain is
   configured for -mcpu=arm926ej-s, for example.  */
-   #if __ARM_ARCH < 7
+   #if __ARM_ARCH < 7 || __ARM_ARCH_PROFILE == 'M'
#error Architecture too old for NEON.


Could you fix this #error message while you're here?

Why we can't change this test to look for the __ARM_NEON macro from ACLE:

#if __ARM_NEON < 1
   #error NEON is not enabled
#endif

Thanks,
James



There is a check for this already: 'check_effective_target_arm_neon'. I 
think the idea behind arm_neon_ok is to check whether the hardware would 
support neon, whereas arm_neon is to check whether neon was enabled, 
i.e. -mfpu=neon was used or an -mcpu was passed that has neon enabled by 
default.


The comments for 'check_effective_target_arm_neon_ok_nocache' highlight 
this, though maybe the comments for check_effective_target_arm_neon 
could be better.


# Return 1 if this is an ARM target supporting -mfpu=neon
# -mfloat-abi=softfp or equivalent options.  Some multilibs may be
# incompatible with these options.  Also set et_arm_neon_flags to the
# best options to add.

proc check_effective_target_arm_neon_ok_nocache
...
/* Avoid the case where a test adds -mfpu=neon, but the toolchain is
   configured for -mcpu=arm926ej-s, for example.  */
...


and

# Return 1 if this is a ARM target with NEON enabled.

proc check_effective_target_arm_neon
...

Cheers,
Andre



Re: [PATCH][GCC] Make stackalign test LTO proof

2015-11-16 Thread Andre Vieira

On 16/11/15 13:33, Richard Biener wrote:

On Mon, Nov 16, 2015 at 12:43 PM, Andre Vieira
 wrote:

On 13/11/15 10:34, Richard Biener wrote:


On Thu, Nov 12, 2015 at 4:07 PM, Andre Vieira
 wrote:


Hi,

This patch changes this testcase to make sure LTO will not optimize
away
the assignment of the local array to a global variable which was
introduced
to make sure stack space was made available for the test to work.

This is correct because LTO is supposed to optimize this global away
as at
link time it knows this global will never be read. By adding a read of
the
global, LTO will no longer optimize it away.



But that's only because we can't see that bar doesn't clobber it, else
we would optimize away the check and get here again.  Much better
to mark 'dummy' with __attribute__((used)) and go away with 'g' entirely.

Richard.


Tested by running regressions for this testcase for various ARM
targets.

Is this OK to commit?

Thanks,
Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

  * gcc.dg/torture/stackalign/builtin-return-1.c: Added read
to global such that a write is not optimized away by LTO.




Hi Richard,

That would be great but __attribute__ ((used)) can only be used for static
variables and making dummy static would defeat the purpose here.


I see.  What about volatile?


Cheers,
Andre



Yeah, a 'volatile char dummy[64];' with a read afterwards leads to the 
stack still being reserved after LTO and we won't need the global variable.


If you prefer that solution I'll respin the patch.

Cheers,
Andre



Re: [PATCH][GCC] Make stackalign test LTO proof

2015-11-16 Thread Andre Vieira

On 16/11/15 15:34, Joern Wolfgang Rennecke wrote:

I just happened to stumble on this problem with another port.
The volatile & test solution doesn't work, though.

What does work, however, is:

__asm__ ("" : : "" (dummy));


I can confirm that Joern's solution works for me too.
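For reference, a minimal sketch of why the empty asm works (the operand
list makes 'dummy' an input the optimizers cannot analyse, so the array
and the stack space it occupies must be kept):

```c
int
main (void)
{
  char dummy[64];               /* stack space the test depends on  */

  /* An asm with no template but with 'dummy' as an operand acts as an
     opaque use: neither regular optimization nor LTO may elide it.  */
  __asm__ ("" : : "" (dummy));

  return 0;
}
```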



[embedded-5-branch][PATCH 0/2] Backporting algorithmic optimization and testcase change

2015-11-17 Thread Andre Vieira
This series is aimed at backporting algorithmic optimizations and a 
change to a test it affects from trunk to the embedded-5-branch.


Andre Vieira (2):
Backporting algorithmic optimization in match and simplify
Backporting fix for PR-67948.



[embedded-5-branch][PATCH 1/2] Backporting algorithmic optimization in match and simplify

2015-11-17 Thread Andre Vieira

New algorithmic optimisations:
((X inner_op C0) outer_op C1)
   With X being a tree where value_range has reasoned certain bits to
   always be zero throughout its computed value range; we will call this
   the zero_mask,

   and with inner_op = {|,^}, outer_op = {|,^} and inner_op != outer_op.
   if (inner_op == '^') C0 &= ~C1;
   if ((C0 & ~zero_mask) == 0) then emit (X outer_op (C0 outer_op C1))
   if ((C1 & ~zero_mask) == 0) then emit (X inner_op (C0 outer_op C1))

(X {&,^,|} C2) << C1 -> (X << C1) {&,^,|} (C2 << C1)
(X {&,^,|} C2) >> C1 -> (X >> C1) {&,^,|} (C2 >> C1)

The second and third transformations enable more applications of the 
first. Also some targets may benefit from delaying shift operations. I 
am aware that such an optimization, in combination with one or more 
optimizations that cause the reverse transformation, may lead to an 
infinite loop, though such behavior has not been detected during 
regression testing and bootstrapping on aarch64.


This patch backports the optimization from trunk to the embedded-5-branch.

The original patches are at:
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01493.html (for the first 
optimization and changes to second and third)
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg00517.html (the addition 
of the original second and third optimizations)


gcc/ChangeLog:

2015-11-27  Andre Vieira  
Backport from mainline:

  2015-10-09  Andre Vieira  

* match.pd: ((X inner_op C0) outer_op C1) New pattern.
((X & C2) << C1): Expand to...
(X {&,^,|} C2 << C1): ...This.
((X & C2) >> C1): Expand to...
(X {&,^,|} C2 >> C1): ...This.
  2015-07-07  Richard Biener  

* fold-const.c (fold_binary_loc): Move
(X & C2) << C1 -> (X << C1) & (C2 << C1) simplification ...
* match.pd: ... here.


gcc/testsuite/ChangeLog:

2015-11-27  Andre Vieira  
Backport from mainline:

  2015-10-09  Andre Vieira  
Hale Wang  

* gcc.dg/tree-ssa/forwprop-33.c: New.
From ac76659583ab0d9eb96552a540a22e66fdd430f7 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 23 Oct 2015 15:35:27 +0100
Subject: [PATCH 1/2] backport algorithmic optimization

---
 gcc/fold-const.c| 21 -
 gcc/match.pd| 54 ++
 gcc/testsuite/gcc.dg/tree-ssa/forwprop-33.c | 71 +
 3 files changed, 125 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-33.c

diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index 6d085b185895f715d99a53a9e706295aa983d52a..5e0477f974173627317a25e018c9a6e098b099fb 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -12080,27 +12080,6 @@ fold_binary_loc (location_t loc,
 			 prec) == 0)
 	return TREE_OPERAND (arg0, 0);
 
-  /* Fold (X & C2) << C1 into (X << C1) & (C2 << C1)
-	  (X & C2) >> C1 into (X >> C1) & (C2 >> C1)
-	 if the latter can be further optimized.  */
-  if ((code == LSHIFT_EXPR || code == RSHIFT_EXPR)
-	  && TREE_CODE (arg0) == BIT_AND_EXPR
-	  && TREE_CODE (arg1) == INTEGER_CST
-	  && TREE_CODE (TREE_OPERAND (arg0, 1)) == INTEGER_CST)
-	{
-	  tree mask = fold_build2_loc (loc, code, type,
-   fold_convert_loc (loc, type,
-		 TREE_OPERAND (arg0, 1)),
-   arg1);
-	  tree shift = fold_build2_loc (loc, code, type,
-fold_convert_loc (loc, type,
-		  TREE_OPERAND (arg0, 0)),
-arg1);
-	  tem = fold_binary_loc (loc, BIT_AND_EXPR, type, shift, mask);
-	  if (tem)
-	return tem;
-	}
-
   return NULL_TREE;
 
 case MIN_EXPR:
diff --git a/gcc/match.pd b/gcc/match.pd
index e40720e130fe0302466969e39ad5803960b942da..bea0237ae3b335c73bbc71f6d3997b1d84a58986 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -389,6 +389,49 @@ along with GCC; see the file COPYING3.  If not see
 	&& (TREE_CODE (@4) != SSA_NAME || has_single_use (@4)))
(bit_xor (bit_and (bit_xor @0 @1) @2) @0)))
 
+/* ((X inner_op C0) outer_op C1)
+   With X being a tree where value_range has reasoned certain bits to always be
+   zero throughout its computed value range,
+   inner_op = {|,^}, outer_op = {|,^} and inner_op != outer_op
+   where zero_mask has 1's for all bits that are sure to be 0 in X
+   and 0's otherwise.
+   if (inner_op == '^') C0 &= ~C1;
+   if ((C0 & ~zero_mask) == 0) then emit (X outer_op (C0 outer_op C1))
+   if ((C1 & ~zero_mask) == 0) then emit (X inner_op (C0 outer_op C1))
+*/
+(for inner_op (bit_ior bit_xor)
+ outer_op (bit_xor bit_ior)
+(simplify
+ (outer_op
+  (inner_op@3 @2 INTEGER_CST@0) INTEGER_CST@1)
+ (with
+  {
+    bool fail = false;
+    wide_int zero_mask_not;
+    wide_int C0;
+    wide_int cst_emit;
+
+if (TREE_C

[embedded-5-branch][PATCH 2/2]Backporting fix for PR-67948.

2015-11-17 Thread Andre Vieira
This patch backports the fix for PR-67948 from trunk to the 
embedded-5-branch.


The original patch is at:
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg02193.html

Tested for Cortex-M3.

Is this OK to commit?

Thanks,
Andre

gcc/testsuite/ChangeLog
2015-10-27  Andre Vieira 
Backport from mainline:
  2015-10-15  Andre Vieira  
   PR testsuite/67948
   * gcc.target/arm/xor-and.c: Check for eor instead of orr.
From 89922547118e716b41ddf6edefb274322193f25c Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Thu, 15 Oct 2015 12:48:26 +0100
Subject: [PATCH] Fix for xor-and.c test

---
 gcc/testsuite/gcc.target/arm/xor-and.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/arm/xor-and.c b/gcc/testsuite/gcc.target/arm/xor-and.c
index 53dff85f8f780fb99a93bbcc24180a3d0d5d3be9..3715530cd7bf9ad8abb24cb21cd51ae3802079e8 100644
--- a/gcc/testsuite/gcc.target/arm/xor-and.c
+++ b/gcc/testsuite/gcc.target/arm/xor-and.c
@@ -10,6 +10,6 @@ unsigned short foo (unsigned short x)
   return x;
 }
 
-/* { dg-final { scan-assembler "orr" } } */
+/* { dg-final { scan-assembler "eor" } } */
 /* { dg-final { scan-assembler-not "mvn" } } */
 /* { dg-final { scan-assembler-not "uxth" } } */
-- 
1.9.1



Re: [PATCH][GCC] Make stackalign test LTO proof

2015-11-17 Thread Andre Vieira

On 17/11/15 12:29, Bernd Schmidt wrote:

On 11/16/2015 04:48 PM, Andre Vieira wrote:

On 16/11/15 15:34, Joern Wolfgang Rennecke wrote:

I just happened to stumble on this problem with another port.
The volatile & test solution doesn't work, though.

What does work, however, is:

__asm__ ("" : : "" (dummy));


I can confirm that Joern's solution works for me too.


Ok to make that change.


Bernd


OK, Joern will you submit a patch for this or shall I?

Cheers,
Andre



Re: [PATCH][GCC][ARM] Disable neon testing for armv7-m

2015-11-18 Thread Andre Vieira

On 17/11/15 10:10, James Greenhalgh wrote:

On Mon, Nov 16, 2015 at 01:15:32PM +, Andre Vieira wrote:

On 16/11/15 12:07, James Greenhalgh wrote:

On Mon, Nov 16, 2015 at 10:49:11AM +, Andre Vieira wrote:

Hi,

   This patch changes the target support mechanism to make it
recognize any ARM 'M' profile as a non-neon supporting target. The
current check only tests for armv6 architectures and earlier, and
does not account for armv7-m.

   This is correct because there is no 'M' profile that supports neon
and the current test is not sufficient to exclude armv7-m.

   Tested by running regressions for this testcase for various ARM targets.

   Is this OK to commit?

   Thanks,
   Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

 * gcc/testsuite/lib/target-supports.exp
   (check_effective_target_arm_neon_ok_nocache): Added check
   for M profile.



 From 2c53bb9ba3236919ecf137a4887abf26d4f7fda2 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 13 Nov 2015 11:16:34 +0000
Subject: [PATCH] Disable neon testing for armv7-m

---
  gcc/testsuite/lib/target-supports.exp | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 
75d506829221e3d02d454631c4bd2acd1a8cedf2..8097a4621b088a93d58d09571cf7aa27b8d5fba6
 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -2854,7 +2854,7 @@ proc check_effective_target_arm_neon_ok_nocache { } {
int dummy;
/* Avoid the case where a test adds -mfpu=neon, but the 
toolchain is
   configured for -mcpu=arm926ej-s, for example.  */
-   #if __ARM_ARCH < 7
+   #if __ARM_ARCH < 7 || __ARM_ARCH_PROFILE == 'M'
#error Architecture too old for NEON.


Could you fix this #error message while you're here?

Why we can't change this test to look for the __ARM_NEON macro from ACLE:

#if __ARM_NEON < 1
   #error NEON is not enabled
#endif

Thanks,
James



There is a check for this already:
'check_effective_target_arm_neon'. I think the idea behind
arm_neon_ok is to check whether the hardware would support neon,
whereas arm_neon is to check whether neon was enabled, i.e.
-mfpu=neon was used or a mcpu was passed that has neon enabled by
default.

The comments for 'check_effective_target_arm_neon_ok_nocache'
highlight this, though maybe the comments for
check_effective_target_arm_neon could be better.

# Return 1 if this is an ARM target supporting -mfpu=neon
# -mfloat-abi=softfp or equivalent options.  Some multilibs may be
# incompatible with these options.  Also set et_arm_neon_flags to the
# best options to add.

proc check_effective_target_arm_neon_ok_nocache
...
/* Avoid the case where a test adds -mfpu=neon, but the toolchain is
configured for -mcpu=arm926ej-s, for example.  */
...


and

# Return 1 if this is a ARM target with NEON enabled.

proc check_effective_target_arm_neon


OK, got it - sorry for my mistake, I had the two procs confused.

I'd still like to see the error message fixed; "Architecture too old for NEON."
is not an accurate description of the problem.

Thanks,
James



This OK?

Cheers,
Andre
From cd58546931221197f83a82afa7ac14291d675d48 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 13 Nov 2015 11:16:34 +0000
Subject: [PATCH] Disable neon testing for armv7-m

---
 gcc/testsuite/lib/target-supports.exp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 75d506829221e3d02d454631c4bd2acd1a8cedf2..91ff64f3d0d6ede50dcdb30cccf43be90e376a44 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -2854,8 +2854,8 @@ proc check_effective_target_arm_neon_ok_nocache { } {
 		int dummy;
 		/* Avoid the case where a test adds -mfpu=neon, but the toolchain is
 		   configured for -mcpu=arm926ej-s, for example.  */
-		#if __ARM_ARCH < 7
-		#error Architecture too old for NEON.
+		#if __ARM_ARCH < 7 || __ARM_ARCH_PROFILE == 'M'
+		#error Architecture does not support NEON.
 		#endif
 	} "$flags"] } {
 		set et_arm_neon_flags $flags
-- 
1.9.1



Re: [PATCH][GCC][ARM] Disable neon testing for armv7-m

2015-11-20 Thread Andre Vieira

Hi Kyrill
On 20/11/15 11:51, Kyrill Tkachov wrote:

Hi Andre,

On 18/11/15 09:44, Andre Vieira wrote:

On 17/11/15 10:10, James Greenhalgh wrote:

On Mon, Nov 16, 2015 at 01:15:32PM +, Andre Vieira wrote:

On 16/11/15 12:07, James Greenhalgh wrote:

On Mon, Nov 16, 2015 at 10:49:11AM +, Andre Vieira wrote:

Hi,

   This patch changes the target support mechanism to make it
recognize any ARM 'M' profile as a non-neon supporting target. The
current check only tests for armv6 architectures and earlier, and
does not account for armv7-m.

   This is correct because there is no 'M' profile that supports neon
and the current test is not sufficient to exclude armv7-m.

   Tested by running regressions for this testcase for various ARM
targets.

   Is this OK to commit?

   Thanks,
   Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira 

 * gcc/testsuite/lib/target-supports.exp
   (check_effective_target_arm_neon_ok_nocache): Added check
   for M profile.



 From 2c53bb9ba3236919ecf137a4887abf26d4f7fda2 Mon Sep 17 00:00:00
2001
From: Andre Simoes Dias Vieira 
Date: Fri, 13 Nov 2015 11:16:34 +0000
Subject: [PATCH] Disable neon testing for armv7-m

---
  gcc/testsuite/lib/target-supports.exp | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/lib/target-supports.exp
b/gcc/testsuite/lib/target-supports.exp
index
75d506829221e3d02d454631c4bd2acd1a8cedf2..8097a4621b088a93d58d09571cf7aa27b8d5fba6
100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -2854,7 +2854,7 @@ proc
check_effective_target_arm_neon_ok_nocache { } {
  int dummy;
  /* Avoid the case where a test adds -mfpu=neon, but the
toolchain is
 configured for -mcpu=arm926ej-s, for example.  */
-#if __ARM_ARCH < 7
+#if __ARM_ARCH < 7 || __ARM_ARCH_PROFILE == 'M'
  #error Architecture too old for NEON.


Could you fix this #error message while you're here?

Why we can't change this test to look for the __ARM_NEON macro from
ACLE:

#if __ARM_NEON < 1
   #error NEON is not enabled
#endif

Thanks,
James



There is a check for this already:
'check_effective_target_arm_neon'. I think the idea behind
arm_neon_ok is to check whether the hardware would support neon,
whereas arm_neon is to check whether neon was enabled, i.e.
-mfpu=neon was used or a mcpu was passed that has neon enabled by
default.

The comments for 'check_effective_target_arm_neon_ok_nocache'
highlight this, though maybe the comments for
check_effective_target_arm_neon could be better.

# Return 1 if this is an ARM target supporting -mfpu=neon
# -mfloat-abi=softfp or equivalent options.  Some multilibs may be
# incompatible with these options.  Also set et_arm_neon_flags to the
# best options to add.

proc check_effective_target_arm_neon_ok_nocache
...
/* Avoid the case where a test adds -mfpu=neon, but the toolchain is
configured for -mcpu=arm926ej-s, for example.  */
...


and

# Return 1 if this is a ARM target with NEON enabled.

proc check_effective_target_arm_neon


OK, got it - sorry for my mistake, I had the two procs confused.

I'd still like to see the error message fixed "Architecture too old
for NEON."
is not an accurate description of the problem.

Thanks,
James



This OK?



This is ok,
I've committed for you with the slightly tweaked ChangeLog entry:
2015-11-20  Andre Vieira  

 * lib/target-supports.exp
 (check_effective_target_arm_neon_ok_nocache): Add check
 for M profile.

as r230653.

Thanks,
Kyrill



Cheers,
Andre




Thank you. Would there be any objections to backporting this to 
gcc-5-branch? I checked, it applies cleanly and it's a simple enough way 
of preventing a lot of FAILs for armv7-m.


Best Regards,
Andre



Re: [Ping][PATCH][GCC][ARM] testcase memset-inline-10.c uses -mfloat-abi=hard but does not check whether target supports it

2015-11-27 Thread Andre Vieira

On 12/11/15 15:16, Andre Vieira wrote:

On 12/11/15 15:08, Andre Vieira wrote:

Hi,

   This patch changes the memset-inline-10.c testcase to make sure that
it is only compiled for ARM targets that support -mfloat-abi=hard using
the fact that all non-thumb1 targets do.

   This is correct because all targets for which -mthumb causes the
compiler to use thumb2 will support the generation of FP instructions.

   Tested by running regressions for this testcase for various ARM
targets.

   Is this OK to commit?

   Thanks,
   Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

 * gcc.target/arm/memset-inline-10.c: Added
 dg-require-effective-target arm_thumb2_ok.


Now with attachment, sorry about that.

Cheers,
Andre


Ping.



Re: [PATCH][GCC] Make stackalign test LTO proof

2015-12-04 Thread Andre Vieira

On 17/11/15 16:30, Andre Vieira wrote:

On 17/11/15 12:29, Bernd Schmidt wrote:

On 11/16/2015 04:48 PM, Andre Vieira wrote:

On 16/11/15 15:34, Joern Wolfgang Rennecke wrote:

I just happened to stumble on this problem with another port.
The volatile & test solution doesn't work, though.

What does work, however, is:

__asm__ ("" : : "" (dummy));


I can confirm that Joern's solution works for me too.


Ok to make that change.


Bernd


OK, Joern will you submit a patch for this or shall I?

Cheers,
Andre


Hi,

Reworked following Joern's suggestion.

Is this OK?

Cheers,
Andre

gcc/testsuite/ChangeLog:
2015-12-04  Andre Vieira  
Joern Rennecke  

* gcc.dg/torture/stackalign/builtin-return-1.c: Add an
  inline assembly read to make sure dummy is not optimized
  away by LTO.
From 9cee0251e69c1061a60caaf28da598eab2a1fbee Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Tue, 24 Nov 2015 13:50:15 +0000
Subject: [PATCH] Fixed test to be LTO proof.

---
 gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c b/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c
index af017532aeb3878ef7ad717a2743661a87a56b7d..ec4fd8a9ef33a5e755bdb33e4faa41cab0f16a60 100644
--- a/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c
+++ b/gcc/testsuite/gcc.dg/torture/stackalign/builtin-return-1.c
@@ -26,15 +26,13 @@ int bar(int n)
    STACK_ARGUMENTS_SIZE));
 }
 
-char *g;
-
 int main(void)
 {
   /* Allocate 64 bytes on the stack to make sure that __builtin_apply
  can read at least 64 bytes above the return address.  */
   char dummy[64];
 
-  g = dummy;
+  __asm__ ("" : : "" (dummy));
 
   if (bar(1) != 2)
 abort();
-- 
1.9.1



[PATCH][AARCH64]Fix for branch offsets over 1 MiB

2015-08-25 Thread Andre Vieira
Conditional branches have a maximum range of [-1048576, 1048572]. Any 
destination further away can not be reached by these.
To be able to have conditional branches in very large functions, we 
invert the condition and change the destination to jump over an 
unconditional branch to the original, far away, destination.
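In C terms the rewrite is just an inverted test over a short hop (a
sketch with illustrative labels, not the emitted assembly):

```c
/* A conditional branch only reaches +/-1 MiB, but an unconditional B
   reaches +/-128 MiB, so the condition is inverted to hop over one.  */
int
sketch (int cond)
{
  if (!cond)          /* b<inverted-cond> .Lbcond  */
    goto Lbcond;
  goto far_away;      /* b far_away (long-range, unconditional)  */
Lbcond:
  return 0;
far_away:
  return 1;
}
```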


gcc/ChangeLog:
2015-08-07  Ramana Radhakrishnan  
Andre Vieira  

* config/aarch64/aarch64.md (*condjump): Handle functions > 1 MiB.
(*cb1): Idem.
(*tb1): Idem.
(*cb1): Idem.
* config/aarch64/iterators.md (inv_cb): New code attribute.
(inv_tb): Idem.
* config/aarch64/aarch64.c (aarch64_gen_far_branch): New.
* config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New.

gcc/testsuite/ChangeLog:
2015-08-07  Andre Vieira  

* gcc.target/aarch64/long-branch.c: New test.
From 9759c5a50c44b0421c7911014e63a6222dd9017d Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Fri, 14 Aug 2015 10:21:57 +0100
Subject: [PATCH] fix for far branches

---
 gcc/config/aarch64/aarch64-protos.h|   1 +
 gcc/config/aarch64/aarch64.c   |  23 +
 gcc/config/aarch64/aarch64.md  |  89 +++-
 gcc/config/aarch64/iterators.md|   6 +
 gcc/testsuite/gcc.target/aarch64/long_branch.c | 565 +
 5 files changed, 669 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/long_branch.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 32b5d0958a6e0b2356874736f858f007fe68cdda..87a26deb6a0dbf13e25275baeebec21a37a42f41 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -316,6 +316,7 @@ unsigned aarch64_trampoline_size (void);
 void aarch64_asm_output_labelref (FILE *, const char *);
 void aarch64_cpu_cpp_builtins (cpp_reader *);
 void aarch64_elf_asm_named_section (const char *, unsigned, tree);
+const char * aarch64_gen_far_branch (rtx *, int, const char *, const char *);
 void aarch64_err_no_fpadvsimd (machine_mode, const char *);
 void aarch64_expand_epilogue (bool);
 void aarch64_expand_mov_immediate (rtx, rtx);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 7159f5aca5df97f154b3e654f60af9136354f335..3b491a232d81892b6511bad84e4174d939fa1be7 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -587,6 +587,29 @@ static const char * const aarch64_condition_codes[] =
   "hi", "ls", "ge", "lt", "gt", "le", "al", "nv"
 };
 
+/* Generate code to enable conditional branches in functions over 1 MiB.  */
+const char *
+aarch64_gen_far_branch (rtx * operands, int pos_label, const char * dest,
+			const char * branch_format)
+{
+  rtx_code_label * tmp_label = gen_label_rtx ();
+  char label_buf[256];
+  char buffer[128];
+  ASM_GENERATE_INTERNAL_LABEL (label_buf, dest,
+                               CODE_LABEL_NUMBER (tmp_label));
+  const char *label_ptr = targetm.strip_name_encoding (label_buf);
+  rtx dest_label = operands[pos_label];
+  operands[pos_label] = tmp_label;
+
+  snprintf (buffer, sizeof (buffer), "%s%s", branch_format, label_ptr);
+  output_asm_insn (buffer, operands);
+
+  snprintf (buffer, sizeof (buffer), "b\t%%l%d\n%s:", pos_label, label_ptr);
+  operands[pos_label] = dest_label;
+  output_asm_insn (buffer, operands);
+  return "";
+}
+
 void
 aarch64_err_no_fpadvsimd (machine_mode mode, const char *msg)
 {
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 35255e91a95cdf20d52270470202f7499ba46bb2..74f6e3ec4bdcd076c2ad6d1431102aa50dfb5068 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -181,6 +181,13 @@
 	 (const_string "no")
 	] (const_string "yes")))
 
+;; Attribute that specifies whether we are dealing with a branch to a
+;; label that is far away, i.e. further away than the maximum/minimum
+;; representable in a signed 21-bits number.
+;; 0 :=: no
+;; 1 :=: yes
+(define_attr "far_branch" "" (const_int 0))
+
 ;; ---
 ;; Pipeline descriptions and scheduling
 ;; ---
@@ -308,8 +315,23 @@
 			   (label_ref (match_operand 2 "" ""))
 			   (pc)))]
   ""
-  "b%m0\\t%l2"
-  [(set_attr "type" "branch")]
+  {
+if (get_attr_length (insn) == 8)
+  return aarch64_gen_far_branch (operands, 2, "Lbcond", "b%M0\\t");
+else
+  return  "b%m0\\t%l2";
+  }
+  [(set_attr "type" "branch")
+   (set (attr "length")
+	(if_then_else (and (ge (minus (match_dup 2) (pc)) (const_int -1048576))
+			   (lt (minus (

Re: [PATCH][AARCH64]Fix for branch offsets over 1 MiB

2015-08-25 Thread Andre Vieira

On 25/08/15 10:52, Andrew Pinski wrote:

On Tue, Aug 25, 2015 at 5:50 PM, Andrew Pinski  wrote:

On Tue, Aug 25, 2015 at 5:37 PM, Andre Vieira
 wrote:

Conditional branches have a maximum range of [-1048576, 1048572]. Any
destination further away can not be reached by these.
To be able to have conditional branches in very large functions, we invert
the condition and change the destination to jump over an unconditional
branch to the original, far away, destination.

gcc/ChangeLog:
2015-08-07  Ramana Radhakrishnan  
 Andre Vieira  

 * config/aarch64/aarch64.md (*condjump): Handle functions > 1
 Mib.
 (*cb1): Idem.
 (*tb1): Idem.
 (*cb1): Idem.
 * config/aarch64/iterators.md (inv_cb): New code attribute.
 (inv_tb): Idem.
 * config/aarch64/aarch64.c (aarch64_gen_far_branch): New.
 * config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New.

gcc/testsuite/ChangeLog:
2015-08-07  Andre Vieira  

 * gcc.target/aarch64/long-branch.c: New test.


Just a few comments about the testcase.  You could improve the size
(on disk) of the testcase by using the preprocessor some more:
Something like:
#define CASE_ENTRY2 (x) CASE_ENTRY ((x)) CASE_ENTRY ((x)+1)
#define CASE_ENTRY4 (x) CASE_ENTRY2 ((x)) CASE_ENTRY2 ((x)+2+1)
#define CASE_ENTRY8 (x) CASE_ENTRY4 ((x)) CASE_ENTRY4 ((x)+4+1)
#define CASE_ENTRY16 (x) CASE_ENTRY8 ((x)) CASE_ENTRY8 ((x)+8+1)
#define CASE_ENTRY32 (x) CASE_ENTRY16 ((x)) CASE_ENTRY16 ((x)+16)
#define CASE_ENTRY64 (x) CASE_ENTRY32 ((x)) CASE_ENTRY32 ((x)+32+1)
#define CASE_ENTRY128 (x) CASE_ENTRY64 ((x)) CASE_ENTRY16 ((x)+64+1)
#define CASE_ENTRY256 (x) CASE_ENTRY128 ((x)) CASE_ENTRY128 ((x)+128+1)



I do have an off by one error but you should get the idea.  Basically
instead of 200 lines, we only have 9 lines (log2(256) == 8).

Thanks,
Andrew



And then use
CASE_ENTRY256 (1)

You can do the same trick to reduce the size of CASE_ENTRY too.

Thanks,
Andrew Pinski
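For reference, a corrected sketch of the doubling scheme (the CASE_ENTRY
body and the switch driver are hypothetical; the start offsets fix the
off-by-ones mentioned above):

```c
/* Each CASE_ENTRYn (x) expands to n consecutive case labels starting
   at x, so log2(n) one-line definitions replace n written-out cases.  */
#define CASE_ENTRY(x)    case (x): counter += (x); break;
#define CASE_ENTRY2(x)   CASE_ENTRY (x)    CASE_ENTRY ((x) + 1)
#define CASE_ENTRY4(x)   CASE_ENTRY2 (x)   CASE_ENTRY2 ((x) + 2)
#define CASE_ENTRY8(x)   CASE_ENTRY4 (x)   CASE_ENTRY4 ((x) + 4)
#define CASE_ENTRY16(x)  CASE_ENTRY8 (x)   CASE_ENTRY8 ((x) + 8)
#define CASE_ENTRY32(x)  CASE_ENTRY16 (x)  CASE_ENTRY16 ((x) + 16)
#define CASE_ENTRY64(x)  CASE_ENTRY32 (x)  CASE_ENTRY32 ((x) + 32)
#define CASE_ENTRY128(x) CASE_ENTRY64 (x)  CASE_ENTRY64 ((x) + 64)
#define CASE_ENTRY256(x) CASE_ENTRY128 (x) CASE_ENTRY128 ((x) + 128)

long counter;

void
dispatch (int i)
{
  switch (i)
    {
      CASE_ENTRY256 (1)   /* expands to cases 1 .. 256  */
    }
}
```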




Conditional branches have a maximum range of [-1048576, 1048572]. Any 
destination further away can not be reached by these.
To be able to have conditional branches in very large functions, we 
invert the condition and change the destination to jump over an 
unconditional branch to the original, far away, destination.


gcc/ChangeLog:
2015-08-07  Ramana Radhakrishnan  
    Andre Vieira  

* config/aarch64/aarch64.md (*condjump): Handle functions > 1 Mib.
(*cb1): Likewise.
(*tb1): Likewise.
(*cb1): Likewise.
* config/aarch64/iterators.md (inv_cb): New code attribute.
(inv_tb): Likewise.
* config/aarch64/aarch64.c (aarch64_gen_far_branch): New.
* config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New.

gcc/testsuite/ChangeLog:
2015-08-07  Andre Vieira  

* gcc.target/aarch64/long_branch_1.c: New test.
From e34022ecd6f914b5a713594ca5b21b33929a3a1f Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Tue, 25 Aug 2015 13:12:11 +0100
Subject: [PATCH] fix for far branches

---
 gcc/config/aarch64/aarch64-protos.h  |  1 +
 gcc/config/aarch64/aarch64.c | 23 ++
 gcc/config/aarch64/aarch64.md| 89 +++
 gcc/config/aarch64/iterators.md  |  6 ++
 gcc/testsuite/gcc.target/aarch64/long_branch_1.c | 91 
 5 files changed, 195 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/long_branch_1.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 4b3cbedbd0a5fa186619e05c0c0b400c8257b1c0..9afb7ef9afadf2b3dfeb24db230829344201deba 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -322,6 +322,7 @@ unsigned aarch64_trampoline_size (void);
 void aarch64_asm_output_labelref (FILE *, const char *);
 void aarch64_cpu_cpp_builtins (cpp_reader *);
 void aarch64_elf_asm_named_section (const char *, unsigned, tree);
+const char * aarch64_gen_far_branch (rtx *, int, const char *, const char *);
 void aarch64_err_no_fpadvsimd (machine_mode, const char *);
 void aarch64_expand_epilogue (bool);
 void aarch64_expand_mov_immediate (rtx, rtx);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 87bbf6e7988e4ef796c09075ee584822483cbbce..188d0dd555d3d765aff7e78623a4e938497bec3f 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -586,6 +586,29 @@ static const char * const aarch64_condition_codes[] =
   "hi", "ls", "ge", "lt", "gt", "le", "al", "nv"
 };
 
+/* Generate code to enable conditional branches in functions over 1 MiB.  */
+const char *
+aarch64_gen_far_branch (rtx * operands, int pos_label, const char * dest,
+			const char * branch_format)
+{
+rtx_code_label * tmp_label = gen_label_rtx ();
+char la

[PATCH][GCC] Algorithmic optimization in match and simplify

2015-08-28 Thread Andre Vieira

Two new algorithmic optimisations:
  1. (((X op0 C0) op1 C1) op2 C2)
with op0 = {&, >>, <<}, op1 = {|,^}, op2 = {|,^} and op1 != op2
zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
and 0's otherwise.
if (op1 == '^') C1 &= ~C2 (Only changed if actually emitted)
if ((C1 & ~zero_mask) == 0) then emit (X op0 C0) op2 (C1 op2 C2)
if ((C2 & ~zero_mask) == 0) then emit (X op0 C0) op1 (C1 op2 C2)
  2. (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)


This patch does two algorithmic optimisations that target patterns like:
(((x << 24) | 0x00FFFFFF) ^ 0xFF000000) and ((x ^ 0x4002) >> 1) | 0x8000.


The transformation uses the knowledge of which bits are zero after 
operations like (X {&,<<,(unsigned)>>}) to combine constants, reducing 
run-time operations.
The two examples above would be transformed into (X << 24) ^ 0xFFFFFFFF 
and (X >> 1) ^ 0xa001 respectively.
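A testcase in the spirit of the new forwprop-33.c (the dg- directives
here are illustrative; the function matches the snippet discussed later
in the dg-final placement thread, and 0xa001 is 40961 in decimal):

```c
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-forwprop1" } */

unsigned short
foo (unsigned short a)
{
  a ^= 0x4002;
  a >>= 1;
  a |= 0x8000; /* Simplify to ((a >> 1) ^ 0xa001).  */
  return a;
}

/* { dg-final { scan-tree-dump "\\^ 40961" "forwprop1" } } */
```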



gcc/ChangeLog:

2015-08-03  Andre Vieira  

  * match.pd: Added new patterns:
(((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
(X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)

gcc/testsuite/ChangeLog:

2015-08-03  Andre Vieira  

  * gcc.dg/tree-ssa/forwprop-33.c: New test.
From 15f86df5b3561edf26ae79cedbe160fd46596fd9 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Wed, 26 Aug 2015 16:27:31 +0100
Subject: [PATCH] algorithmic optimization

---
 gcc/match.pd| 70 +
 gcc/testsuite/gcc.dg/tree-ssa/forwprop-33.c | 42 +
 2 files changed, 112 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/forwprop-33.c

diff --git a/gcc/match.pd b/gcc/match.pd
index eb0ba9d10a9b8ca66c23c56da0678477379daf80..3d9a8f52713e8dfb2189aad76bce709c924fa286 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -708,6 +708,76 @@ along with GCC; see the file COPYING3.  If not see
   && tree_nop_conversion_p (type, TREE_TYPE (@1)))
   (convert (bit_and (bit_not @1) @0
 
+/* (X bit_op C0) rshift C1 -> (X rshift C1) bit_op (C0 rshift C1) */
+(for bit_op (bit_ior bit_xor bit_and)
+(simplify
+ (rshift (bit_op:c @0 INTEGER_CST@1) INTEGER_CST@2)
+ (bit_op
+  (rshift @0 @2)
+  { wide_int_to_tree (type, wi::rshift (@1, @2, TYPE_SIGN (type))); })))
+
+/* (X bit_op C0) lshift C1 -> (X lshift C1) bit_op (C0 lshift C1) */
+(for bit_op (bit_ior bit_xor bit_and)
+(simplify
+ (lshift (bit_op:c @0 INTEGER_CST@1) INTEGER_CST@2)
+ (bit_op
+  (lshift @0 @2)
+  { wide_int_to_tree (type, wi::lshift (@1, @2)); })))
+
+
+/* (((X op0 C0) op1 C1) op2 C2)
+with op0 = {&, >>, <<}, op1 = {|,^}, op2 = {|,^} and op1 != op2
+zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
+and 0's otherwise.
+if (op1 == '^') C1 &= ~C2;
+if ((C1 & ~zero_mask) == 0) then emit (X op0 C0) op2 (C1 op2 C2)
+if ((C2 & ~zero_mask) == 0) then emit (X op0 C0) op1 (C1 op2 C2)
+*/
+(for op0 (rshift rshift lshift lshift bit_and bit_and)
+ op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
+ op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)
+(simplify
+ (op2:c
+  (op1:c
+   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)
+ (with
+  {
+    unsigned int prec = TYPE_PRECISION (type);
+    wide_int zero_mask = wi::zero (prec);
+    wide_int C0 = wide_int_storage (@1);
+    wide_int C1 = wide_int_storage (@2);
+    wide_int C2 = wide_int_storage (@3);
+    wide_int cst_emit;
+
+    if (op0 == BIT_AND_EXPR)
+      {
+	zero_mask = wide_int_storage (wi::neg (@1));
+      }
+    else if (op0 == LSHIFT_EXPR && wi::fits_uhwi_p (@1))
+      {
+	zero_mask = wide_int_storage (wi::mask (C0.to_uhwi (), false, prec));
+      }
+    else if (op0 == RSHIFT_EXPR && TYPE_UNSIGNED (type) && wi::fits_uhwi_p (@1))
+      {
+	unsigned HOST_WIDE_INT m = prec - C0.to_uhwi ();
+	zero_mask = wide_int_storage (wi::mask (m, true, prec));
+      }
+
+    if (op1 == BIT_XOR_EXPR)
+      {
+	C1 = wi::bit_and_not (C1,C2);
+	cst_emit = wi::bit_or (C1, C2);
+      }
+    else
+      {
+	cst_emit = wi::bit_xor (C1, C2);
+      }
+  }
+  (if ((C1 & ~zero_mask) == 0)
+   (op2 (op0 @0 @1) { wide_int_to_tree (type, cst_emit); })
+   (if ((C2 & ~zero_mask) == 0)
+(op1 (op0 @0 @1) { wide_int_to_tree (type, cst_emit); }))
+
 /* Associate (p +p off1) +p off2 as (p +p (off1 + off2)).  */
 (simplify
   (pointer_plus (pointer_plus:s @0 @1) @3)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/forwprop-33.c b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-33.c
new file mode 100644
index ..984d8b37a01defe0e6852070a7dfa7ace5027c51
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/forwprop-33.c
@@ -0,0 +1,42 @@
+/* { dg-do co

Re: [PATCH][GCC] Algorithmic optimization in match and simplify

2015-09-01 Thread Andre Vieira

Hi Marc,

On 28/08/15 19:07, Marc Glisse wrote:

(not a review, I haven't even read the whole patch)

On Fri, 28 Aug 2015, Andre Vieira wrote:


2015-08-03  Andre Vieira  

  * match.pd: Added new patterns:
((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
(X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)


+(for op0 (rshift rshift lshift lshift bit_and bit_and)
+ op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
+ op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)

You can nest for-loops, it seems clearer as:
(for op0 (rshift lshift bit_and)
   (for op1 (bit_ior bit_xor)
op2 (bit_xor bit_ior)


Will do, thank you for pointing it out.


+(simplify
+ (op2:c
+  (op1:c
+   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)

I suspect you will want more :s (single_use) and less :c (canonicalization
should put constants in second position).

I can't find the definition of :s (single_use). GCC internals do point 
out that canonicalization does put constants in the second position; I 
didn't see that at first. Thank you for pointing it out.



+   C1 = wi::bit_and_not (C1,C2);

Space after ','.


Will do.


Having wide_int_storage in many places is surprising, I can't find similar
code anywhere else in gcc.




I tried looking for examples of something similar, I think I ended up 
using wide_int because I was able to convert easily to and from it and 
it has the "mask" and "wide_int_to_tree" functions. I welcome any 
suggestions on what I should be using here for integer constant 
transformations and comparisons.


BR,
Andre



Re: Location of "dg-final" directives? (was Re: [PATCH][GCC] Algorithmic optimization in match and simplify)

2015-09-02 Thread Andre Vieira



On 01/09/15 17:54, Marek Polacek wrote:

On Tue, Sep 01, 2015 at 12:50:27PM -0400, David Malcolm wrote:

I can't comment on the patch itself, but I noticed that in the testsuite
addition, you've gathered all the "dg-final" clauses at the end.

I think that this is consistent with existing practice in gcc, but
AFAIK, the dg-final clauses can appear anywhere in the file.

Would it make test cases more readable if we put the dg-final clauses
next to the code that they relate to?  Or, at least after the function
that they relate to?

For example:

   unsigned short
   foo (unsigned short a)
   {
 a ^= 0x4002;
 a >>= 1;
 a |= 0x8000; /* Simplify to ((a >> 1) ^ 0xa001).  */
 return a;
   }
   /* { dg-final { scan-tree-dump "\\^ 40961" "forwprop1" } } */

   unsigned short
   bar (unsigned short a)
   {
   /* snip */ /* Simplify to ((a << 1) | 0x8005).  */
   }
   /* { dg-final { scan-tree-dump "\\| 32773" "forwprop1" } } */

   ... etc ...

I think this may be more readable than the "group them all at the end of
the file approach", especially with testcases with many files and
dg-final directives.


Yeah, it's probably somewhat more readable.  Same for dg-output.  E.g. some
ubsan tests already use dg-outputs after functions they relate to.

Marek



If no one objects, I'll make those changes too. Sounds reasonable to me.

Andre



Re: [PATCH v2][GCC] Algorithmic optimization in match and simplify

2015-09-03 Thread Andre Vieira

On 01/09/15 15:01, Richard Biener wrote:

On Tue, Sep 1, 2015 at 3:40 PM, Andre Vieira
 wrote:

Hi Marc,

On 28/08/15 19:07, Marc Glisse wrote:


(not a review, I haven't even read the whole patch)

On Fri, 28 Aug 2015, Andre Vieira wrote:


2015-08-03  Andre Vieira  

   * match.pd: Added new patterns:
 ((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
 (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)



+(for op0 (rshift rshift lshift lshift bit_and bit_and)
+ op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
+ op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)

You can nest for-loops, it seems clearer as:
(for op0 (rshift lshift bit_and)
(for op1 (bit_ior bit_xor)
 op2 (bit_xor bit_ior)



Will do, thank you for pointing it out.



+(simplify
+ (op2:c
+  (op1:c
+   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)

I suspect you will want more :s (single_use) and less :c (canonicalization
should put constants in second position).


I can't find the definition of :s (single_use).


Sorry for that - didn't get along updating it yet :/  It restricts matching to
sub-expressions that have a single-use.  So

+  a &= 0xd123;
+  unsigned short tem = a ^ 0x6040;
+  a = tem | 0xc031; /* Simplify _not_ to ((a & 0xd123) | 0xe071).  */
... use of tem ...

we shouldn't do the simplification here because the expression
(a & 0xd123) ^ 0x6040 is kept live by 'tem'.
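Spelled out as a full function (a hypothetical example in the same
spirit):

```c
unsigned short g;

unsigned short
foo (unsigned short a)
{
  a &= 0xd123;
  unsigned short tem = a ^ 0x6040;
  g = tem;               /* second use keeps (a & 0xd123) ^ 0x6040 live  */
  return tem | 0xc031;   /* combining constants here saves nothing       */
}
```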


GCC internals do point out
that canonicalization does put constants in the second position, didnt see
that first. Thank you for pointing it out.


+   C1 = wi::bit_and_not (C1,C2);

Space after ','.


Will do.


Having wide_int_storage in many places is surprising, I can't find similar
code anywhere else in gcc.




I tried looking for examples of something similar, I think I ended up using
wide_int because I was able to convert easily to and from it and it has the
"mask" and "wide_int_to_tree" functions. I welcome any suggestions on what I
should be using here for integer constant transformations and comparisons.


Using wide-ints is fine, but you shouldn't need 'wide_int_storage'
constructors - those
are indeed odd.  Is it just for the initializers of wide-ints?

+wide_int zero_mask = wi::zero (prec);
+wide_int C0 = wide_int_storage (@1);
+wide_int C1 = wide_int_storage (@2);
+wide_int C2 = wide_int_storage (@3);
...
+   zero_mask = wide_int_storage (wi::mask (C0.to_uhwi (), false, prec));

tree_to_uhwi (@1) should do the trick as well

+   C1 = wi::bit_and_not (C1,C2);
+   cst_emit = wi::bit_or (C1, C2);

the ops should be replaceable with @2 and @3, the store to C1 obviously not
(but you can use a tree temporary and use wide_int_to_tree here to avoid
the back-and-forth for the case where C1 is not assigned to).

Note that transforms only doing association are prone to endless recursion
in case some other pattern does the reverse op...

Richard.



BR,
Andre




Thank you for all the comments, see reworked version:

Two new algorithmic optimisations:
  1. (((X op0 C0) op1 C1) op2 C2)
     with op0 = {&, >>, <<}, op1 = {|, ^}, op2 = {|, ^} and op1 != op2
     zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
     and 0's otherwise.
     if (op1 == '^') C1 &= ~C2 (only changed if actually emitted)
     if ((C1 & ~zero_mask) == 0) then emit (X op0 C0) op2 (C1 op2 C2)
     if ((C2 & ~zero_mask) == 0) then emit (X op0 C0) op1 (C1 op2 C2)
  2. (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)


This patch does two algorithmic optimisations that target patterns like:
(((x << 24) | 0x00FF) ^ 0xFF00) and ((x ^ 0x4002) >> 1) | 0x8000.

The transformation uses the knowledge of which bits are zero after
operations like (X {&,<<,(unsigned)>>} C0) to combine constants, reducing
run-time operations.  The two examples above would be transformed into
(X << 24) ^ 0xFFFF and (X >> 1) ^ 0xa001 respectively.
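
To make rule 1 concrete, here is a hand-written C sketch of the before/after
shapes for the first example (illustrative only; `before`/`after` are
hypothetical names, not one of the patch's testcases):

```c
/* Before: three bitwise operations at run time.  For x << 24 the low
   24 bits are known to be zero (zero_mask = 0x00ffffff), so both
   constants fall entirely into known-zero bits.  */
unsigned int
before (unsigned int x)
{
  return ((x << 24) | 0x00FF) ^ 0xFF00;
}

/* After: the constants are combined at compile time,
   0x00FF ^ 0xFF00 == 0xFFFF, leaving two operations.  */
unsigned int
after (unsigned int x)
{
  return (x << 24) ^ 0xFFFF;
}
```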


The second transformation enables more applications of the first.  Also,
some targets may benefit from delaying shift operations.  I am aware that
such an optimization, in combination with one or more optimizations that
cause the reverse transformation, may lead to an infinite loop, though
such behavior has not been detected during regression testing and
bootstrapping on aarch64.


gcc/ChangeLog:

2015-08-03  Andre Vieira  

  * match.pd: Added new patterns:
(((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
(X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)

gcc/testsuite/ChangeLog:

2015-08-03  Andre Vieira  
Hale Wang  

  * gcc.dg/tree-ssa/forwprop-33.c: New test.

[gcc-5-branch][PATCH][AARCH64]Fix for branch offsets over 1 MiB

2015-09-11 Thread Andre Vieira
Conditional branches have a maximum range of [-1048576, 1048572].  Any
destination further away cannot be reached by these.
To allow conditional branches in very large functions, we invert the
condition and change the destination to jump over an unconditional branch
to the original, far away, destination.
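
For illustration, the rewritten sequence looks roughly like this (a
hand-written sketch following the patch's "Lbcond" label prefix, not actual
compiler output):

```
	// before: .Lfar is beyond the +/-1 MiB conditional-branch range
	b.eq	.Lfar
	// after: the inverted condition hops over an unconditional
	// branch, which has a +/-128 MiB range
	b.ne	.Lbcond0
	b	.Lfar
.Lbcond0:
```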


This patch backports the fix from trunk to the gcc-5-branch.
The original patch is at:
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg01493.html

gcc/ChangeLog:
2015-09-09  Andre Vieira  

  Backport from mainline:
  2015-08-27  Ramana Radhakrishnan  
  Andre Vieira  

  * config/aarch64/aarch64.md (*condjump): Handle functions > 1 MiB.
  (*cb1): Likewise.
  (*tb1): Likewise.
  (*cb1): Likewise.
  * config/aarch64/iterators.md (inv_cb): New code attribute.
  (inv_tb): Likewise.
  * config/aarch64/aarch64.c (aarch64_gen_far_branch): New.
  * config/aarch64/aarch64-protos.h (aarch64_gen_far_branch): New.


gcc/testsuite/ChangeLog:
2015-09-09  Andre Vieira  

  Backport from mainline:
  2015-08-27  Andre Vieira  

  * gcc.target/aarch64/long_branch_1.c: New test.
From 5b9f35c3a6dff67328e66e82f33ef3dc732ff5f7 Mon Sep 17 00:00:00 2001
From: Andre Simoes Dias Vieira 
Date: Tue, 8 Sep 2015 16:51:14 +0100
Subject: [PATCH] Backport fix for far branches

---
 gcc/config/aarch64/aarch64-protos.h  |  1 +
 gcc/config/aarch64/aarch64.c | 23 ++
 gcc/config/aarch64/aarch64.md| 89 +++
 gcc/config/aarch64/iterators.md  |  6 ++
 gcc/testsuite/gcc.target/aarch64/long_branch_1.c | 91 
 5 files changed, 195 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/long_branch_1.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 59c5824f894cf5dafe93a996180056696518feb4..8669694bfc33cc040baeab5aaddc7a0575071add 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -258,6 +258,7 @@ void aarch64_init_expanders (void);
 void aarch64_print_operand (FILE *, rtx, char);
 void aarch64_print_operand_address (FILE *, rtx);
 void aarch64_emit_call_insn (rtx);
+const char * aarch64_gen_far_branch (rtx *, int, const char *, const char *);
 
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index cd20b48d7fd300ab496e67cae0d6e503ac305409..c2e4252cf689a4d7b28890743095536f3a390338 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1029,6 +1029,29 @@ aarch64_split_128bit_move_p (rtx dst, rtx src)
 
 /* Split a complex SIMD combine.  */
 
+/* Generate code to enable conditional branches in functions over 1 MiB.  */
+const char *
+aarch64_gen_far_branch (rtx * operands, int pos_label, const char * dest,
+			const char * branch_format)
+{
+rtx_code_label * tmp_label = gen_label_rtx ();
+char label_buf[256];
+char buffer[128];
+ASM_GENERATE_INTERNAL_LABEL (label_buf, dest,
+ CODE_LABEL_NUMBER (tmp_label));
+const char *label_ptr = targetm.strip_name_encoding (label_buf);
+rtx dest_label = operands[pos_label];
+operands[pos_label] = tmp_label;
+
+snprintf (buffer, sizeof (buffer), "%s%s", branch_format, label_ptr);
+output_asm_insn (buffer, operands);
+
+snprintf (buffer, sizeof (buffer), "b\t%%l%d\n%s:", pos_label, label_ptr);
+operands[pos_label] = dest_label;
+output_asm_insn (buffer, operands);
+return "";
+}
+
 void
 aarch64_split_simd_combine (rtx dst, rtx src1, rtx src2)
 {
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 4e7a6d19ac2ff8067912052697ca7a1fc403f910..c3c3beb75b4d4486d5b235e31b4db7bd213c7c7b 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -179,6 +179,13 @@
 	 (const_string "no")
 	] (const_string "yes")))
 
+;; Attribute that specifies whether we are dealing with a branch to a
+;; label that is far away, i.e. further away than the maximum/minimum
+;; representable in a signed 21-bits number.
+;; 0 :=: no
+;; 1 :=: yes
+(define_attr "far_branch" "" (const_int 0))
+
 ;; ---
 ;; Pipeline descriptions and scheduling
 ;; ---
@@ -306,8 +313,23 @@
 			   (label_ref (match_operand 2 "" ""))
 			   (pc)))]
   ""
-  "b%m0\\t%l2"
-  [(set_attr "type" "branch")]
+  {
+if (get_attr_length (insn) == 8)
+  return aarch64_gen_far_branch (operands, 2, "Lbcond", "b%M0\\t");
+else
+  return  "b%m0\\t%l2";
+  }
+  [(set_attr "type" "branch")
+   (set (attr "length")
+	(

Re: [RFC][PATCH, ARM 1/8] Add support for ARMv8-M's Security Extensions flag and intrinsics

2016-01-05 Thread Andre Vieira

On 31/12/15 20:54, Joseph Myers wrote:

On Sat, 26 Dec 2015, Thomas Preud'homme wrote:


+#define CMSE_TT_ASM(flags) \
+{ \
+  cmse_address_info_t result; \
+   __asm__ ("tt" # flags " %0,%1" \
+  : "=r"(result) \
+  : "r"(p) \
+  : "memory"); \
+  return result; \


Are the identifiers "result" and "p" really meant to be reserved by this
header (so that users can't have macros with those names before including
it), or should they actually be __result and __p (and likewise for any
other identifiers in this file not specified as reserved)?


+__extension__ void *
+cmse_check_address_range (void *p, size_t size, int flags);


Are "size" and "flags" really meant to be reserved?


+@item -mcmse
+@opindex mcmse
+Generate secure code as per ARMv8-M Security Extensions.


I think you also need a section in extend.texi much like the existing ACLE
section, to describe support for this as a language extension.



I'll change all non-reserved and 'not-meant-for-export' identifiers to be
preceded by '__', and I'll also look into adding a section for ARMv8-M
Security Extensions (CMSE) to extend.texi.
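
Applied to the macro quoted above, the rename would look something like this
(a sketch of the stated plan, not the final committed header):

```c
#define CMSE_TT_ASM(flags) \
{ \
  cmse_address_info_t __result; \
  __asm__ ("tt" # flags " %0,%1" \
	   : "=r" (__result) \
	   : "r" (__p) \
	   : "memory"); \
  return __result; \
}
```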


Thank you for your feedback.

BR,
Andre


Re: [Ping^2][PATCH][GCC][ARM] testcase memset-inline-10.c uses -mfloat-abi=hard but does not check whether target supports it

2016-01-05 Thread Andre Vieira

On 27/11/15 14:28, Andre Vieira wrote:

On 12/11/15 15:16, Andre Vieira wrote:

On 12/11/15 15:08, Andre Vieira wrote:

Hi,

   This patch changes the memset-inline-10.c testcase to make sure that
it is only compiled for ARM targets that support -mfloat-abi=hard, using
the fact that all non-thumb1 targets do.

   This is correct because all targets for which -mthumb causes the
compiler to use thumb2 will support the generation of FP instructions.

   Tested by running regressions for this testcase for various ARM
targets.

   Is this OK to commit?

   Thanks,
   Andre Vieira

gcc/testsuite/ChangeLog:
2015-11-06  Andre Vieira  

 * gcc.target/arm/memset-inline-10.c: Added
 dg-require-effective-target arm_thumb2_ok.


Now with attachment, sorry about that.

Cheers,
Andre


Ping.



Ping.


[PATCH 1/2] aarch64: Split FCMA feature bit from Armv8.3-A

2024-10-02 Thread Andre Vieira

This patch splits out FCMA as a feature from Armv8.3-A and adds it as a separate
feature bit which now controls 'TARGET_COMPLEX'.

gcc/ChangeLog:

* config/aarch64/aarch64-arches.def (FCMA): New feature bit; cannot be
used as an extension on the command line.
* config/aarch64/aarch64.h (TARGET_COMPLEX): Use FCMA feature bit
rather than ARMV8_3.
---
 gcc/config/aarch64/aarch64-arches.def| 2 +-
 gcc/config/aarch64/aarch64-option-extensions.def | 1 +
 gcc/config/aarch64/aarch64.h | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-arches.def b/gcc/config/aarch64/aarch64-arches.def
index 4634b272e28..fadf9c36b03 100644
--- a/gcc/config/aarch64/aarch64-arches.def
+++ b/gcc/config/aarch64/aarch64-arches.def
@@ -33,7 +33,7 @@
 AARCH64_ARCH("armv8-a",   generic_armv8_a,   V8A,   8,  (SIMD))
 AARCH64_ARCH("armv8.1-a", generic_armv8_a,   V8_1A, 8,  (V8A, LSE, CRC, RDMA))
 AARCH64_ARCH("armv8.2-a", generic_armv8_a,   V8_2A, 8,  (V8_1A))
-AARCH64_ARCH("armv8.3-a", generic_armv8_a,   V8_3A, 8,  (V8_2A, PAUTH, RCPC))
+AARCH64_ARCH("armv8.3-a", generic_armv8_a,   V8_3A, 8,  (V8_2A, PAUTH, RCPC, FCMA))
 AARCH64_ARCH("armv8.4-a", generic_armv8_a,   V8_4A, 8,  (V8_3A, F16FML, DOTPROD, FLAGM))
 AARCH64_ARCH("armv8.5-a", generic_armv8_a,   V8_5A, 8,  (V8_4A, SB, SSBS, PREDRES))
 AARCH64_ARCH("armv8.6-a", generic_armv8_a,   V8_6A, 8,  (V8_5A, I8MM, BF16))
diff --git a/gcc/config/aarch64/aarch64-option-extensions.def b/gcc/config/aarch64/aarch64-option-extensions.def
index 6998627f377..4732c20ec96 100644
--- a/gcc/config/aarch64/aarch64-option-extensions.def
+++ b/gcc/config/aarch64/aarch64-option-extensions.def
@@ -193,6 +193,7 @@ AARCH64_OPT_EXTENSION("sve2-sm4", SVE2_SM4, (SVE2, SM4), (), (), "svesm4")
 AARCH64_FMV_FEATURE("sve2-sm4", SVE_SM4, (SVE2_SM4))
 
 AARCH64_OPT_FMV_EXTENSION("sme", SME, (BF16, SVE2), (), (), "sme")
+AARCH64_OPT_EXTENSION("", FCMA, (), (), (), "fcma")
 
 AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index a99e7bb6c47..c0ad305e324 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -362,7 +362,7 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE ATTRIBUTE_UNUSED
 #define TARGET_JSCVT	(TARGET_FLOAT && TARGET_ARMV8_3)
 
 /* Armv8.3-a Complex number extension to AdvSIMD extensions.  */
-#define TARGET_COMPLEX (TARGET_SIMD && TARGET_ARMV8_3)
+#define TARGET_COMPLEX (TARGET_SIMD && AARCH64_HAVE_ISA (FCMA))
 
 /* Floating-point rounding instructions from Armv8.5-a.  */
 #define TARGET_FRINT (AARCH64_HAVE_ISA (V8_5A) && TARGET_FLOAT)


[PATCH 2/2] aarch64: remove SVE2 requirement from SME and diagnose it as unsupported

2024-10-02 Thread Andre Vieira

As per the AArch64 ISA, FEAT_SME does not require FEAT_SVE2, so we are removing
that false dependency in GCC.  However, we chose for now not to support this
combination of features, and will diagnose the combination of FEAT_SME without
FEAT_SVE2 as unsupported by GCC.  We may choose to support this in the future.

gcc/ChangeLog:

* config/aarch64/aarch64-arches.def (SME): Remove SVE2 as prerequisite
and add in FCMA and F16FML.
* config/aarch64/aarch64.cc (aarch64_override_options): Diagnose use of
SME without SVE2.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sve/acle/general-c/binary_int_opt_single_n_2.c:
Pass +sve2 to existing +sme pragma.
* gcc.target/aarch64/sve/acle/general-c/binary_opt_single_n_2.c:
Likewise.
* gcc.target/aarch64/sve/acle/general-c/binary_single_1.c: Likewise.
* gcc.target/aarch64/sve/acle/general-c/binaryxn_2.c: Likewise.
* gcc.target/aarch64/sve/acle/general-c/clamp_1.c: Likewise.
* gcc.target/aarch64/sve/acle/general-c/compare_scalar_count_1.c:
Likewise.
* gcc.target/aarch64/sve/acle/general-c/shift_right_imm_narrowxn_1.c:
Likewise.
* gcc.target/aarch64/sve/acle/general-c/storexn_1.c: Likewise.
* gcc.target/aarch64/sve/acle/general-c/ternary_qq_or_011_lane_1.c:
Likewise.
* gcc.target/aarch64/sve/acle/general-c/unary_convertxn_1.c: Likewise.
* gcc.target/aarch64/sve/acle/general-c/unaryxn_1.c: Likewise.
---
 gcc/config/aarch64/aarch64-option-extensions.def  | 3 ++-
 gcc/config/aarch64/aarch64.cc | 4 
 .../aarch64/sve/acle/general-c/binary_int_opt_single_n_2.c| 2 +-
 .../aarch64/sve/acle/general-c/binary_opt_single_n_2.c| 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/binary_single_1.c   | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/binaryxn_2.c| 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/clamp_1.c | 2 +-
 .../aarch64/sve/acle/general-c/compare_scalar_count_1.c   | 2 +-
 .../aarch64/sve/acle/general-c/shift_right_imm_narrowxn_1.c   | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/storexn_1.c | 2 +-
 .../aarch64/sve/acle/general-c/ternary_qq_or_011_lane_1.c | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/unary_convertxn_1.c | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/unaryxn_1.c | 2 +-
 13 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-option-extensions.def b/gcc/config/aarch64/aarch64-option-extensions.def
index 4732c20ec96..e38a4ab3f78 100644
--- a/gcc/config/aarch64/aarch64-option-extensions.def
+++ b/gcc/config/aarch64/aarch64-option-extensions.def
@@ -192,9 +192,10 @@ AARCH64_OPT_EXTENSION("sve2-sm4", SVE2_SM4, (SVE2, SM4), (), (), "svesm4")
 
 AARCH64_FMV_FEATURE("sve2-sm4", SVE_SM4, (SVE2_SM4))
 
-AARCH64_OPT_FMV_EXTENSION("sme", SME, (BF16, SVE2), (), (), "sme")
 AARCH64_OPT_EXTENSION("", FCMA, (), (), (), "fcma")
 
+AARCH64_OPT_FMV_EXTENSION("sme", SME, (BF16, FCMA, F16FML), (), (), "sme")
+
 AARCH64_OPT_EXTENSION("memtag", MEMTAG, (), (), (), "")
 
 AARCH64_OPT_FMV_EXTENSION("sb", SB, (), (), (), "sb")
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 68913beaee2..bc2023da180 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -18998,6 +18998,10 @@ aarch64_override_options (void)
  while processing functions with potential target attributes.  */
   target_option_default_node = target_option_current_node
 = build_target_option_node (&global_options, &global_options_set);
+
+  if (TARGET_SME && !TARGET_SVE2)
+warning (0, "this gcc version does not guarantee full support for +sme"
+		" without +sve2");
 }
 
 /* Implement targetm.override_options_after_change.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_int_opt_single_n_2.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_int_opt_single_n_2.c
index 976d5af7f23..7150d37a2aa 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_int_opt_single_n_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_int_opt_single_n_2.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 
-#pragma GCC target "+sme2"
+#pragma GCC target "+sve2+sme2"
 
 #include 
 
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_opt_single_n_2.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_opt_single_n_2.c
index 5cc8a4c5c50..2823264edbd 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_opt_single_n_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_opt_single_n_2.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 
-#pragma GCC target "+sme2"
+#pragma GCC target "+sve2+sme2"
 
 #include 
 
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_single_1.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/binary_single_1

[PATCH 0/2] aarch64: remove SVE2 requirement from SME and diagnose it as unsupported

2024-10-02 Thread Andre Vieira
This patch series removes the requirement of SVE2 for SME, so that when a user
passes +sme, SVE2 is no longer enabled as a result.
We do this to be compliant with the ISA and to behave in a manner compatible
with other toolchains, preventing unexpected behavior when switching between them.

However, for the time being we diagnose the use of SME without SVE2 as
unsupported.  We suspect that the backend correctly enables and disables the
right instructions given the options, but we believe that for certain codegen
there are assumptions that SVE & SVE2 are present when using SME.  Before we
fully support this combination we should investigate these.
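
Concretely, the intended behavior looks like this (hypothetical command
lines; armv8.6-a is chosen here because it does not include SVE2):

```
$ gcc -march=armv8.6-a+sme ...
  warning: this gcc version does not guarantee full support
           for +sme without +sve2
$ gcc -march=armv8.6-a+sve2+sme ...
  (no warning; both features enabled explicitly)
```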

The patch series also refactors the FCMA/COMPNUM/TARGET_COMPLEX feature to
separate it from Armv8.3-A feature set.

Andre Vieira (2)
 aarch64: Split FCMA feature bit from Armv8.3-A
 aarch64: remove SVE2 requirement from SME and diagnose it as unsupported

Regression tested on aarch64-none-linux-gnu.

OK for trunk?

Andre Vieira (2):
  aarch64: Split FCMA feature bit from Armv8.3-A
  aarch64: remove SVE2 requirement from SME and diagnose it as unsupported

 gcc/config/aarch64/aarch64-arches.def | 2 +-
 gcc/config/aarch64/aarch64-option-extensions.def  | 4 +++-
 gcc/config/aarch64/aarch64.cc | 4 
 gcc/config/aarch64/aarch64.h  | 2 +-
 .../aarch64/sve/acle/general-c/binary_int_opt_single_n_2.c| 2 +-
 .../aarch64/sve/acle/general-c/binary_opt_single_n_2.c| 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/binary_single_1.c   | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/binaryxn_2.c| 2 +-
 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/clamp_1.c | 2 +-
 .../aarch64/sve/acle/general-c/compare_scalar_count_1.c   | 2 +-
 .../aarch64/sve/acle/general-c/shift_right_imm_narrowxn_1.c   | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/storexn_1.c | 2 +-
 .../aarch64/sve/acle/general-c/ternary_qq_or_011_lane_1.c | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/unary_convertxn_1.c | 2 +-
 .../gcc.target/aarch64/sve/acle/general-c/unaryxn_1.c | 2 +-
 15 files changed, 20 insertions(+), 14 deletions(-)

-- 
2.25.1



[PATCH 1/3] arm, mve: Fix scan-assembler for test7 in dlstp-compile-asm-2.c

2024-11-29 Thread Andre Vieira

After recent changes, the vctp intrinsic codegen changed slightly and we now
unfortunately seem to be generating unneeded moves and extends of the mask.
These are however not incorrect, and we don't have a fix for the unneeded
codegen right now, so change the testcase to accept them so we can catch
other changes if they occur.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/dlstp-compile-asm-2.c (test7): Add an optional
vmsr to the check-function-bodies.
---
 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
index c62f592a60d..fd3f68ce5b2 100644
--- a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
@@ -216,6 +216,7 @@ void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
 **...
 **	dlstp.32	lr, r3
 **	vldrw.32	q[0-9]+, \[r0\], #16
+**	(?:vmsr	p0, .*)
 **	vpst
 **	vldrwt.32	q[0-9]+, \[r1\], #16
 **	vadd.i32	(q[0-9]+), q[0-9]+, q[0-9]+


[PATCH 0/3] arm, mve: Fix DLSTP testism and issue after changes in codegen

2024-11-29 Thread Andre Vieira
This does not really need to be a patch series, but sending it all together is
easier and the patches share one common goal: cleaning up the DLSTP
implementation, new to GCC 16, after some codegen changes.  The first two
patches clean up some testcases and the last fixes an actual issue that had
gone unnoticed until now.

Only tested on arm-none-eabi mve.exp=dlstp*. OK for trunk?

Andre Vieira (3):
  arm, mve: Fix scan-assembler for test7 in dlstp-compile-asm-2.c
  arm, mve: Pass -std=c99 to dlstp-loop-form.c to avoid new warning
  arm, mve: Detect uses of vctp_vpr_generated inside subregs

 gcc/config/arm/arm.cc |  3 ++-
 .../gcc.target/arm/mve/dlstp-compile-asm-2.c  |  1 +
 .../gcc.target/arm/mve/dlstp-invalid-asm.c| 20 ++-
 .../gcc.target/arm/mve/dlstp-loop-form.c  |  2 +-
 4 files changed, 23 insertions(+), 3 deletions(-)

-- 
2.25.1



[PATCH 2/3] arm, mve: Pass -std=c99 to dlstp-loop-form.c to avoid new warning

2024-11-29 Thread Andre Vieira

This fixes a testism introduced by the warning produced with the -std=c23
default.  The testcase is a reduced piece of code meant to trigger an ICE, so
there's little value in trying to change the code itself.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/dlstp-loop-form.c: Add -std=c99 to avoid warning
message.
---
 gcc/testsuite/gcc.target/arm/mve/dlstp-loop-form.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-loop-form.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-loop-form.c
index 08811cef568..0f9589d7756 100644
--- a/gcc/testsuite/gcc.target/arm/mve/dlstp-loop-form.c
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-loop-form.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
-/* { dg-options "-Ofast" } */
+/* { dg-options "-Ofast -std=c99" } */
 /* { dg-add-options arm_v8_1m_mve_fp } */
 #pragma GCC arm "arm_mve_types.h"
 #pragma GCC arm "arm_mve.h" false


[PATCH 3/3] arm, mve: Detect uses of vctp_vpr_generated inside subregs

2024-11-29 Thread Andre Vieira

This addresses a problem where we were failing to detect uses of
vctp_vpr_generated in the analysis in 'arm_attempt_dlstp_transform' because
the use was inside a SUBREG and rtx_equal_p does not catch that.  Using
reg_overlap_mentioned_p is much more robust.
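
For illustration, a minimal sketch of the difference (GCC-internal RTL API;
the modes used here are illustrative, not the exact MVE predicate modes):

```c
/* Sketch only: a use of the VPR wrapped in a SUBREG, as in the
   failing case.  */
rtx vpr = gen_rtx_REG (HImode, VPR_REGNUM);
rtx use = gen_rtx_SUBREG (QImode, vpr, 0);

/* rtx_equal_p compares structure, so the SUBREG-wrapped use is
   missed, while reg_overlap_mentioned_p sees that it still reads
   the VPR register.  */
gcc_assert (!rtx_equal_p (vpr, use));
gcc_assert (reg_overlap_mentioned_p (vpr, use));
```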

gcc/ChangeLog:

* gcc/config/arm/arm.cc (arm_attempt_dlstp_transform): Use
reg_overlap_mentioned_p instead of rtx_equal_p to detect uses of
vctp_vpr_generated inside subregs.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/dlstp-invalid-asm.c (test10): Renamed to...
(test10a): ... this.
(test10b): Variation of test10a with a small change to trigger wrong
codegen.
---
 gcc/config/arm/arm.cc |  3 ++-
 .../gcc.target/arm/mve/dlstp-invalid-asm.c| 20 ++-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 7292fddef80..7f82fb94a56 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -35847,7 +35847,8 @@ arm_attempt_dlstp_transform (rtx label)
 	  df_ref insn_uses = NULL;
 	  FOR_EACH_INSN_USE (insn_uses, insn)
 	  {
-	if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	if (reg_overlap_mentioned_p (vctp_vpr_generated,
+	 DF_REF_REG (insn_uses)))
 	  {
 		end_sequence ();
 		return 1;
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
index 26df2d30523..f26754cc482 100644
--- a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -128,7 +128,7 @@ void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
 }
 
 /* Using a VPR that gets re-generated within the loop.  */
-void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+void test10a (int32_t *a, int32_t *b, int32_t *c, int n)
 {
   mve_pred16_t p = vctp32q (n);
   while (n > 0)
@@ -145,6 +145,24 @@ void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
 }
 }
 
+/* Using a VPR that gets re-generated within the loop.  */
+void test10b (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n-4);
+  while (n > 0)
+{
+  int32x4_t va = vldrwq_z_s32 (a, p);
+  p = vctp32q (n);
+  int32x4_t vb = vldrwq_z_s32 (b, p);
+  int32x4_t vc = vaddq_x_s32 (va, vb, p);
+  vstrwq_p_s32 (c, vc, p);
+  c += 4;
+  a += 4;
+  b += 4;
+  n -= 4;
+}
+}
+
 /* Using vctp32q_m instead of vctp32q.  */
 void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
 {


[PATCH v2 1/3] arm, mve: Fix scan-assembler for test7 in dlstp-compile-asm-2.c

2024-11-29 Thread Andre Vieira

After recent changes, the vctp intrinsic codegen changed slightly and we now
unfortunately seem to be generating unneeded moves and extends of the mask.
These are however not incorrect, and we don't have a fix for the unneeded
codegen right now, so change the testcase to accept them so we can catch
other changes if they occur.

gcc/testsuite/ChangeLog:

PR target/117814
* gcc.target/arm/mve/dlstp-compile-asm-2.c (test7): Add an optional
vmsr to the check-function-bodies.
---
 gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
index c62f592a60d..21093044708 100644
--- a/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-compile-asm-2.c
@@ -216,7 +216,12 @@ void test7 (int32_t *a, int32_t *b, int32_t *c, int n, int g)
 **...
 **	dlstp.32	lr, r3
 **	vldrw.32	q[0-9]+, \[r0\], #16
+** (
+**	vmsr	p0, .*
 **	vpst
+** |
+**	vpst
+** )
 **	vldrwt.32	q[0-9]+, \[r1\], #16
 **	vadd.i32	(q[0-9]+), q[0-9]+, q[0-9]+
 **	vstrw.32	\1, \[r2\], #16


[PATCH v2 3/3] arm, mve: Detect uses of vctp_vpr_generated inside subregs

2024-11-29 Thread Andre Vieira

This addresses a problem where we were failing to detect uses of
vctp_vpr_generated in the analysis in 'arm_attempt_dlstp_transform' because
the use was inside a SUBREG and rtx_equal_p does not catch that.  Using
reg_overlap_mentioned_p is much more robust.

gcc/ChangeLog:

PR target/117814
* gcc/config/arm/arm.cc (arm_attempt_dlstp_transform): Use
reg_overlap_mentioned_p instead of rtx_equal_p to detect uses of
vctp_vpr_generated inside subregs.

gcc/testsuite/ChangeLog:

PR target/117814
* gcc.target/arm/mve/dlstp-invalid-asm.c (test10): Renamed to...
(test10a): ... this.
(test10b): Variation of test10a with a small change to trigger wrong
codegen.
---
 gcc/config/arm/arm.cc |  3 +-
 .../gcc.target/arm/mve/dlstp-invalid-asm.c| 37 ++-
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 7292fddef80..7f82fb94a56 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -35847,7 +35847,8 @@ arm_attempt_dlstp_transform (rtx label)
 	  df_ref insn_uses = NULL;
 	  FOR_EACH_INSN_USE (insn_uses, insn)
 	  {
-	if (rtx_equal_p (vctp_vpr_generated, DF_REF_REG (insn_uses)))
+	if (reg_overlap_mentioned_p (vctp_vpr_generated,
+	 DF_REF_REG (insn_uses)))
 	  {
 		end_sequence ();
 		return 1;
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
index 26df2d30523..eb0782ebd0d 100644
--- a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -127,8 +127,15 @@ void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
 }
 }
 
-/* Using a VPR that gets re-generated within the loop.  */
-void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+/* Using a VPR that gets re-generated within the loop.  Even though we
+   currently reject such loops, it would be possible to dlstp transform this
+   specific loop, as long as we make sure that the first vldrwq_z mask would
+   either:
+   * remain the same as its mask in the first iteration,
+   * become the same as the loop mask after the first iteration,
+   * become all ones, since the dlstp would then mask it the same as the loop
+   mask.  */
+void test10a (int32_t *a, int32_t *b, int32_t *c, int n)
 {
   mve_pred16_t p = vctp32q (n);
   while (n > 0)
@@ -145,6 +152,32 @@ void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
 }
 }
 
+/* Using a VPR that gets re-generated within the loop, the difference between
+   this test and test10a is to make sure the two vctp calls are never the same,
+   this leads to slightly different codegen in some cases triggering the issue
+   in a different way.   This loop too would be OK to dlstp transform as long
+   as we made sure that the first vldrwq_z mask would either:
+   * remain the same as its mask in the first iteration,
+   * become the same as the loop mask after the first iteration,
+   * become all ones, since the dlstp would then mask it the same as the loop
+   mask.  */
+void test10b (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n-4);
+  while (n > 0)
+{
+  int32x4_t va = vldrwq_z_s32 (a, p);
+  p = vctp32q (n);
+  int32x4_t vb = vldrwq_z_s32 (b, p);
+  int32x4_t vc = vaddq_x_s32 (va, vb, p);
+  vstrwq_p_s32 (c, vc, p);
+  c += 4;
+  a += 4;
+  b += 4;
+  n -= 4;
+}
+}
+
 /* Using vctp32q_m instead of vctp32q.  */
 void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
 {


[PATCH] arm, mve: Do not DLSTP transform loops if VCTP is not first

2024-11-28 Thread Andre Vieira
Hi,

This rejects any loop where a predicated instruction comes before the vctp
that generates the loop predicate.  Even though this is not a requirement for
the dlstp transformation, we have found potential issues where you can end up
with a wrong transformation, so it is safer to reject such loops.

OK for trunk?

gcc/ChangeLog:

* gcc/config/arm/arm.cc (arm_mve_get_loop_vctp): Reject loops with a
predicated instruction before the vctp.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/dlstp-invalid-asm.c (test10): Renamed to...
(test10a): ... this.
(test10b): Variation of test10a with a small change to trigger an
issue.
---
 gcc/config/arm/arm.cc | 21 ++-
 .../gcc.target/arm/mve/dlstp-invalid-asm.c| 20 +-
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index 7292fddef80..29c0f478f36 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -34749,11 +34749,22 @@ arm_mve_get_loop_vctp (basic_block bb)
  instruction.  We require arm_get_required_vpr_reg_param to be false
  to make sure we pick up a VCTP, rather than a VCTP_M.  */
   FOR_BB_INSNS (bb, insn)
-if (NONDEBUG_INSN_P (insn))
-  if (arm_get_required_vpr_reg_ret_val (insn)
-	  && (arm_mve_get_vctp_lanes (insn) != 0)
-	  && !arm_get_required_vpr_reg_param (insn))
-	return insn;
+{
+  if (!NONDEBUG_INSN_P (insn))
+	continue;
+  /* If we encounter a predicated instruction before the VCTP then we can
+	 not dlstp transform this loop because we would be imposing extra
+	 predication on that instruction which was not present in the original
+	 code.  */
+  if (arm_get_required_vpr_reg_param (insn))
+	return NULL;
+  if (arm_get_required_vpr_reg_ret_val (insn))
+	{
+	  if (arm_mve_get_vctp_lanes (insn) != 0)
+	return insn;
+	  return NULL;
+	}
+}
   return NULL;
 }
 
diff --git a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
index 26df2d30523..f26754cc482 100644
--- a/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
+++ b/gcc/testsuite/gcc.target/arm/mve/dlstp-invalid-asm.c
@@ -128,7 +128,7 @@ void test9 (int32_t *a, int32_t *b, int32_t *c, int n)
 }
 
 /* Using a VPR that gets re-generated within the loop.  */
-void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
+void test10a (int32_t *a, int32_t *b, int32_t *c, int n)
 {
   mve_pred16_t p = vctp32q (n);
   while (n > 0)
@@ -145,6 +145,24 @@ void test10 (int32_t *a, int32_t *b, int32_t *c, int n)
 }
 }
 
+/* Using a VPR that gets re-generated within the loop.  */
+void test10b (int32_t *a, int32_t *b, int32_t *c, int n)
+{
+  mve_pred16_t p = vctp32q (n-4);
+  while (n > 0)
+{
+  int32x4_t va = vldrwq_z_s32 (a, p);
+  p = vctp32q (n);
+  int32x4_t vb = vldrwq_z_s32 (b, p);
+  int32x4_t vc = vaddq_x_s32 (va, vb, p);
+  vstrwq_p_s32 (c, vc, p);
+  c += 4;
+  a += 4;
+  b += 4;
+  n -= 4;
+}
+}
+
 /* Using vctp32q_m instead of vctp32q.  */
 void test11 (int32_t *a, int32_t *b, int32_t *c, int n, mve_pred16_t p0)
 {


Re: [PATCH 6/8] vect: Add vector_mode parameter to simd_clone_usable

2023-09-28 Thread Andre Vieira (lists)




On 31/08/2023 07:39, Richard Biener wrote:

On Wed, Aug 30, 2023 at 5:02 PM Andre Vieira (lists)
 wrote:




On 30/08/2023 14:01, Richard Biener wrote:

On Wed, Aug 30, 2023 at 11:15 AM Andre Vieira (lists) via Gcc-patches
 wrote:


This patch adds a machine_mode parameter to the TARGET_SIMD_CLONE_USABLE
hook to enable rejecting SVE modes when the target architecture does not
support SVE.


How does the graph node of the SIMD clone lack this information?  That is, it
should have information on the types (and thus modes) for all formal arguments
and return values already, no?  At least the target would know how to
instantiate
it if it's not readily available at the point of use.



Yes it does, but those are the modes the simd clone itself uses; it does
not know what vector_mode we are currently vectorizing for.  That is
exactly why we need the vinfo's vector_mode, to make sure the simd clone
and its types are compatible with the vector mode.

In practice, this makes sure that SVE simd clones are only used in loops
being vectorized for SVE modes.  Having said that... I just realized that
the simdlen check already takes care of that currently...

by simdlen check I mean the one that writes off simdclones that match:
  if (!constant_multiple_p (vf, n->simdclone->simdlen, &num_calls)

However, when using -msve-vector-bits this will become an issue, as the
VF will be constant and we will match NEON simdclones.  This requires
some further attention though given that we now also reject the use of
SVE simdclones when using -msve-vector-bits, and I'm not entirely sure
we should...


Hmm, but vectorizable_simdclone should check for compatible types here
and if they are compatible why should we reject them?  Are -msve-vector-bits
"SVE" modes different from "NEON" modes?  I suppose not, because otherwise
the type compatibility check would say incompatible.

Prior to transformation we do all checks on the original scalar values,
not the vector types.  But I do believe you are right in that we don't
need to pass the vector_mode.  The simdlen check should be enough, and if
the length is the same or a multiple, the rest of the code should be able
to deal with that and any conversions when dealing with things like
SVE types that require the attribute.


I'll update the patch series soon and after that I'll look at how this 
reacts to -msve-vector-bits in more detail.


Thanks,
Andre


Re: Check that passes do not forget to define profile

2023-10-03 Thread Andre Vieira (lists)

Hi Honza,

My current patch set for AArch64 VLA omp codegen started failing on
gcc.dg/gomp/pr87898.c after this.  I traced it back to
'move_sese_region_to_fn' in tree-cfg.cc not setting the count for the bb
it creates.


I was able to 'fix' it locally by setting the count of the new bb to the
accumulation of e->count () of all the entry edges (if initialized).
I'm however not even close to certain that's the right approach; patch
attached for illustration.


Kind regards,
Andre

On 24/08/2023 14:14, Jan Hubicka via Gcc-patches wrote:

Hi,
this patch extends the verifier to check that all probabilities and counts are
initialized if the profile is supposed to be present.  This is a bit complicated
by the possibility that we inline a !flag_guess_branch_probability function
into a function with a profile defined, and in this case we need to stop
verification.  For this reason I added a flag to the cfg structure tracking this.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

* cfg.h (struct control_flow_graph): New field full_profile.
* auto-profile.cc (afdo_annotate_cfg): Set full_profile to true.
* cfg.cc (init_flow): Set full_profile to false.
* graphite.cc (graphite_transform_loops): Set full_profile to false.
* lto-streamer-in.cc (input_cfg): Initialize full_profile flag.
* predict.cc (pass_profile::execute): Set full_profile to true.
* symtab-thunks.cc (expand_thunk): Set full_profile to true.
* tree-cfg.cc (gimple_verify_flow_info): Verify that profile is full
if full_profile is set.
* tree-inline.cc (initialize_cfun): Initialize full_profile.
(expand_call_inline): Combine full_profile.


diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
index e3af3555e75..ff3b763945c 100644
--- a/gcc/auto-profile.cc
+++ b/gcc/auto-profile.cc
@@ -1578,6 +1578,7 @@ afdo_annotate_cfg (const stmt_set &promoted_stmts)
  }
update_max_bb_count ();
profile_status_for_fn (cfun) = PROFILE_READ;
+  cfun->cfg->full_profile = true;
if (flag_value_profile_transformations)
  {
gimple_value_profile_transformations ();
diff --git a/gcc/cfg.cc b/gcc/cfg.cc
index 9eb9916f61a..b7865f14e7f 100644
--- a/gcc/cfg.cc
+++ b/gcc/cfg.cc
@@ -81,6 +81,7 @@ init_flow (struct function *the_fun)
  = ENTRY_BLOCK_PTR_FOR_FN (the_fun);
the_fun->cfg->edge_flags_allocated = EDGE_ALL_FLAGS;
the_fun->cfg->bb_flags_allocated = BB_ALL_FLAGS;
+  the_fun->cfg->full_profile = false;
  }
  
  /* Helper function for remove_edge and free_cffg.  Frees edge structure
diff --git a/gcc/cfg.h b/gcc/cfg.h
index a0e944979c8..53e2553012c 100644
--- a/gcc/cfg.h
+++ b/gcc/cfg.h
@@ -78,6 +78,9 @@ struct GTY(()) control_flow_graph {
/* Dynamically allocated edge/bb flags.  */
int edge_flags_allocated;
int bb_flags_allocated;
+
+  /* Set if the profile is computed on every edge and basic block.  */
+  bool full_profile;
  };
  
  
diff --git a/gcc/graphite.cc b/gcc/graphite.cc

index 19f8975ffa2..2b387d5b016 100644
--- a/gcc/graphite.cc
+++ b/gcc/graphite.cc
@@ -512,6 +512,8 @@ graphite_transform_loops (void)
  
if (changed)

  {
+  /* FIXME: Graphite does not update profile meaningfully currently.  */
+  cfun->cfg->full_profile = false;
cleanup_tree_cfg ();
profile_status_for_fn (cfun) = PROFILE_ABSENT;
release_recorded_exits (cfun);
diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
index 0cce14414ca..d3128fcebe4 100644
--- a/gcc/lto-streamer-in.cc
+++ b/gcc/lto-streamer-in.cc
@@ -1030,6 +1030,7 @@ input_cfg (class lto_input_block *ib, class data_in 
*data_in,
basic_block p_bb;
unsigned int i;
int index;
+  bool full_profile = false;
  
init_empty_tree_cfg_for_function (fn);
  
@@ -1071,6 +1072,8 @@ input_cfg (class lto_input_block *ib, class data_in *data_in,

  data_in->location_cache.input_location_and_block (&e->goto_locus,
&bp, ib, data_in);
  e->probability = profile_probability::stream_in (ib);
+ if (!e->probability.initialized_p ())
+   full_profile = false;
  
  	}
  
@@ -1145,6 +1148,7 @@ input_cfg (class lto_input_block *ib, class data_in *data_in,
  
/* Rebuild the loop tree.  */

flow_loops_find (loops);
+  cfun->cfg->full_profile = full_profile;
  }
  
  
diff --git a/gcc/predict.cc b/gcc/predict.cc

index 5a1a561cc24..396746cbfd1 100644
--- a/gcc/predict.cc
+++ b/gcc/predict.cc
@@ -4131,6 +4131,7 @@ pass_profile::execute (function *fun)
  scev_initialize ();
  
tree_estimate_probability (false);

+  cfun->cfg->full_profile = true;
  
if (nb_loops > 1)

  scev_finalize ();
diff --git a/gcc/symtab-thunks.cc b/gcc/symtab-thunks.cc
index 4c04235c41b..23ead0d2138 100644
--- a/gcc/symtab-thunks.cc
+++ b/gcc/symtab-thunks.cc
@@ -648,6 +648,7 @@ expand_thunk (cgraph_node *node, bool output_asm_thunks,
  ? PROFILE_READ : PROFILE_GUESSED;
/*

Re: [PATCH7/8] vect: Add TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM

2023-10-04 Thread Andre Vieira (lists)




On 30/08/2023 14:04, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


This patch adds a new target hook to enable us to adapt the types of return
and parameters of simd clones.  We use this in two ways, the first one is to
make sure we can create valid SVE types, including the SVE type attribute,
when creating a SVE simd clone, even when the target options do not support
SVE.  We are following the same behaviour seen with x86 that creates simd
clones according to the ABI rules when no simdlen is provided, even if that
simdlen is not supported by the current target options.  Note that this
doesn't mean the simd clone will be used in auto-vectorization.


You are not documenting the bool parameter of the new hook.

What's wrong with doing the adjustment in TARGET_SIMD_CLONE_ADJUST?


simd_clone_adjust_argument_types is called after that hook, so by the
time we call TARGET_SIMD_CLONE_ADJUST the types are still scalar, not
vector.  The same is true for the return type.
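
To make the ordering concrete, here is a simplified sketch of the call
sequence as described above (assumed shape of omp-simd-clone.cc, not the
actual function body):

```c
static void
simd_clone_adjust_sketch (struct cgraph_node *node)
{
  /* The target hook runs first, while the clone's return and
     argument types are still scalar...  */
  targetm.simd_clone.adjust (node);

  /* ...and only these later steps turn them into vector types.  */
  simd_clone_adjust_return_type (node);
  simd_clone_adjust_argument_types (node);
}
```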


Also the changes to the types need to be taken into consideration in 
'adjustments' I think.


PS: I hope the subject line survived, my email client is having a bit of 
a wobble this morning... it's what you get for updating software :(


Re: [PATCH7/8] vect: Add TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM

2023-10-04 Thread Andre Vieira (lists)




On 04/10/2023 11:41, Richard Biener wrote:

On Wed, 4 Oct 2023, Andre Vieira (lists) wrote:




On 30/08/2023 14:04, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


This patch adds a new target hook to enable us to adapt the types of return
and parameters of simd clones.  We use this in two ways, the first one is
to
make sure we can create valid SVE types, including the SVE type attribute,
when creating a SVE simd clone, even when the target options do not support
SVE.  We are following the same behaviour seen with x86 that creates simd
clones according to the ABI rules when no simdlen is provided, even if that
simdlen is not supported by the current target options.  Note that this
doesn't mean the simd clone will be used in auto-vectorization.


You are not documenting the bool parameter of the new hook.

What's wrong with doing the adjustment in TARGET_SIMD_CLONE_ADJUST?


simd_clone_adjust_argument_types is called after that hook, so by the time we
call TARGET_SIMD_CLONE_ADJUST the types are still in scalar, not vector.  The
same is true for the return type one.

Also the changes to the types need to be taken into consideration in
'adjustments' I think.


Nothing in the three existing implementations of TARGET_SIMD_CLONE_ADJUST
relies on this ordering I think, how about moving the hook invocation
after simd_clone_adjust_argument_types?



But that wouldn't change the 'ipa_param_body_adjustments' for when we 
have a function definition and we need to redo the body.

Richard.


PS: I hope the subject line survived, my email client is having a bit of a
wobble this morning... it's what you get for updating software :(


Re: [PATCH] aarch64: enable mixed-types for aarch64 simdclones

2023-10-16 Thread Andre Vieira (lists)

Hey,

Just a minor update to the patch, I had missed the libgomp testsuite, so 
had to make some adjustments there too.


gcc/ChangeLog:

* config/aarch64/aarch64.cc (lane_size): New function.
(aarch64_simd_clone_compute_vecsize_and_simdlen): Determine simdlen
according to the NDS rule and reject combinations of simdlen and types
that lead to vectors larger than 128 bits.


gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add aarch64 targets to vect_simd_clones.
* c-c++-common/gomp/declare-variant-14.c: Adapt test for aarch64.
* c-c++-common/gomp/pr60823-1.c: Likewise.
* c-c++-common/gomp/pr60823-2.c: Likewise.
* c-c++-common/gomp/pr60823-3.c: Likewise.
* g++.dg/gomp/attrs-10.C: Likewise.
* g++.dg/gomp/declare-simd-1.C: Likewise.
* g++.dg/gomp/declare-simd-3.C: Likewise.
* g++.dg/gomp/declare-simd-4.C: Likewise.
* g++.dg/gomp/declare-simd-7.C: Likewise.
* g++.dg/gomp/declare-simd-8.C: Likewise.
* g++.dg/gomp/pr88182.C: Likewise.
* gcc.dg/declare-simd.c: Likewise.
* gcc.dg/gomp/declare-simd-1.c: Likewise.
* gcc.dg/gomp/declare-simd-3.c: Likewise.
* gcc.dg/gomp/pr87887-1.c: Likewise.
* gcc.dg/gomp/pr87895-1.c: Likewise.
* gcc.dg/gomp/pr89246-1.c: Likewise.
* gcc.dg/gomp/pr99542.c: Likewise.
* gcc.dg/gomp/simd-clones-2.c: Likewise.
* gcc.dg/vect/vect-simd-clone-1.c: Likewise.
* gcc.dg/vect/vect-simd-clone-2.c: Likewise.
* gcc.dg/vect/vect-simd-clone-4.c: Likewise.
* gcc.dg/vect/vect-simd-clone-5.c: Likewise.
* gcc.dg/vect/vect-simd-clone-8.c: Likewise.
* gfortran.dg/gomp/declare-simd-2.f90: Likewise.
* gfortran.dg/gomp/declare-simd-coarray-lib.f90: Likewise.
* gfortran.dg/gomp/declare-variant-14.f90: Likewise.
* gfortran.dg/gomp/pr79154-1.f90: Likewise.
* gfortran.dg/gomp/pr83977.f90: Likewise.

libgomp/testsuite/ChangeLog:

* libgomp.c/declare-variant-1.c: Adapt test for aarch64.
* libgomp.fortran/declare-simd-1.f90: Likewise.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9fbfc548a891f5d11940c6fd3c49a14bfbdec886..37507f091c2a6154fa944c3a9fad6a655ab5d5a1 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -27414,33 +27414,62 @@ supported_simd_type (tree t)
   return false;
 }
 
-/* Return true for types that currently are supported as SIMD return
-   or argument types.  */
+/* Determine the lane size for the clone argument/return type.  This follows
+   the LS(P) rule in the VFABIA64.  */
 
-static bool
-currently_supported_simd_type (tree t, tree b)
+static unsigned
+lane_size (cgraph_simd_clone_arg_type clone_arg_type, tree type)
 {
-  if (COMPLEX_FLOAT_TYPE_P (t))
-return false;
+  gcc_assert (clone_arg_type != SIMD_CLONE_ARG_TYPE_MASK);
 
-  if (TYPE_SIZE (t) != TYPE_SIZE (b))
-return false;
+  /* For non map-to-vector types that are pointers we use the element type it
+ points to.  */
+  if (POINTER_TYPE_P (type))
+switch (clone_arg_type)
+  {
+  default:
+   break;
+  case SIMD_CLONE_ARG_TYPE_UNIFORM:
+  case SIMD_CLONE_ARG_TYPE_LINEAR_CONSTANT_STEP:
+  case SIMD_CLONE_ARG_TYPE_LINEAR_VARIABLE_STEP:
+   type = TREE_TYPE (type);
+   break;
+  }
 
-  return supported_simd_type (t);
+  /* For types (or types pointers of non map-to-vector types point to) that are
+ integers or floating point, we use their size if they are 1, 2, 4 or 8.
+   */
+  if (INTEGRAL_TYPE_P (type)
+  || SCALAR_FLOAT_TYPE_P (type))
+  switch (TYPE_PRECISION (type) / BITS_PER_UNIT)
+   {
+   default:
+ break;
+   case 1:
+   case 2:
+   case 4:
+   case 8:
+ return TYPE_PRECISION (type);
+   }
+  /* For any other we use the size of uintptr_t.  For map-to-vector types that
+ are pointers, using the size of uintptr_t is the same as using the size of
+ their type, seeing all pointers are the same size as uintptr_t.  */
+  return POINTER_SIZE;
 }
 
+
 /* Implement TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN.  */
 
 static int
 aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
struct cgraph_simd_clone *clonei,
-   tree base_type, int num,
-   bool explicit_p)
+   tree base_type ATTRIBUTE_UNUSED,
+   int num, bool explicit_p)
 {
   tree t, ret_type;
-  unsigned int elt_bits, count;
+  unsigned int nds_elt_bits;
+  int count;
   unsigned HOST_WIDE_INT const_simdlen;
-  poly_uint64 vec_bits;
 
   if (!TARGET_SIMD)
 return 0;
@@ -27460,80 +27489,135 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
(struct cgraph_node *node,
 }
 
   ret_typ

Re: Check that passes do not forget to define profile

2023-10-17 Thread Andre Vieira (lists)

So OK to commit this?

This patch makes sure the profile_count information is initialized for
the new bb created in move_sese_region_to_fn.

gcc/ChangeLog:

* tree-cfg.cc (move_sese_region_to_fn): Initialize profile_count for
new basic block.

Bootstrapped and regression tested on aarch64-unknown-linux-gnu and 
x86_64-pc-linux-gnu.


On 04/10/2023 12:02, Jan Hubicka wrote:

Hi Honza,

My current patch set for AArch64 VLA omp codegen started failing on
gcc.dg/gomp/pr87898.c after this. I traced it back to
'move_sese_region_to_fn' in tree/cfg.cc not setting count for the bb
created.

I was able to 'fix' it locally by setting the count of the new bb to the
accumulation of e->count () of all the entry_endges (if initialized). I'm
however not even close to certain that's the right approach, attached patch
for illustration.

Kind regards,
Andre
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc



index 
ffab7518b1568b58e610e26feb9e3cab18ddb3c2..32fc47ae683164bf8fac477fbe6e2c998382e754
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -8160,11 +8160,15 @@ move_sese_region_to_fn (struct function *dest_cfun, 
basic_block entry_bb,
bb = create_empty_bb (entry_pred[0]);
if (current_loops)
  add_bb_to_loop (bb, loop);
+  profile_count count = profile_count::zero ();
for (i = 0; i < num_entry_edges; i++)
  {
e = make_edge (entry_pred[i], bb, entry_flag[i]);
e->probability = entry_prob[i];
+  if (e->count ().initialized_p ())
+   count += e->count ();
  }
+  bb->count = count;


This looks generally right - if you create a BB you need to set its
count and unless it has self-loop that is the sum of counts of
incommping edges.

However, the initialized_p check should be unnecessary: if one of the entry
edges to BB is uninitialized, the + operation will make the bb count
uninitialized too, which is OK.

Honza
  
for (i = 0; i < num_exit_edges; i++)

  {
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
ffab7518b1568b58e610e26feb9e3cab18ddb3c2..ffeb20b717aead756844c5f48c2cc23f5e9f14a6
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -8160,11 +8160,14 @@ move_sese_region_to_fn (struct function *dest_cfun, 
basic_block entry_bb,
   bb = create_empty_bb (entry_pred[0]);
   if (current_loops)
 add_bb_to_loop (bb, loop);
+  profile_count count = profile_count::zero ();
   for (i = 0; i < num_entry_edges; i++)
 {
   e = make_edge (entry_pred[i], bb, entry_flag[i]);
   e->probability = entry_prob[i];
+  count += e->count ();
 }
+  bb->count = count;
 
   for (i = 0; i < num_exit_edges; i++)
 {


Re: aarch64, vect, omp: Add SVE support for simd clones [PR 96342]

2023-10-18 Thread Andre Vieira (lists)

Hi,

I noticed I had missed one of the preparatory patches at the start of
this series (the first one); it is added now.  I also removed 'vect: Add
vector_mode parameter to simd_clone_usable', since after review we no
longer deemed it necessary, and replaced the old 'vect: Add
TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM' with 'omp: Reorder call for
TARGET_SIMD_CLONE_ADJUST' after comments.


Bootstrapped and regression tested the series on 
aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu.



Andre Vieira (8):

omp: Replace simd_clone_subparts with TYPE_VECTOR_SUBPARTS [NEW]
parloops: Copy target and optimizations when creating a function clone 
[Reviewed]
parloops: Allow poly nit and bound [Cond Reviewed, made the requested 
changes]
vect: Fix vect_get_smallest_scalar_type for simd clones [First review,
made the requested changes, OK?]
vect: don't allow fully masked loops with non-masked simd clones [PR 
110485] [Reviewed]

vect: Use inbranch simdclones in masked loops [Needs review]
vect: omp: Reorder call for TARGET_SIMD_CLONE_ADJUST [NEW]
aarch64: Add SVE support for simd clones [PR 96342] [Needs review]

PS: apologies for the inconsistent numbering of the emails, things got a 
bit confusing with removing and adding patches to the series.


On 30/08/2023 09:49, Andre Vieira (lists) via Gcc-patches wrote:

Hi,

This patch series aims to implement support for SVE simd clones when not 
specifying a 'simdlen' clause for AArch64. This patch depends on my 
earlier patch: '[PATCH] aarch64: enable mixed-types for aarch64 
simdclones'.


Bootstrapped and regression tested the series on 
aarch64-unknown-linux-gnu and x86_64-pc-linux-gnu. I also tried building 
the patches separately, but that was before some further clean-up 
restructuring, so will do that again prior to pushing.


Andre Vieira (8):

parloops: Copy target and optimizations when creating a function clone
parloops: Allow poly nit and bound
vect: Fix vect_get_smallest_scalar_type for simd clones
vect: don't allow fully masked loops with non-masked simd clones [PR 
110485]

vect: Use inbranch simdclones in masked loops
vect: Add vector_mode parameter to simd_clone_usable
vect: Add TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM
aarch64: Add SVE support for simd clones [PR 96342]


Re: [PATCH 1/8] parloops: Copy target and optimizations when creating a function clone

2023-10-18 Thread Andre Vieira (lists)

Just posting a rebase for completion.

On 30/08/2023 13:31, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:



SVE simd clones need to be compiled with an SVE target enabled or the
argument types will not be created properly.  To achieve this we need to copy
DECL_FUNCTION_SPECIFIC_TARGET from the original function declaration to the
clones.  I decided it was probably also a good idea to copy
DECL_FUNCTION_SPECIFIC_OPTIMIZATION in case the original function is meant to
be compiled with specific optimization options.


OK.


gcc/ChangeLog:

* tree-parloops.cc (create_loop_fn): Copy specific target and
optimization options to clone.

diff --git a/gcc/tree-parloops.cc b/gcc/tree-parloops.cc
index 
e495bbd65270bdf90bae2c4a2b52777522352a77..a35f3d5023b06e5ef96eb4222488fcb34dd7bd45
 100644
--- a/gcc/tree-parloops.cc
+++ b/gcc/tree-parloops.cc
@@ -2203,6 +2203,11 @@ create_loop_fn (location_t loc)
   DECL_CONTEXT (t) = decl;
   TREE_USED (t) = 1;
   DECL_ARGUMENTS (decl) = t;
+  DECL_FUNCTION_SPECIFIC_TARGET (decl)
+= DECL_FUNCTION_SPECIFIC_TARGET (act_cfun->decl);
+  DECL_FUNCTION_SPECIFIC_OPTIMIZATION (decl)
+= DECL_FUNCTION_SPECIFIC_OPTIMIZATION (act_cfun->decl);
+
 
   allocate_struct_function (decl, false);
 


Re: [Patch 2/8] parloops: Allow poly nit and bound

2023-10-18 Thread Andre Vieira (lists)

Posting the changed patch for completion, already reviewed.

On 30/08/2023 13:32, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


Teach parloops how to handle a poly nit and bound ahead of the changes to
enable non-constant simdlen.


Can you use poly_int_tree_p to combine INTEGER_CST || POLY_INT_CST please?

OK with that change.


gcc/ChangeLog:

* tree-parloops.cc (try_to_transform_to_exit_first_loop_alt): Accept
poly NIT and ALT_BOUND.

diff --git a/gcc/tree-parloops.cc b/gcc/tree-parloops.cc
index 
a35f3d5023b06e5ef96eb4222488fcb34dd7bd45..80f3dd6dce281e1eb1d76d38bd09e6638a875142
 100644
--- a/gcc/tree-parloops.cc
+++ b/gcc/tree-parloops.cc
@@ -2531,14 +2531,15 @@ try_transform_to_exit_first_loop_alt (class loop *loop,
   tree nit_type = TREE_TYPE (nit);
 
   /* Figure out whether nit + 1 overflows.  */
-  if (TREE_CODE (nit) == INTEGER_CST)
+  if (poly_int_tree_p (nit))
 {
   if (!tree_int_cst_equal (nit, TYPE_MAX_VALUE (nit_type)))
{
  alt_bound = fold_build2_loc (UNKNOWN_LOCATION, PLUS_EXPR, nit_type,
   nit, build_one_cst (nit_type));
 
- gcc_assert (TREE_CODE (alt_bound) == INTEGER_CST);
+ gcc_assert (TREE_CODE (alt_bound) == INTEGER_CST
+ || TREE_CODE (alt_bound) == POLY_INT_CST);
  transform_to_exit_first_loop_alt (loop, reduction_list, alt_bound);
  return true;
}


Re: [Patch 3/8] vect: Fix vect_get_smallest_scalar_type for simd clones

2023-10-18 Thread Andre Vieira (lists)

Made it a local function and changed prototype according to comments.

Is this OK?

gcc/ChangeLog:

	* tree-vect-data-refs.cc (vect_get_smallest_scalar_type): Special case
	simd clone calls and only use types that are mapped to vectors.
	(simd_clone_call_p): New helper function.

On 30/08/2023 13:54, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


The vect_get_smallest_scalar_type helper function was using any argument to a
simd clone call when trying to determine the smallest scalar type that would
be vectorized.  This included the function pointer type in a MASK_CALL for
instance, and would result in the wrong type being selected.  Instead this
patch special cases simd_clone_call's and uses only scalar types of the
original function that get transformed into vector types.


Looks sensible.

+bool
+simd_clone_call_p (gimple *stmt, cgraph_node **out_node)

you could return the cgraph_node * or NULL here.  Are you going to
use the function elsewhere?  Otherwise put it in the same TU as
the only use please and avoid exporting it.

Richard.


gcc/ChangeLog:

* tree-vect-data-refs.cc (vect_get_smallest_scalar_type): Special case
simd clone calls and only use types that are mapped to vectors.
* tree-vect-stmts.cc (simd_clone_call_p): New helper function.
* tree-vectorizer.h (simd_clone_call_p): Declare new function.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-simd-clone-16f.c: Remove unnecessary differentiation
between targets with different pointer sizes.
* gcc.dg/vect/vect-simd-clone-17f.c: Likewise.
* gcc.dg/vect/vect-simd-clone-18f.c: Likewise.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c 
b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c
index 
574698d3e133ecb8700e698fa42a6b05dd6b8a18..7cd29e894d0502a59fadfe67db2db383133022d3
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-16f.c
@@ -7,9 +7,8 @@
 #include "vect-simd-clone-16.c"
 
 /* Ensure the the in-branch simd clones are used on targets that support them.
-   Some targets use pairs of vectors and do twice the calls.  */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
{ target { ! { { i?86-*-* x86_64-*-* } && { ! lp64 } } } } } } */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 4 "vect" 
{ target { { i?86*-*-* x86_64-*-* } && { ! lp64 } } } } } */
+ */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
} } */
 
 /* The LTO test produces two dump files and we scan the wrong one.  */
 /* { dg-skip-if "" { *-*-* } { "-flto" } { "" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c 
b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c
index 
8bb6d19301a67a3eebce522daaf7d54d88f708d7..177521dc44531479fca1f1a1a0f2010f30fa3fb5
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-17f.c
@@ -7,9 +7,8 @@
 #include "vect-simd-clone-17.c"
 
 /* Ensure the the in-branch simd clones are used on targets that support them.
-   Some targets use pairs of vectors and do twice the calls.  */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
{ target { ! { { i?86-*-* x86_64-*-* } && { ! lp64 } } } } } } */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 4 "vect" 
{ target { { i?86*-*-* x86_64-*-* } && { ! lp64 } } } } } */
+ */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
} } */
 
 /* The LTO test produces two dump files and we scan the wrong one.  */
 /* { dg-skip-if "" { *-*-* } { "-flto" } { "" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c 
b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c
index 
d34f23f4db8e9c237558cc22fe66b7e02b9e6c20..4dd51381d73c0c7c8ec812f24e5054df038059c5
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-18f.c
@@ -7,9 +7,8 @@
 #include "vect-simd-clone-18.c"
 
 /* Ensure the the in-branch simd clones are used on targets that support them.
-   Some targets use pairs of vectors and do twice the calls.  */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
{ target { ! { { i?86-*-* x86_64-*-* } && { ! lp64 } } } } } } */
-/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 4 "vect" 
{ target { { i?86*-*-* x86_64-*-* } && { ! lp64 } } } } } */
+ */
+/* { dg-final { scan-tree-dump-times {[\n\r] [^\n]* = foo\.simdclone} 2 "vect" 
} } */
 
 /* The LTO test produces two dump files and we scan the wrong one.  */
 /* { dg-skip-if "" { *-*

Re: [PATCH 5/8] vect: Use inbranch simdclones in masked loops

2023-10-18 Thread Andre Vieira (lists)

Rebased, needs review.

On 30/08/2023 10:13, Andre Vieira (lists) via Gcc-patches wrote:
This patch enables the compiler to use inbranch simdclones when 
generating masked loops in autovectorization.


gcc/ChangeLog:

 * omp-simd-clone.cc (simd_clone_adjust_argument_types): Make function
 compatible with mask parameters in clone.
 * tree-vect-stmts.cc (vect_convert): New helper function.
 (vect_build_all_ones_mask): Allow vector boolean typed masks.
 (vectorizable_simd_clone_call): Enable the use of masked clones in
 fully masked loops.

diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 
a42643400ddcf10961633448b49d4caafb999f12..ef0b9b48c7212900023bc0eaebca5e1f9389db77
 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -807,8 +807,14 @@ simd_clone_adjust_argument_types (struct cgraph_node *node)
 {
   ipa_adjusted_param adj;
   memset (&adj, 0, sizeof (adj));
-  tree parm = args[i];
-  tree parm_type = node->definition ? TREE_TYPE (parm) : parm;
+  tree parm = NULL_TREE;
+  tree parm_type = NULL_TREE;
+  if(i < args.length())
+   {
+ parm = args[i];
+ parm_type = node->definition ? TREE_TYPE (parm) : parm;
+   }
+
   adj.base_index = i;
   adj.prev_clone_index = i;
 
@@ -1547,7 +1553,7 @@ simd_clone_adjust (struct cgraph_node *node)
  mask = gimple_assign_lhs (g);
  g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)),
   BIT_AND_EXPR, mask,
-  build_int_cst (TREE_TYPE (mask), 1));
+  build_one_cst (TREE_TYPE (mask)));
  gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
  mask = gimple_assign_lhs (g);
}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
731acc76350cae39c899a866584068cff247183a..6e2c70c1d3970af652c1e50e41b144162884bf24
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1594,6 +1594,20 @@ check_load_store_for_partial_vectors (loop_vec_info 
loop_vinfo, tree vectype,
 }
 }
 
+/* Return SSA name of the result of the conversion of OPERAND into type TYPE.
+   The conversion statement is inserted at GSI.  */
+
+static tree
+vect_convert (vec_info *vinfo, stmt_vec_info stmt_info, tree type, tree 
operand,
+ gimple_stmt_iterator *gsi)
+{
+  operand = build1 (VIEW_CONVERT_EXPR, type, operand);
+  gassign *new_stmt = gimple_build_assign (make_ssa_name (type),
+  operand);
+  vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
+  return gimple_get_lhs (new_stmt);
+}
+
 /* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask
that needs to be applied to all loads and stores in a vectorized loop.
@@ -2547,7 +2561,8 @@ vect_build_all_ones_mask (vec_info *vinfo,
 {
   if (TREE_CODE (masktype) == INTEGER_TYPE)
 return build_int_cst (masktype, -1);
-  else if (TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
+  else if (VECTOR_BOOLEAN_TYPE_P (masktype)
+  || TREE_CODE (TREE_TYPE (masktype)) == INTEGER_TYPE)
 {
   tree mask = build_int_cst (TREE_TYPE (masktype), -1);
   mask = build_vector_from_val (masktype, mask);
@@ -4156,7 +4171,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   size_t i, nargs;
   tree lhs, rtype, ratype;
  vec<constructor_elt, va_gc> *ret_ctor_elts = NULL;
-  int arg_offset = 0;
+  int masked_call_offset = 0;
 
   /* Is STMT a vectorizable call?   */
  gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt);
@@ -4171,7 +4186,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   gcc_checking_assert (TREE_CODE (fndecl) == ADDR_EXPR);
   fndecl = TREE_OPERAND (fndecl, 0);
   gcc_checking_assert (TREE_CODE (fndecl) == FUNCTION_DECL);
-  arg_offset = 1;
+  masked_call_offset = 1;
 }
   if (fndecl == NULL_TREE)
 return false;
@@ -4199,7 +4214,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
 return false;
 
   /* Process function arguments.  */
-  nargs = gimple_call_num_args (stmt) - arg_offset;
+  nargs = gimple_call_num_args (stmt) - masked_call_offset;
 
   /* Bail out if the function has zero arguments.  */
   if (nargs == 0)
@@ -4221,7 +4236,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   thisarginfo.op = NULL_TREE;
   thisarginfo.simd_lane_linear = false;
 
-  int op_no = i + arg_offset;
+  int op_no = i + masked_call_offset;
   if (slp_node)
op_no = vect_slp_child_index_for_operand (stmt, op_no);
   if (!vect_is_simple_use (vinfo, stmt_info, slp_node,
@@ -4303,16 +4318,6 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   arginfo.quick_push (thisarginfo);
 }
 
-  if (loop_vinfo
-  && !LOOP_V
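
Two details in the hunks above are worth spelling out.  In
simd_clone_adjust_argument_types, the loop can now iterate one slot past the
original argument list (for the added mask parameter), hence the new
i < args.length () guard.  And build_int_cst (TREE_TYPE (mask), 1) becomes
build_one_cst (TREE_TYPE (mask)) because the mask may now have a vector
boolean type; build_one_cst yields the canonical one for any type.  A hedged
illustration of the latter (GCC-internal APIs assumed):

```
/* Illustrative only: build_one_cst returns the canonical "1" for the
   given type, so the same BIT_AND works whether MASK has a scalar
   integer type or a vector boolean type.  */
tree one = build_one_cst (TREE_TYPE (mask));
gassign *g = gimple_build_assign (make_ssa_name (TREE_TYPE (mask)),
				  BIT_AND_EXPR, mask, one);
```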

[PATCH 0/8] omp: Replace simd_clone_subparts with TYPE_VECTOR_SUBPARTS

2023-10-18 Thread Andre Vieira (lists)


Refactor simd clone handling code ahead of support for poly simdlen.

gcc/ChangeLog:

* omp-simd-clone.cc (simd_clone_subparts): Remove.
(simd_clone_init_simd_arrays): Replace simd_clone_subparts with
TYPE_VECTOR_SUBPARTS.
(ipa_simd_modify_function_body): Likewise.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Likewise.
(simd_clone_subparts): Remove.
diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 
c1cb7cc8a5c770940bc2032f824e084b37e96dbe..a42643400ddcf10961633448b49d4caafb999f12
 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -255,16 +255,6 @@ ok_for_auto_simd_clone (struct cgraph_node *node)
   return true;
 }
 
-
-/* Return the number of elements in vector type VECTYPE, which is associated
-   with a SIMD clone.  At present these always have a constant length.  */
-
-static unsigned HOST_WIDE_INT
-simd_clone_subparts (tree vectype)
-{
-  return TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
-}
-
 /* Allocate a fresh `simd_clone' and return it.  NARGS is the number
of arguments to reserve space for.  */
 
@@ -1028,7 +1018,7 @@ simd_clone_init_simd_arrays (struct cgraph_node *node,
}
  continue;
}
-  if (known_eq (simd_clone_subparts (TREE_TYPE (arg)),
+  if (known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)),
node->simdclone->simdlen))
{
  tree ptype = build_pointer_type (TREE_TYPE (TREE_TYPE (array)));
@@ -1040,7 +1030,7 @@ simd_clone_init_simd_arrays (struct cgraph_node *node,
}
   else
{
- unsigned int simdlen = simd_clone_subparts (TREE_TYPE (arg));
+ poly_uint64 simdlen = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg));
  unsigned int times = vector_unroll_factor (node->simdclone->simdlen,
 simdlen);
  tree ptype = build_pointer_type (TREE_TYPE (TREE_TYPE (array)));
@@ -1226,9 +1216,9 @@ ipa_simd_modify_function_body (struct cgraph_node *node,
  iter, NULL_TREE, NULL_TREE);
   adjustments->register_replacement (&(*adjustments->m_adj_params)[j], r);
 
-  if (multiple_p (node->simdclone->simdlen, simd_clone_subparts (vectype)))
+  if (multiple_p (node->simdclone->simdlen, TYPE_VECTOR_SUBPARTS 
(vectype)))
j += vector_unroll_factor (node->simdclone->simdlen,
-  simd_clone_subparts (vectype)) - 1;
+  TYPE_VECTOR_SUBPARTS (vectype)) - 1;
 }
   adjustments->sort_replacements ();
 
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
9bb43e98f56d18929c9c02227954fdf38eafefd8..a9156975d64c7a335ffd27614e87f9d11b23d1ba
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4126,16 +4126,6 @@ vect_simd_lane_linear (tree op, class loop *loop,
 }
 }
 
-/* Return the number of elements in vector type VECTYPE, which is associated
-   with a SIMD clone.  At present these vectors always have a constant
-   length.  */
-
-static unsigned HOST_WIDE_INT
-simd_clone_subparts (tree vectype)
-{
-  return TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
-}
-
 /* Function vectorizable_simd_clone_call.
 
Check if STMT_INFO performs a function call that can be vectorized
@@ -4429,7 +4419,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
slp_node);
  if (arginfo[i].vectype == NULL
  || !constant_multiple_p (bestn->simdclone->simdlen,
-  simd_clone_subparts 
(arginfo[i].vectype)))
+  TYPE_VECTOR_SUBPARTS 
(arginfo[i].vectype)))
return false;
}
 
@@ -4444,10 +4434,11 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
 
   if (bestn->simdclone->args[i].arg_type == SIMD_CLONE_ARG_TYPE_MASK)
{
+ tree clone_arg_vectype = bestn->simdclone->args[i].vector_type;
  if (bestn->simdclone->mask_mode == VOIDmode)
{
- if (simd_clone_subparts (bestn->simdclone->args[i].vector_type)
- != simd_clone_subparts (arginfo[i].vectype))
+ if (maybe_ne (TYPE_VECTOR_SUBPARTS (clone_arg_vectype),
+   TYPE_VECTOR_SUBPARTS (arginfo[i].vectype)))
{
  /* FORNOW we only have partial support for vector-type masks
 that can't hold all of simdlen. */
@@ -4464,7 +4455,7 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
  if (!SCALAR_INT_MODE_P (TYPE_MODE (arginfo[i].vectype))
  || maybe_ne (exact_div (bestn->simdclone->simdlen,
  num_mask_args),
-  simd_clone_subparts (arginfo[i].vectype)))
+  TYPE_VECTOR_SUBPARTS (arginfo[i].vectype)))
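
The pattern throughout this patch is the same: keep the element count as a
poly_uint64 instead of forcing it to a compile-time constant with
to_constant (), and compare with the poly-int helpers.  A minimal sketch of
the idiom (illustrative only; same_subparts_p is an invented name, not from
the patch):

```
/* Illustrative only: known_eq means "provably equal for every runtime
   vector length"; maybe_ne means "can differ for at least one".  */
static bool
same_subparts_p (tree vectype1, tree vectype2)
{
  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (vectype1);
  poly_uint64 n2 = TYPE_VECTOR_SUBPARTS (vectype2);
  return known_eq (n1, n2);
}
```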

Re: [PATCH 4/8] vect: don't allow fully masked loops with non-masked simd clones [PR 110485]

2023-10-18 Thread Andre Vieira (lists)
Rebased on top of trunk; a minor change checks for loop_vinfo, since we
now do some SLP vectorization of simd clones.


I assume the previous OK still holds.

On 30/08/2023 13:54, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:


When analyzing a loop and choosing a simdclone to use it is possible to choose
a simdclone that cannot be used 'inbranch' for a loop that can use partial
vectors.  This may lead to the vectorizer deciding to use partial vectors
which are not supported for notinbranch simd clones. This patch fixes that by
disabling the use of partial vectors once a notinbranch simd clone has been
selected.


OK.


gcc/ChangeLog:

PR tree-optimization/110485
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Disable partial
vectors usage if a notinbranch simdclone has been selected.

gcc/testsuite/ChangeLog:

* gcc.dg/gomp/pr110485.c: New test.

diff --git a/gcc/testsuite/gcc.dg/gomp/pr110485.c 
b/gcc/testsuite/gcc.dg/gomp/pr110485.c
new file mode 100644
index 
0000000000000000000000000000000000000000..ba6817a127f40246071e32ccebf692cc4d121d15
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/pr110485.c
@@ -0,0 +1,19 @@
+/* PR 110485 */
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -fdump-tree-vect-details" } */
+/* { dg-additional-options "-march=znver4 --param=vect-partial-vector-usage=1" 
{ target x86_64-*-* } } */
+#pragma omp declare simd notinbranch uniform(p)
+extern double __attribute__ ((const)) bar (double a, double p);
+
+double a[1024];
+double b[1024];
+
+void foo (int n)
+{
+  #pragma omp simd
+  for (int i = 0; i < n; ++i)
+a[i] = bar (b[i], 71.2);
+}
+
+/* { dg-final { scan-tree-dump-not "MASK_LOAD" "vect" } } */
+/* { dg-final { scan-tree-dump "can't use a fully-masked loop because a 
non-masked simd clone was selected." "vect" { target x86_64-*-* } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
a9156975d64c7a335ffd27614e87f9d11b23d1ba..731acc76350cae39c899a866584068cff247183a
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -4539,6 +4539,17 @@ vectorizable_simd_clone_call (vec_info *vinfo, 
stmt_vec_info stmt_info,
   ? boolean_true_node : boolean_false_node;
simd_clone_info.safe_push (sll);
  }
+
+  if (!bestn->simdclone->inbranch && loop_vinfo)
+   {
+ if (dump_enabled_p ()
+ && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+   dump_printf_loc (MSG_NOTE, vect_location,
+"can't use a fully-masked loop because a"
+" non-masked simd clone was selected.\n");
+ LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+   }
+
   STMT_VINFO_TYPE (stmt_info) = call_simd_clone_vec_info_type;
   DUMP_VECT_SCOPE ("vectorizable_simd_clone_call");
 /*  vect_model_simple_cost (vinfo, stmt_info, ncopies,


Re: [PATCH 8/8] aarch64: Add SVE support for simd clones [PR 96342]

2023-10-18 Thread Andre Vieira (lists)

Rebased, no major changes, still needs review.

On 30/08/2023 10:19, Andre Vieira (lists) via Gcc-patches wrote:
This patch finalizes adding support for the generation of SVE simd
clones when no simdlen is provided, following the ABI rules, under which
the widest data type determines the minimum number of elements in a
length-agnostic vector.
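
Concretely, such clones get a length-agnostic simdlen, which
simd_clone_mangle encodes as 'x' (see the ChangeLog below).  Under the
AArch64 vector function ABI, a masked SVE clone of a function foo taking one
vector parameter would be named along these lines (illustrative example, not
taken from the patch):

```
_ZGVsMxv_foo
    s = SVE, M = masked (inbranch), x = VLA simdlen, v = vector parameter
```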


gcc/ChangeLog:

* config/aarch64/aarch64-protos.h (add_sve_type_attribute): Declare.
* config/aarch64/aarch64-sve-builtins.cc (add_sve_type_attribute): Make
visibility global.
* config/aarch64/aarch64.cc (aarch64_fntype_abi): Ensure SVE ABI is
chosen over SIMD ABI if a SVE type is used in return or arguments.
(aarch64_simd_clone_compute_vecsize_and_simdlen): Create VLA simd clone
when no simdlen is provided, according to ABI rules.
(aarch64_simd_clone_adjust): Add '+sve' attribute to SVE simd clones.
(aarch64_simd_clone_adjust_ret_or_param): New.
(TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM): Define.
* omp-simd-clone.cc (simd_clone_mangle): Print 'x' for VLA simdlen.
(simd_clone_adjust): Adapt safelen check to be compatible with VLA
simdlen.

gcc/testsuite/ChangeLog:

 * c-c++-common/gomp/declare-variant-14.c: Adapt aarch64 scan.
 * gfortran.dg/gomp/declare-variant-14.f90: Likewise.
 * gcc.target/aarch64/declare-simd-1.c: Remove warning checks where no
 longer necessary.
* gcc.target/aarch64/declare-simd-2.c: Add SVE clone scan.

diff --git a/gcc/config/aarch64/aarch64-protos.h 
b/gcc/config/aarch64/aarch64-protos.h
index 
60a55f4bc1956786ea687fc7cad7ec9e4a84e1f0..769d637f63724a7f0044f48f3dd683e0fb46049c
 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1005,6 +1005,8 @@ namespace aarch64_sve {
 #ifdef GCC_TARGET_H
   bool verify_type_context (location_t, type_context_kind, const_tree, bool);
 #endif
+ void add_sve_type_attribute (tree, unsigned int, unsigned int,
+ const char *, const char *);
 }
 
 extern void aarch64_split_combinev16qi (rtx operands[3]);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 
161a14edde7c9fb1b13b146cf50463e2d78db264..6f99c438d10daa91b7e3b623c995489f1a8a0f4c
 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -569,14 +569,16 @@ static bool reported_missing_registers_p;
 /* Record that TYPE is an ABI-defined SVE type that contains NUM_ZR SVE vectors
and NUM_PR SVE predicates.  MANGLED_NAME, if nonnull, is the ABI-defined
 mangling of the type.  ACLE_NAME is the <arm_sve.h> name of the type.  */
-static void
+void
 add_sve_type_attribute (tree type, unsigned int num_zr, unsigned int num_pr,
const char *mangled_name, const char *acle_name)
 {
   tree mangled_name_tree
 = (mangled_name ? get_identifier (mangled_name) : NULL_TREE);
+  tree acle_name_tree
+= (acle_name ? get_identifier (acle_name) : NULL_TREE);
 
-  tree value = tree_cons (NULL_TREE, get_identifier (acle_name), NULL_TREE);
+  tree value = tree_cons (NULL_TREE, acle_name_tree, NULL_TREE);
   value = tree_cons (NULL_TREE, mangled_name_tree, value);
   value = tree_cons (NULL_TREE, size_int (num_pr), value);
   value = tree_cons (NULL_TREE, size_int (num_zr), value);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
37507f091c2a6154fa944c3a9fad6a655ab5d5a1..cb0947b18c6a611d55579b5b08d93f6a4a9c3b2c
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -4080,13 +4080,13 @@ aarch64_takes_arguments_in_sve_regs_p (const_tree 
fntype)
 static const predefined_function_abi &
 aarch64_fntype_abi (const_tree fntype)
 {
-  if (lookup_attribute ("aarch64_vector_pcs", TYPE_ATTRIBUTES (fntype)))
-return aarch64_simd_abi ();
-
   if (aarch64_returns_value_in_sve_regs_p (fntype)
   || aarch64_takes_arguments_in_sve_regs_p (fntype))
 return aarch64_sve_abi ();
 
+  if (lookup_attribute ("aarch64_vector_pcs", TYPE_ATTRIBUTES (fntype)))
+return aarch64_simd_abi ();
+
   return default_function_abi;
 }
 
@@ -27467,7 +27467,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct 
cgraph_node *node,
int num, bool explicit_p)
 {
   tree t, ret_type;
-  unsigned int nds_elt_bits;
+  unsigned int nds_elt_bits, wds_elt_bits;
   int count;
   unsigned HOST_WIDE_INT const_simdlen;
 
@@ -27513,10 +27513,14 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
(struct cgraph_node *node,
   if (TREE_CODE (ret_type) != VOID_TYPE)
 {
   nds_elt_bits = lane_size (SIMD_CLONE_ARG_TYPE_VECTOR, ret_type);
+  wds_elt_bits = nds_elt_bits;
   vec_elts.safe_push (std::make_pair (ret_type, nds_elt_bits));
 }
   else
-nds_elt_bits = POINTER_SIZE;
+{
+  nds_elt_bits = POINTER_SIZE;
+  wds_elt_bits = 0;
+}
 
   int i;
   tre

[PATCH6/8] omp: Reorder call for TARGET_SIMD_CLONE_ADJUST (was Re: [PATCH7/8] vect: Add TARGET_SIMD_CLONE_ADJUST_RET_OR_PARAM)

2023-10-18 Thread Andre Vieira (lists)
This patch moves the call to TARGET_SIMD_CLONE_ADJUST until after the 
arguments and return types have been transformed into vector types.  It 
also constructs the adjustments and retval modifications after this call, 
allowing targets to alter the types of the arguments and return value of 
the clone prior to the modifications to the function definition.


Is this OK?

gcc/ChangeLog:

* omp-simd-clone.cc (simd_clone_adjust_return_type): Hoist out
code to create return array and don't return new type.
(simd_clone_adjust_argument_types): Hoist out code that creates
ipa_param_body_adjustments and don't return them.
(simd_clone_adjust): Call TARGET_SIMD_CLONE_ADJUST after return
and argument types have been vectorized, create adjustments and
return array after the hook.
(expand_simd_clones): Call TARGET_SIMD_CLONE_ADJUST after return
and argument types have been vectorized.
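
In outline, the resulting call order in simd_clone_adjust is sketched below
(GCC-internal names; a simplification of the change described above, not the
literal code):

```
simd_clone_adjust_return_type (node);     /* scalar -> vector return.  */
simd_clone_adjust_argument_types (node);  /* scalar -> vector params.  */
targetm.simd_clone.adjust (node);         /* target may retype either.  */
/* Only now are the ipa_param_body_adjustments and the return-value
   simd array built, so both reflect any target-made type changes.  */
```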

On 04/10/2023 13:40, Andre Vieira (lists) wrote:



On 04/10/2023 11:41, Richard Biener wrote:

On Wed, 4 Oct 2023, Andre Vieira (lists) wrote:




On 30/08/2023 14:04, Richard Biener wrote:

On Wed, 30 Aug 2023, Andre Vieira (lists) wrote:

This patch adds a new target hook to enable us to adapt the types of
return and parameters of simd clones.  We use this in two ways, the first
one is to make sure we can create valid SVE types, including the SVE type
attribute, when creating a SVE simd clone, even when the target options do
not support SVE.  We are following the same behaviour seen with x86 that
creates simd clones according to the ABI rules when no simdlen is
provided, even if that simdlen is not supported by the current target
options.  Note that this doesn't mean the simd clone will be used in
auto-vectorization.

You are not documenting the bool parameter of the new hook.

What's wrong with doing the adjustment in TARGET_SIMD_CLONE_ADJUST?


simd_clone_adjust_argument_types is called after that hook, so by the
time we call TARGET_SIMD_CLONE_ADJUST the types are still scalar, not
vector.  The same is true for the return type one.

Also the changes to the types need to be taken into consideration in
'adjustments' I think.


Nothing in the three existing implementations of TARGET_SIMD_CLONE_ADJUST
relies on this ordering I think, how about moving the hook invocation
after simd_clone_adjust_argument_types?



But that wouldn't change the 'ipa_param_body_adjustments' for when we 
have a function definition and we need to redo the body.

Richard.

PS: I hope the subject line survived, my email client is having a bit of a
wobble this morning... it's what you get for updating software :(

diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 
ef0b9b48c7212900023bc0eaebca5e1f9389db77..fb80888190c88e29895ecfbbe1b17d390c9a9dfe
 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -701,10 +701,9 @@ simd_clone_create (struct cgraph_node *old_node, bool 
force_local)
 }
 
 /* Adjust the return type of the given function to its appropriate
-   vector counterpart.  Returns a simd array to be used throughout the
-   function as a return value.  */
+   vector counterpart.  */
 
-static tree
+static void
 simd_clone_adjust_return_type (struct cgraph_node *node)
 {
   tree fndecl = node->decl;
@@ -714,7 +713,7 @@ simd_clone_adjust_return_type (struct cgraph_node *node)
 
   /* Adjust the function return type.  */
   if (orig_rettype == void_type_node)
-return NULL_TREE;
+return;
   t = TREE_TYPE (TREE_TYPE (fndecl));
   if (INTEGRAL_TYPE_P (t) || POINTER_TYPE_P (t))
 veclen = node->simdclone->vecsize_int;
@@ -737,24 +736,6 @@ simd_clone_adjust_return_type (struct cgraph_node *node)
veclen));
 }
   TREE_TYPE (TREE_TYPE (fndecl)) = t;
-  if (!node->definition)
-return NULL_TREE;
-
-  t = DECL_RESULT (fndecl);
-  /* Adjust the DECL_RESULT.  */
-  gcc_assert (TREE_TYPE (t) != void_type_node);
-  TREE_TYPE (t) = TREE_TYPE (TREE_TYPE (fndecl));
-  relayout_decl (t);
-
-  tree atype = build_array_type_nelts (orig_rettype,
-  node->simdclone->simdlen);
-  if (maybe_ne (veclen, node->simdclone->simdlen))
-return build1 (VIEW_CONVERT_EXPR, atype, t);
-
-  /* Set up a SIMD array to use as the return value.  */
-  tree retval = create_tmp_var_raw (atype, "retval");
-  gimple_add_tmp_var (retval);
-  return retval;
 }
 
 /* Each vector argument has a corresponding array to be used locally
@@ -788,7 +769,7 @@ create_tmp_simd_array (const char *prefix, tree type, 
poly_uint64 simdlen)
declarations will be remapped.  New arguments which are not to be remapped
are marked with USER_FLAG.  */
 
-static ipa_param_body_adjustments *
+static void
 simd_clone_adjust_argument_types (struct cgraph_node *node)
 {
  auto_vec<tree> args;
@@ -79

Re: Add support for dumping multiple dump files under one name

2018-07-03 Thread Andre Vieira (lists)
On 29/06/18 11:13, David Malcolm wrote:
> On Fri, 2018-06-29 at 10:15 +0200, Richard Biener wrote:
>> On Fri, 22 Jun 2018, Jan Hubicka wrote:
>>
>>> Hi,
>>> this patch adds dumpfile support for dumps that come in multiple
>>> parts.  This
>>> is needed for WPA stream-out dump since we stream partitions in
>>> parallel and
>>> the dumps would come up in random order.  Parts are added by new
>>> parameter that
>>> is initialized by default to -1 (no parts). 
>>>
>>> One thing I skipped is any support for duplicated opening of file
>>> with parts since I do not need it.
>>>
>>> Bootstrapped/regtested x86_64-linux, OK?
>>
>> Looks reasonable - David, anything you want to add / have changed?
> 
> No worries from my side; I don't think it interacts with the
> optimization records stuff I'm working on - presumably this is just for
> dumping the WPA stream-out, rather than for dumping specific
> optimizations.
> 
> [...snip...]
> 
> Dave
> 

Hi David,

I believe r262245 is causing the following failures on aarch64 and arm:

FAIL: gcc.dg/vect/slp-perm-1.c -flto -ffat-lto-objects  scan-tree-dump
vect "note: Built SLP cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-1.c scan-tree-dump vect "note: Built SLP
cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-2.c -flto -ffat-lto-objects  scan-tree-dump
vect "note: Built SLP cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-2.c scan-tree-dump vect "note: Built SLP
cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-3.c -flto -ffat-lto-objects  scan-tree-dump
vect "note: Built SLP cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-3.c scan-tree-dump vect "note: Built SLP
cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-5.c -flto -ffat-lto-objects  scan-tree-dump
vect "note: Built SLP cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-5.c scan-tree-dump vect "note: Built SLP
cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-6.c -flto -ffat-lto-objects  scan-tree-dump
vect "note: Built SLP cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-6.c scan-tree-dump vect "note: Built SLP
cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-7.c -flto -ffat-lto-objects  scan-tree-dump
vect "note: Built SLP cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-7.c scan-tree-dump vect "note: Built SLP
cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-8.c -flto -ffat-lto-objects  scan-tree-dump
vect "note: Built SLP cancelled: can use load/store-lanes"
FAIL: gcc.dg/vect/slp-perm-8.c scan-tree-dump vect "note: Built SLP
cancelled: can use load/store-lanes"

Could you please have a look?

Cheers,
Andre



Re: Add support for dumping multiple dump files under one name

2018-07-03 Thread Andre Vieira (lists)
On 03/07/18 15:15, David Malcolm wrote:
> On Tue, 2018-07-03 at 11:00 +0100, Andre Vieira (lists) wrote:
>> On 29/06/18 11:13, David Malcolm wrote:
>>> On Fri, 2018-06-29 at 10:15 +0200, Richard Biener wrote:
>>>> On Fri, 22 Jun 2018, Jan Hubicka wrote:
>>>>
>>>>> Hi,
>>>>> this patch adds dumpfile support for dumps that come in
>>>>> multiple
>>>>> parts.  This
>>>>> is needed for WPA stream-out dump since we stream partitions in
>>>>> parallel and
>>>>> the dumps would come up in random order.  Parts are added by
>>>>> new
>>>>> parameter that
>>>>> is initialized by default to -1 (no parts). 
>>>>>
>>>>> One thing I skipped is any support for duplicated opening of
>>>>> file
>>>>> with parts since I do not need it.
>>>>>
>>>>> Bootstrapped/regtested x86_64-linux, OK?
>>>>
>>>> Looks reasonable - David, anything you want to add / have
>>>> changed?
>>>
>>> No worries from my side; I don't think it interacts with the
>>> optimization records stuff I'm working on - presumably this is just
>>> for
>>> dumping the WPA stream-out, rather than for dumping specific
>>> optimizations.
>>>
>>> [...snip...]
>>>
>>> Dave
>>>
>>
>> Hi David,
>>
>> I believe r262245 is causing the following failures on aarch64 and
>> arm:
>>
>> FAIL: gcc.dg/vect/slp-perm-1.c -flto -ffat-lto-objects  scan-tree-
>> dump
>> vect "note: Built SLP cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-1.c scan-tree-dump vect "note: Built SLP
>> cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-2.c -flto -ffat-lto-objects  scan-tree-
>> dump
>> vect "note: Built SLP cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-2.c scan-tree-dump vect "note: Built SLP
>> cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-3.c -flto -ffat-lto-objects  scan-tree-
>> dump
>> vect "note: Built SLP cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-3.c scan-tree-dump vect "note: Built SLP
>> cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-5.c -flto -ffat-lto-objects  scan-tree-
>> dump
>> vect "note: Built SLP cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-5.c scan-tree-dump vect "note: Built SLP
>> cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-6.c -flto -ffat-lto-objects  scan-tree-
>> dump
>> vect "note: Built SLP cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-6.c scan-tree-dump vect "note: Built SLP
>> cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-7.c -flto -ffat-lto-objects  scan-tree-
>> dump
>> vect "note: Built SLP cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-7.c scan-tree-dump vect "note: Built SLP
>> cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-8.c -flto -ffat-lto-objects  scan-tree-
>> dump
>> vect "note: Built SLP cancelled: can use load/store-lanes"
>> FAIL: gcc.dg/vect/slp-perm-8.c scan-tree-dump vect "note: Built SLP
>> cancelled: can use load/store-lanes"
>>
>> Could you please have a look?
> 
> Sorry about this; I think my r262246 ("dumpfile.c: add indentation via
> DUMP_VECT_SCOPE") caused this.
> 
> Does https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00122.html help?

Yes it does. Thank you!

Andre
> 
> Dave
> 


