On 23/06/2023 11:23, Andre Vieira (lists) wrote:
+  if (insn != arm_mve_get_loop_vctp (body))
+    {

It's probably a good idea to invert the condition here and return false; that helps reduce the indentation in this function.

Done, thanks
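i.e. the check now takes the early-return shape, roughly (sketch of the shape only, not the exact hunk):

if (insn == arm_mve_get_loop_vctp (body))
  return false;
/* ... rest of the function, now one level less indented ...  */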




+    /* Starting from the current insn, scan backwards through the insn
+       chain until BB_HEAD: "for each insn in the BB prior to the current".
+    */

There's trailing whitespace after "insn", but I'd also rewrite this bit. The "for each insn in the BB prior to the current" part is superfluous and even confusing to me. How about: "Scan backwards from the current INSN through the instruction chain until the start of the basic block.  "
Yes, agreed, it wasn't very clear. Done.


 I find 'that previous insn' to be confusing as you don't mention any previous insn before. So how about something along the lines of: 'If a previous insn defines a register that INSN uses then return true if...'
Done


Do we need to check: 'insn != prev_insn' ? Any reason why you can't start the loop with:
'for (rtx_insn *prev_insn = PREV_INSN (insn);'

True! Done.
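The backwards scan now starts at PREV_INSN (insn) and walks back to the start of the BB, roughly like this (a sketch of the shape only, assuming BODY is the loop's basic block and INSN the instruction being checked, as in the surrounding function):

for (rtx_insn *prev_insn = PREV_INSN (insn);
     prev_insn && prev_insn != PREV_INSN (BB_HEAD (body));
     prev_insn = PREV_INSN (prev_insn))
  {
    if (!NONDEBUG_INSN_P (prev_insn))
      continue;
    /* ... look at the DEFs of prev_insn here ...  */
  }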


Now I also found a case where things might go wrong in:
+        /* Look at all the DEFs of that previous insn: if one of them is on
+           the same REG as our current insn, then recurse in order to check
+           that insn's USEs.  If any of these insns return true as
+           MVE_VPT_UNPREDICATED_INSN_Ps, then the whole chain is affected
+           by the change in behaviour from being placed in dlstp/letp loop.
+        */
+        df_ref prev_insn_defs = NULL;
+        FOR_EACH_INSN_DEF (prev_insn_defs, prev_insn)
+          {
+            if (DF_REF_REGNO (insn_uses) == DF_REF_REGNO (prev_insn_defs)
+                && insn != prev_insn
+                && body == BLOCK_FOR_INSN (prev_insn)
+                && !arm_mve_vec_insn_is_predicated_with_this_predicate
+                     (insn, vctp_vpr_generated)
+                && arm_mve_check_df_chain_back_for_implic_predic
+                     (prev_insn, vctp_vpr_generated))
+              return true;
+          }

The body == BLOCK_FOR_INSN (prev_insn) check hinted at it: if a def comes from outside of the BB (so outside of the loop's body) then it is by definition unpredicated by vctp.  I think you want to check that if prev_insn defines a register used by insn, then return true if prev_insn isn't in the same BB or has a chain that is not predicated, i.e. alongside '!arm_mve_vec_insn_is_predicated_with_this_predicate (insn, vctp_vpr_generated) && arm_mve_check_df_chain_back_for_implic_predic (prev_insn, vctp_vpr_generated)' you also check 'body != BLOCK_FOR_INSN (prev_insn)'.

Yes, you're right, this is vulnerable here. A neater fix for this (I think?) is to make the REGNO_REG_SET_P check mentioned above more generic, so that it covers all scalar values and scalar ops as well. Then it becomes an "if this insn in the loop has any input that originates outside the BB, then it's unsafe" check, and the backwards recursive loop is only used for the recursive "are any previous insns unsafe" check.
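Roughly what I have in mind, as a sketch (assuming the loop body's live-in set is what the existing REGNO_REG_SET_P check is based on, and reusing the INSN/BODY names from the surrounding function; the actual patch code may differ):

/* If any register used by INSN is live on entry to the loop body,
   i.e. it may be set outside the loop, treat INSN as unsafe.  */
df_ref insn_uses = NULL;
FOR_EACH_INSN_USE (insn_uses, insn)
  if (REGNO_REG_SET_P (DF_LR_IN (body), DF_REF_REGNO (insn_uses)))
    return true;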




I also found some other issues. This currently gets turned into a dlstp/letp low-overhead loop:

uint16_t test (uint16_t *a, int n)
{
  uint16_t res = 0;
  while (n > 0)
    {
      mve_pred16_t p = vctp16q (n);
      uint16x8_t va = vldrhq_u16 (a);
      res = vaddvaq_u16 (res, va);
      res = vaddvaq_p_u16 (res, va, p);
      a += 8;
      n -= 8;
    }
  return res;
}

But it shouldn't; this is because there's no handling of across-vector instructions. Luckily, in MVE all across-vector instructions have the side-effect that they write to a scalar register; even the vshlcq instruction does (it writes to a scalar carry output).

Added support for them (you were right, there was some special handling needed!)
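The check I added is along these lines (a hypothetical sketch with a made-up helper name, relying on the fact that MVE across-vector instructions always write a scalar result; the actual patch code differs):

/* Hypothetical sketch: an MVE vector insn that also writes a non-vector
   (scalar) register is an across-vector operation, e.g. vaddvaq, or
   vshlcq via its carry output.  A real implementation would also need
   to skip defs of the VPR and the condition codes.  */
static bool
arm_mve_insn_has_scalar_def_p (rtx_insn *insn)
{
  df_ref def;
  FOR_EACH_INSN_DEF (def, insn)
    if (!VECTOR_MODE_P (GET_MODE (DF_REF_REG (def))))
      return true;
  return false;
}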



Doing this led me to find an ICE with:

uint16x8_t test (uint16_t *a, int n)
{
  uint16x8_t res = vdupq_n_u16 (0);
  while (n > 0)
    {
      uint16_t carry = 0;
      mve_pred16_t p = vctp16q (n);
      uint16x8_t va = vldrhq_u16 (a);
      res = vshlcq_u16 (va, &carry, 1);
      res = vshlcq_m_u16 (res, &carry, 1, p);
      a += 8;
      n -= 8;
    }
  return res;
}

This is because:
+          /* If the USE is outside the loop body bb, or it is inside, but
+             is an unpredicated store to memory.  */
+          if (BLOCK_FOR_INSN (insn) != BLOCK_FOR_INSN (next_use_insn)
+              || (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate
+                    (next_use_insn, vctp_vpr_generated)
+                  && mve_memory_operand
+                       (SET_DEST (single_set (next_use_insn)),
+                        GET_MODE (SET_DEST (single_set (next_use_insn))))))
+            return true;

Assumes single_set doesn't return 0.

Thanks! That is indeed correct.

Corrected this by adding a utility function that scans the insn's operands and checks them against mve_memory_operand; it supports any number of operands/SETs in the insn.
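In sketch form, that utility looks something like this (hypothetical name and shape; the real patch code may differ):

/* Return true if any operand of INSN is an MVE memory operand.
   Works for insns with any number of operands/SETs.  */
static bool
arm_insn_has_mve_memory_operand_p (rtx_insn *insn)
{
  if (!insn || recog_memoized (insn) < 0)
    return false;
  extract_insn (insn);
  for (int i = 0; i < recog_data.n_operands; i++)
    if (mve_memory_operand (recog_data.operand[i],
                            GET_MODE (recog_data.operand[i])))
      return true;
  return false;
}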


Let's deal with these issues and I'll continue to review.

On 15/06/2023 12:47, Stamatis Markianos-Wright via Gcc-patches wrote:
     Hi all,

     This is the 2/2 patch that contains the functional changes needed
     for MVE Tail Predicated Low Overhead Loops.  See my previous email
     for a general introduction of MVE LOLs.

     This support is added through the already existing loop-doloop
     mechanisms that are used for non-MVE dls/le looping.
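
     For illustration, a loop transformed this way uses dlstp/letp rather
     than dls/le; roughly like the following (hand-written example, not
     actual compiler output):

             dlstp.16   lr, r1        @ start tail-predicated loop over r1 elements
     .Lloop:
             vldrh.u16  q0, [r0], #16
             ...
             letp       lr, .Lloop    @ decrement by lane count, loop and predicate the tail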

     Mid-end changes are:

     1) Relax the loop-doloop mechanism in the mid-end to allow for
        decrement numbers other than -1 and for `count` to be an
        rtx containing a simple REG (which in this case will contain
        the number of elements to be processed), rather
        than an expression for calculating the number of iterations.
     2) Added a new df utility function: `df_bb_regno_only_def_find` that
        will return the DEF of a REG only if it is DEF-ed once within the
        basic block.
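
        A sketch of what this helper might look like, modelled on the
        existing df_bb_regno_first_def_find in df-core.cc (the actual
        implementation in the patch may differ):

        df_ref
        df_bb_regno_only_def_find (basic_block bb, unsigned int regno)
        {
          df_ref only_def = NULL;
          rtx_insn *insn;
          FOR_BB_INSNS (bb, insn)
            if (INSN_P (insn))
              {
                df_ref def;
                FOR_EACH_INSN_DEF (def, insn)
                  if (DF_REF_REGNO (def) == regno)
                    {
                      /* More than one DEF of REGNO in this BB: give up.  */
                      if (only_def)
                        return NULL;
                      only_def = def;
                    }
              }
          return only_def;
        }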

     And many things in the backend to implement the above optimisation:

     3)  Implement the `arm_predict_doloop_p` target hook to instruct the
         mid-end about Low Overhead Loops (MVE or not), as well as
         `arm_loop_unroll_adjust` which will prevent unrolling of any loops
         that are valid for becoming MVE Tail-Predicated Low Overhead Loops
         (unrolling can transform a loop in ways that invalidate the dlstp/
         letp transformation logic, and the benefit of the dlstp/letp loop
         would be considerably higher than that of unrolling).
     4)  Appropriate changes to the define_expand of doloop_end, new
         patterns for dlstp and letp, new iterators,  unspecs, etc.
     5) `arm_mve_loop_valid_for_dlstp` and a number of checking functions:
        * `arm_mve_dlstp_check_dec_counter`
        * `arm_mve_dlstp_check_inc_counter`
        * `arm_mve_check_reg_origin_is_num_elems`
        * `arm_mve_check_df_chain_back_for_implic_predic`
        * `arm_mve_check_df_chain_fwd_for_implic_predic_impact`
        These all, in some way or another, run checks on the loop
        structure in order to determine if the loop is valid for the
        dlstp/letp transformation.
     6) `arm_attempt_dlstp_transform`: (called from the define_expand of
         doloop_end) this function re-checks for the loop's suitability for
         dlstp/letp transformation and then implements it, if possible.
     7) Various utility functions:
        * `arm_mve_get_vctp_lanes` to map from vctp unspecs to the number
          of lanes.
        * `arm_get_required_vpr_reg` to check an insn to see if it requires
          the VPR or not, plus the `arm_get_required_vpr_reg_param` and
          `arm_get_required_vpr_reg_ret_val` variants.
        * `arm_mve_get_loop_vctp`
        * `arm_emit_mve_unpredicated_insn_to_seq`
        * `arm_mve_vec_insn_is_predicated_with_this_predicate`
        * `arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate`

     No regressions on arm-none-eabi with various targets and on
     aarch64-none-elf. Thoughts on getting this into trunk?

     Thank you,
     Stam Markianos-Wright

     gcc/ChangeLog:

             * config/arm/arm-protos.h (arm_target_insn_ok_for_lob): Rename to...
             (arm_target_bb_ok_for_lob): ...this.
             (arm_attempt_dlstp_transform): New.
             * config/arm/arm.cc (TARGET_LOOP_UNROLL_ADJUST): New.
             (TARGET_PREDICT_DOLOOP_P): New.
             (arm_block_set_vect):
             (arm_target_insn_ok_for_lob): Rename to...
             (arm_target_bb_ok_for_lob): ...this.
             (arm_mve_get_vctp_lanes): New.
             (arm_get_required_vpr_reg): New.
             (arm_get_required_vpr_reg_param): New.
             (arm_get_required_vpr_reg_ret_val): New.
             (arm_mve_get_loop_vctp): New.
             (arm_mve_vec_insn_is_unpredicated_or_uses_other_predicate): New.
             (arm_mve_vec_insn_is_predicated_with_this_predicate): New.
             (arm_mve_check_df_chain_back_for_implic_predic): New.
             (arm_mve_check_df_chain_fwd_for_implic_predic_impact): New.
             (arm_mve_check_reg_origin_is_num_elems): New.
             (arm_mve_dlstp_check_inc_counter): New.
             (arm_mve_dlstp_check_dec_counter): New.
             (arm_mve_loop_valid_for_dlstp): New.
             (arm_predict_doloop_p): New.
             (arm_loop_unroll_adjust): New.
             (arm_emit_mve_unpredicated_insn_to_seq): New.
             (arm_attempt_dlstp_transform): New.
             * config/arm/iterators.md (DLSTP): New.
             (mode1): Add DLSTP mappings.
             * config/arm/mve.md (*predicated_doloop_end_internal): New.
             (dlstp<mode1>_insn): New.
             * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
             * config/arm/unspecs.md: New unspecs.
             * df-core.cc (df_bb_regno_only_def_find): New.
             * df.h (df_bb_regno_only_def_find): New.
             * loop-doloop.cc (doloop_condition_get): Relax conditions.
             (doloop_optimize): Add support for elementwise LoLs.

     gcc/testsuite/ChangeLog:

             * gcc.target/arm/lob.h: Update framework.
             * gcc.target/arm/lob1.c: Likewise.
             * gcc.target/arm/lob6.c: Likewise.
             * gcc.target/arm/mve/dlstp-compile-asm.c: New test.
             * gcc.target/arm/mve/dlstp-int16x8.c: New test.
             * gcc.target/arm/mve/dlstp-int32x4.c: New test.
             * gcc.target/arm/mve/dlstp-int64x2.c: New test.
             * gcc.target/arm/mve/dlstp-int8x16.c: New test.
             * gcc.target/arm/mve/dlstp-invalid-asm.c: New test.
