Hi Jeff, Robin,
Thanks for your comments.
On 30/03/2025 01:30, Jeff Law wrote:
On 3/27/25 1:39 PM, Robin Dapp wrote:
Hi Paul-Antoine,
This pattern enables the combine pass to merge a vec_duplicate into a
plus-mult
or minus-mult RTL instruction.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v6,fa0
vfmadd.vv v9,v6,v7
After, we get only one:
vfmadd.vf v9,fa0,v7
On SPEC2017's 503.bwaves_r, depending on the workload, the reduction
in dynamic
instruction count varies from -4.66% to -4.75%.
The general issue with this kind of optimization (we have discussed it
a few times already) is that, depending on the uarch, we want the
local combine optimization that you show but not the fwprop/late-
combine one where we propagate a vector broadcast into a loop.
So IMHO in order to continue with this and similar patterns we need at
least accompanying rtx_cost handling that would allow us to tune per
uarch.
Please find attached an updated patch with an additional cost model. By
default, an instruction costs 4 and the penalty for moving data from a
floating-point to a vector register is 2; thus, vfmadd.vf costs 6, which
still makes it cheaper than vec_duplicate + vfmadd.vv (8). Different
tuning parameters can alter this tradeoff, though.
Pan Li sent a similar patch for vadd.vv/vadd.vx I think in November
and I believe he intended to continue when stage 1 opens.
An outstanding question is how to distinguish the combine case from
the late-combine case. I haven't yet thought about that in detail.
I experimented with a few variations of the PR's testcase. When the
loop body shrinks, the lower register pressure allows the
vec_duplicate to be hoisted into the loop preamble. In those cases, I did
not observe the vec_duplicate being propagated back into the loop body.
The other thing we should consider is that while we can certainly theorize
that this kind of register-file-crossing case carries an extra penalty
(it traditionally does), we don't have actual evidence that it's causing
a problem on any RISC-V design.
So maybe the way to go is to add a field to the uarch tuning structure
indicating the additional cost (if any) of a register-file-crossing
vector op of this nature. Then query that in riscv_rtx_costs, or
whatever our rtx_cost function is named.
Default that additional cost to zero initially. Then uarch experts can
fill in the appropriate value. Yea, such a simplistic approach wouldn't
handle cases like ours where you really need nearby context to be sure,
but I don't think we want to over-engineer this solution too badly right
now.
Note that since this isn't a regression it'll need to wait for gcc-16
development to open before the patch can go forward.
Thanks!
Jeff
ps. I know Baylibre's remit was to test dynamic icounts, and there were
good reasons for that. So don't worry about not having run it on a
design. If you happen to still have executables, pass them along
privately and I can run them on a BPI. ROMS is a few hours of runtime, but
that's not a big deal.
We recently received our own BPI board, so I was able to run
503.bwaves_r on it. Unfortunately, the reduction in dynamic instruction
count does not translate into similar execution-time gains: the
vector-scalar version is only 0.33% faster on average over 3 iterations.
Thanks,
--
PA
commit b0f1dbf8b4ad12c0eff459d0bf6b3d9c466fd5ad
Author: Paul-Antoine Arras <par...@baylibre.com>
Date: Tue Feb 25 16:38:54 2025 +0100
RISC-V: Add pattern for vector-scalar multiply-add/sub [PR119100]
This pattern enables the combine pass to merge a vec_duplicate into a plus-mult
or minus-mult RTL instruction.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v6,fa0
vfmadd.vv v9,v6,v7
After, we get only one:
vfmadd.vf v9,fa0,v7
On SPEC2017's 503.bwaves_r, depending on the workload, the reduction in dynamic
instruction count varies from -4.66% to -4.75%.
gcc/ChangeLog:
PR target/119100
* config/riscv/vector.md (*pred_<madd_msub><mode>_scalar): Define.
* config/riscv/riscv.cc (riscv_rtx_costs): Add cost for moving data from
a scalar floating-point to a vector register.
diff --git gcc/config/riscv/riscv.cc gcc/config/riscv/riscv.cc
index 38f3ae7cd84..0f0cf04bdd9 100644
--- gcc/config/riscv/riscv.cc
+++ gcc/config/riscv/riscv.cc
@@ -3864,6 +3864,18 @@ riscv_rtx_costs (rtx x, machine_mode mode, int outer_code, int opno ATTRIBUTE_UN
if (riscv_v_ext_mode_p (mode))
{
*total = COSTS_N_INSNS (1);
+ if ((GET_CODE (x) == PLUS || GET_CODE (x) == MINUS) && outer_code == SET)
+ {
+ rtx plus_op0 = XEXP (x, 0);
+ if (GET_CODE (plus_op0) == MULT)
+ {
+ rtx mult_op0 = XEXP (plus_op0, 0);
+ if (GET_CODE (mult_op0) == VEC_DUPLICATE)
+ {
+ *total += get_vector_costs ()->regmove->FR2VR;
+ }
+ }
+ }
return true;
}
diff --git gcc/config/riscv/vector.md gcc/config/riscv/vector.md
index 8ee43cf0ce1..6f538eeefda 100644
--- gcc/config/riscv/vector.md
+++ gcc/config/riscv/vector.md
@@ -6633,6 +6633,29 @@ (define_insn "*pred_<madd_msub><mode>_scalar"
(set (attr "frm_mode")
(symbol_ref "riscv_vector::get_frm_mode (operands[9])"))])
+(define_insn_and_split "*pred_<madd_msub><mode>_scalar"
+ [(set (match_operand:V_VLSF 0 "register_operand" "=vd, vr")
+ (plus_minus:V_VLSF
+ (mult:V_VLSF
+ (vec_duplicate:V_VLSF
+ (match_operand:<VEL> 1 "register_operand" " f, f"))
+ (match_operand:V_VLSF 2 "register_operand" " 0, 0"))
+ (match_operand:V_VLSF 3 "register_operand" " vr, vr")))]
+ "TARGET_VECTOR"
+ "#"
+ "!reload_completed"
+ [(const_int 0)]
+ {
+ rtx ops[] = {operands[0], operands[1], operands[2], operands[3],
+ operands[2]};
+ riscv_vector::emit_vlmax_insn (code_for_pred_mul_scalar (<CODE>, <MODE>mode),
+ riscv_vector::TERNARY_OP_FRM_DYN, ops);
+ DONE;
+ }
+ [(set_attr "type" "vfmuladd")
+ (set_attr "mode" "<MODE>")]
+)
+
(define_insn "*pred_<macc_msac><mode>_scalar"
[(set (match_operand:V_VLSF 0 "register_operand" "=vd, vr")
(if_then_else:V_VLSF