Hi Robin,

Thanks for your review.

> I'm not sure about these conditions. In particular why we should not apply
> cost factor when the loop is not unrolled. We already factor in the iteration
> count when costing and just getting rid of a few scalar induction variables
> doesn't offset the additional LMUL latency.

You're right. I was worried about the impact and tried to limit the change to
just fixing the PR test case. I'll remove these conditions and apply the cost
factor in all cases.

> Why +1, +2, +3 when the actual data processed is *2, *4, *8? I'd scale by
> that, as usually the latency is also similarly affected.

Good point. I understand "cost" is an abstract concept, so I wasn't sure
whether it should scale with LMUL the same way a hardware instruction's
latency/throughput does. I was also worried that 2x/4x/8x might be too
aggressive and cause some loops to skip vectorization or wrongly pick a
smaller LMUL. I'll try 2x/4x/8x, then run the testsuite and SPEC2017 to check
whether it performs better.

> As long as we're treating everything the same I wonder if we can just check
> if the mode is a vector mode and then apply the LMUL penalty. I would
> also rather call it LMUL scaling, a penalty would imply that a uarch is even
> slower than what the amount of data processed indicates.

Makes sense. I'll rename it to get_lmul_cost_scaling() and move the logic to
the end of adjust_stmt_cost().
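Roughly what I'm planning, as a sketch only (it reuses the checks already in
the patch, riscv_v_ext_vector_mode_p and get_vlmul; the final multipliers and
the exact spot at the end of adjust_stmt_cost() still depend on the
testsuite/SPEC2017 runs):

static unsigned
get_lmul_cost_scaling (machine_mode mode)
{
  if (!riscv_v_ext_vector_mode_p (mode))
    return 1;

  /* Scale the cost by the amount of data processed relative to m1:
     m2/m4/m8 operate on 2x/4x/8x the data, m1 and the fractional
     LMULs keep the baseline cost.  */
  switch (get_vlmul (mode))
    {
    case LMUL_2:
      return 2;
    case LMUL_4:
      return 4;
    case LMUL_8:
      return 8;
    default:
      return 1;
    }
}

and then, once at the end of adjust_stmt_cost() instead of the per-case
additions:

  if (vectype)
    /* Multiply rather than add so the adjustment tracks the data volume.  */
    stmt_cost *= get_lmul_cost_scaling (TYPE_MODE (vectype));

Multiplying instead of adding a constant keeps the adjustment proportional to
the amount of data processed, which I think matches your suggestion.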
Regards
Zhongyao

------------------Original------------------
From: rdapp.gcc <[email protected]>
To: chenzhongyao.hit <[email protected]>; gcc-patches <[email protected]>;
Cc: rdapp.gcc <[email protected]>; jeffreyalaw <[email protected]>; kitocheng <[email protected]>; Chen Zhongyao10243956;
Date: 2025/11/20 16:34
Subject: Re: [PATCH] RISC-V: Apply LMUL cost penalty to vector operations

Hi Zhongyao,

sorry for the delay.

> +/* Calculate LMUL-based cost adjustment factor.
> +   Larger LMUL values increase execution overhead.
> +
> +   This penalty is only applied when the loop is completely unrolled.
> +   Returns additional cost to be added based on LMUL.  */
> +static unsigned
> +get_lmul_cost_penalty (machine_mode mode, loop_vec_info loop_vinfo)
> +{
> +  if (!riscv_v_ext_vector_mode_p (mode))
> +    return 0;
> +
> +  /* Only apply LMUL penalty when loop is completely unrolled.
> +     For non-unrolled loops, larger LMUL reduces iteration count,
> +     which may provide overall benefit despite slower instructions.  */
> +  if (!loop_vinfo)
> +    return 0;
> +
> +  /* Check if loop will be completely unrolled:
> +     - NITERS must be known at compile time
> +     - NITERS must be less than VF (single iteration)  */
> +  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    return 0;

I'm not sure about these conditions. In particular why we should not apply
cost factor when the loop is not unrolled. We already factor in the iteration
count when costing and just getting rid of a few scalar induction variables
doesn't offset the additional LMUL latency.

> +
> +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  unsigned HOST_WIDE_INT niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
> +
> +  /* If NITERS >= VF, loop will have multiple iterations.
> +     In this case, larger LMUL reduces loop count, don't penalize.  */
> +  if (maybe_ge (poly_uint64 (niters), vf))
> +    return 0;
> +
> +  /* Loop is completely unrolled (single iteration).
> +     Apply LMUL penalty since larger LMUL increases latency.  */
> +  enum vlmul_type vlmul = get_vlmul (mode);
> +
> +  /* Cost penalty increases with LMUL:
> +     - m1 (LMUL_1): 0 penalty (baseline)
> +     - m2 (LMUL_2): +1
> +     - m4 (LMUL_4): +2
> +     - m8 (LMUL_8): +3
> +     - mf2/mf4/mf8: 0 (already efficient)  */
> +  switch (vlmul)
> +    {
> +    case LMUL_2:
> +      return 1;
> +    case LMUL_4:
> +      return 2;
> +    case LMUL_8:
> +      return 3;
> +    case LMUL_1:
> +    case LMUL_F2:
> +    case LMUL_F4:
> +    case LMUL_F8:
> +    default:
> +      return 0;

Why +1, +2, +3 when the actual data processed is *2, *4, *8? I'd scale by
that, as usually the latency is also similarly affected.

> +
>  /* Adjust vectorization cost after calling riscv_builtin_vectorization_cost.
>     For some statement, we would like to further fine-grain tweak the cost on
>     top of riscv_builtin_vectorization_cost handling which doesn't have any
> @@ -1181,6 +1239,15 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
>          default:
>            break;
>          }
> +
> +      /* Adjust cost for all segment load/store operations based on
> +         actual vectype LMUL.  Only penalize when loop is completely
> +         unrolled.  */
> +      if (vectype)
> +        {
> +          machine_mode actual_mode = TYPE_MODE (vectype);
> +          stmt_cost += get_lmul_cost_penalty (actual_mode, loop);
> +        }
>      }
>    else
>      {
> @@ -1236,10 +1303,29 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
>              }
>          }
>      }
> +
> +      /* Apply LMUL penalty for unit-stride operations.
> +         This ensures consistent cost modeling across all
> +         vector load/store types when loop is unrolled.  */
> +      if (vectype)
> +        {
> +          machine_mode actual_mode = TYPE_MODE (vectype);
> +          stmt_cost += get_lmul_cost_penalty (actual_mode, loop);
> +        }
>      }
>    break;
>  }
>
> +    case vector_stmt:
> +      /* Adjust cost for all vector arithmetic operations based on LMUL.
> +         Only penalize when loop is completely unrolled.  */
> +      if (vectype)
> +        {
> +          machine_mode actual_mode = TYPE_MODE (vectype);
> +          stmt_cost += get_lmul_cost_penalty (actual_mode, loop);
> +        }
> +      break;

As long as we're treating everything the same I wonder if we can just check
if the mode is a vector mode and then apply the LMUL penalty. I would
also rather call it LMUL scaling, a penalty would imply that a uarch is even
slower than what the amount of data processed indicates.

--
Regards
 Robin
