https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125476
Bug ID: 125476
Summary: RISC-V: unexpected rvv prologue cost result
Product: gcc
Version: 17.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: chenzhongyao.hit at gmail dot com
Target Milestone: ---
Target: riscv
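After applying the following VLS LMUL cost-scaling patch, which removes the early return for non-VLA modes in `get_lmul_cost_scaling`: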
```
diff --git a/gcc/config/riscv/riscv-vector-costs.cc b/gcc/config/riscv/riscv-vector-costs.cc
index e678e0de766..c003016caeb 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1245,9 +1245,6 @@ segment_loadstore_group_size (enum vect_cost_for_stmt kind,
 static unsigned
 get_lmul_cost_scaling (machine_mode mode)
 {
-  if (!riscv_vla_mode_p (mode))
-    return 1;
-
   enum vlmul_type vlmul = get_vlmul (mode);
```
I am seeing what looks like an unexpected cost-model result in RVV mode
selection for the loop `and_int8_t1024` below:
```c
#include <stdint.h>
void and_int8_t1024 (int8_t *restrict a, int8_t *restrict b) {
for (int i = 0; i < 1024; ++i)
a[i] = b[i] & -16;
}
```
This is compiled with VLEN=4096.
**Upstream:**
```assembly
li a5,1024
vsetvli zero,a5,e8,m2,ta,ma
vle8.v v2,0(a1)
vand.vi v2,v2,-16
vse8.v v2,0(a0)
ret
```
**After the VLS LMUL cost-scaling patch:** `m1` is selected and the loop is
unrolled, which costs two extra `add` instructions.
```assembly
li a5,512
vsetvli zero,a5,e8,m1,ta,ma
vle8.v v1,0(a1) ...
add a0,a0,a5
add a1,a1,a5
vle8.v v1,0(a1) ...
```
We expect the original `m2` choice to have the lower cost.
However, the `-fdump-tree-vect-details` output shows:
```
V512QI:
minimal_reproducer.c:6:21: note: Cost model analysis:
Vector inside of loop cost: 3
Vector prologue cost: 1
Vector epilogue cost: 0
Scalar iteration cost: 3
Scalar outside cost: 0
Vector outside cost: 1
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
V1024QI:
minimal_reproducer.c:6:21: note: Cost model analysis:
Vector inside of loop cost: 6
Vector prologue cost: 2
Vector epilogue cost: 0
Scalar iteration cost: 3
Scalar outside cost: 0
Vector outside cost: 2
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
```
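It seems that the vector prologue cost is what causes `m1` to be chosen. The
inside-loop costs appear to tie once scaled per scalar iteration (3 per 512
elements vs. 6 per 1024 elements), so the decision falls through to the
outside-cost comparison in `better_main_loop_than_p`: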
```
bool
vector_costs::better_main_loop_than_p (const vector_costs *other) const
{
  int diff = compare_inside_loop_cost (other);
  if (diff != 0)
    return diff < 0;

  /* If there's nothing to choose between the loop bodies, see whether
     there's a difference in the prologue and epilogue costs.  */
  diff = compare_outside_loop_cost (other);  /* <--- here */
  if (diff != 0)
    return diff < 0;

  return false;
}
```
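To make the tie-break concrete, here is a minimal C sketch of the comparison as I understand it. It is a simplified model, not GCC's implementation: `struct candidate`, `compare_inside`, `compare_outside` and `better_main_loop` are hypothetical names, and it assumes the body costs are compared per scalar iteration by cross-multiplying with the other candidate's vectorization factor.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for one vectorization candidate.
   This is NOT a GCC data structure.  */
struct candidate {
  long vf;            /* scalar iterations covered by one vector iteration */
  long inside_cost;   /* "Vector inside of loop cost" from the dump */
  long outside_cost;  /* "Vector outside cost" from the dump */
};

/* Return <0 if A's loop body is cheaper per scalar iteration, >0 if B's
   is cheaper, 0 on a tie.  Cross-multiplying compares
   A.inside_cost / A.vf against B.inside_cost / B.vf without fractions.  */
static int compare_inside(const struct candidate *a, const struct candidate *b)
{
  assert(a->vf > 0 && b->vf > 0);
  long rel_a = a->inside_cost * b->vf;
  long rel_b = b->inside_cost * a->vf;
  return (rel_a > rel_b) - (rel_a < rel_b);
}

/* Assumed here to be a plain comparison of the outside (prologue plus
   epilogue) costs, with no scaling by the vectorization factor.  */
static int compare_outside(const struct candidate *a, const struct candidate *b)
{
  return (a->outside_cost > b->outside_cost)
         - (a->outside_cost < b->outside_cost);
}

/* Return 1 if A is preferred over B as the main loop, else 0.  */
static int better_main_loop(const struct candidate *a, const struct candidate *b)
{
  int diff = compare_inside(a, b);
  if (diff != 0)
    return diff < 0;
  diff = compare_outside(a, b);
  return diff < 0;
}
```

Under this model, plugging in the dump values (VF 512, inside 3, outside 1 for V512QI; VF 1024, inside 6, outside 2 for V1024QI) gives 3 * 1024 = 6 * 512 = 3072 for both bodies, so `compare_inside` ties and the outside cost of 1 vs. 2 alone picks V512QI.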
The prologue cost is recorded because the cost model assumes that the
constant -16 must be loaded into a vector before the loop starts.
In this case, however, `vand.vi` takes -16 as an immediate, so I think
neither V512QI nor V1024QI should have any prologue cost.
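As a sketch of the expectation, the check I have in mind looks roughly like the following. The helper names `fits_simm5` and `splat_prologue_cost` are hypothetical and are not GCC functions; the only fact relied on is that the `.vi` instruction forms take a 5-bit signed immediate, i.e. a value in the range [-16, 15].

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if VALUE fits the 5-bit signed immediate
   field used by .vi instruction forms such as vand.vi.  */
static bool fits_simm5(int64_t value)
{
  return value >= -16 && value <= 15;
}

/* Hypothetical helper: prologue cost to charge for a loop-invariant
   constant operand.  If the constant can be encoded as an immediate,
   no vector splat is needed before the loop, so the cost is 0;
   otherwise charge SPLAT_COST.  */
static unsigned splat_prologue_cost(int64_t value, unsigned splat_cost)
{
  return fits_simm5(value) ? 0 : splat_cost;
}
```

With such a check, the constant -16 in this loop would contribute no prologue cost to either candidate mode, and the outside-cost comparison would no longer separate them.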