https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125476

            Bug ID: 125476
           Summary: RISC-V: unexpected rvv prologue cost result
           Product: gcc
           Version: 17.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chenzhongyao.hit at gmail dot com
  Target Milestone: ---
            Target: riscv

After applying the VLS LMUL cost-scaling patch below:
```
diff --git a/gcc/config/riscv/riscv-vector-costs.cc b/gcc/config/riscv/riscv-vector-costs.cc
index e678e0de766..c003016caeb 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1245,9 +1245,6 @@ segment_loadstore_group_size (enum vect_cost_for_stmt kind,
 static unsigned
 get_lmul_cost_scaling (machine_mode mode)
 {
-  if (!riscv_vla_mode_p (mode))
-    return 1;
-
   enum vlmul_type vlmul = get_vlmul (mode);
```

I am seeing what looks like an unexpected cost-model result for RVV mode
selection for the loop below:

and_int8_t1024:
```c
#include <stdint.h>
void and_int8_t1024 (int8_t *restrict a, int8_t *restrict b) {
  for (int i = 0; i < 1024; ++i)
    a[i] = b[i] & -16;
}
```

With VLEN=4096:

**Upstream:**
```assembly
        li      a5,1024
        vsetvli zero,a5,e8,m2,ta,ma
        vle8.v  v2,0(a1)
        vand.vi v2,v2,-16
        vse8.v  v2,0(a0)
        ret
```

**After the VLS LMUL cost-scaling patch:** `m1` is selected and the loop is
unrolled, at the cost of two extra `add` instructions.
```assembly
        li      a5,512
        vsetvli zero,a5,e8,m1,ta,ma
        vle8.v  v1,0(a1) ...
        add     a0,a0,a5
        add     a1,a1,a5
        vle8.v  v1,0(a1) ...
```
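
For reference, the element counts in the two listings follow from the usual RVV relation VLMAX = LMUL * VLEN / SEW. The sketch below is only illustrative arithmetic; `vlmax` and `vector_iterations` are hypothetical helper names, not GCC functions, and only integer LMUL is modelled.

```c
/* Illustrative arithmetic only; these helpers are hypothetical and do
   not exist in GCC.  Fractional LMUL is deliberately not modelled.  */

/* Maximum number of elements per vector register group:
   VLMAX = LMUL * VLEN / SEW.  */
static unsigned
vlmax (unsigned vlen_bits, unsigned sew_bits, unsigned lmul)
{
  return lmul * vlen_bits / sew_bits;
}

/* Number of vector iterations needed to cover TRIP_COUNT scalar
   iterations when each vector iteration handles VL elements.  */
static unsigned
vector_iterations (unsigned trip_count, unsigned vl)
{
  return (trip_count + vl - 1) / vl;
}
```

With VLEN=4096 and SEW=8, `m2` covers all 1024 elements in one iteration, while `m1` covers 512 and needs two, which matches the extra pointer `add`s in the second listing.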

We expected the original `m2` code to have the lower cost. However, the
`-fdump-tree-vect-details` output shows:
```
V512QI:
minimal_reproducer.c:6:21: note: Cost model analysis:
Vector inside of loop cost: 3
Vector prologue cost: 1
Vector epilogue cost: 0
Scalar iteration cost: 3
Scalar outside cost: 0
Vector outside cost: 1
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1

V1024QI:
minimal_reproducer.c:6:21: note: Cost model analysis:
Vector inside of loop cost: 6
Vector prologue cost: 2
Vector epilogue cost: 0
Scalar iteration cost: 3
Scalar outside cost: 0
Vector outside cost: 2
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
```

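It seems the vector prologue cost is what causes `m1` to be chosen: the
loop-body comparison apparently ties, so the decision falls through to the
outside-loop cost (1 for V512QI vs. 2 for V1024QI):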
```
bool
vector_costs::better_main_loop_than_p (const vector_costs *other) const
{
  int diff = compare_inside_loop_cost (other);
  if (diff != 0)
    return diff < 0;

  /* If there's nothing to choose between the loop bodies, see whether
     there's a difference in the prologue and epilogue costs.  */
  diff = compare_outside_loop_cost (other);  /* <--- here */
  if (diff != 0)
    return diff < 0;

  return false;
}
```
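
The tie-break in the function quoted above can be reproduced with the numbers from the dump. This is a simplified model, not GCC's implementation: `struct candidate_cost`, `compare_inside`, `compare_outside` and `better_main_loop_than` are hypothetical names, and the sketch assumes the inside-loop comparison normalizes each body cost by the number of scalar iterations one vector iteration covers (512 for V512QI, 1024 for V1024QI).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-candidate numbers in the vect dump.  */
struct candidate_cost
{
  unsigned vf;           /* scalar iterations covered per vector iteration */
  unsigned inside_cost;  /* "Vector inside of loop cost" */
  unsigned outside_cost; /* "Vector outside cost" */
};

/* Assumption: body costs are compared per scalar iteration.  Cross
   multiplying avoids division: a.inside / a.vf <=> b.inside / b.vf.  */
static int
compare_inside (const struct candidate_cost *a,
                const struct candidate_cost *b)
{
  unsigned long lhs = (unsigned long) a->inside_cost * b->vf;
  unsigned long rhs = (unsigned long) b->inside_cost * a->vf;
  return (lhs > rhs) - (lhs < rhs);
}

/* Prologue plus epilogue cost is compared as-is, without normalization.  */
static int
compare_outside (const struct candidate_cost *a,
                 const struct candidate_cost *b)
{
  return (a->outside_cost > b->outside_cost)
         - (a->outside_cost < b->outside_cost);
}

/* Same control flow as the quoted better_main_loop_than_p.  */
static bool
better_main_loop_than (const struct candidate_cost *a,
                       const struct candidate_cost *b)
{
  int diff = compare_inside (a, b);
  if (diff != 0)
    return diff < 0;

  diff = compare_outside (a, b);
  if (diff != 0)
    return diff < 0;

  return false;
}
```

Under that assumption, V512QI `{512, 3, 1}` and V1024QI `{1024, 6, 2}` tie on the loop body (3 * 1024 equals 6 * 512), so the smaller outside cost of V512QI decides the choice.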

The prologue cost is recorded because the cost model assumes that the
constant -16 must be loaded into a vector before the loop starts.
In this case, however, the constant is encoded directly as the immediate of
`vand.vi`, so I think both V512QI and V1024QI should have no prologue cost.
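
As a rough illustration of why no vector constant needs to be materialized here: the `.vi` instruction forms carry a 5-bit immediate, which `vand.vi` sign-extends, so any value in [-16, 15] fits. The helper below is a hypothetical sketch of that range check; `fits_simm5` is not a GCC function, and this is not a claim about how the cost model should implement the fix.

```c
#include <stdbool.h>

/* Hypothetical helper: true if VALUE fits the sign-extended 5-bit
   immediate field used by instructions such as vand.vi.  */
static bool
fits_simm5 (long value)
{
  return value >= -16 && value <= 15;
}
```

The constant -16 from the reproducer sits exactly at the lower bound of that range, which is consistent with both listings using `vand.vi` rather than a separate broadcast.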