https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #8 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
> 
> Let me paste the whole assembly code produced here (sans decorations):
> 
>       beq     a5,zero,.L2
>       vsetvli zero,a6,e32,m1,tu,ma
> .L3:
>       beq     a4,zero,.L7
>       li      a5,0
> .L5:
>       vle32.v v1,0(a0)
>       vle32.v v1,0(a1)
>       vle32.v v1,0(a2)
>       vse32.v v1,0(a3)
>       addi    a5,a5,1
>       bne     a4,a5,.L5
> .L7:
>       ret
> .L2:
>       vsetvli zero,a6,e32,m1,tu,ma
>       j       .L3
> 
> This seems to me to correspond to this source code:
> 
>   if (cond)
>     __riscv_vsetvl_e32m1(avl);
>   else
>     __riscv_vsetvl_e16mf2(avl);
>   for (size_t i = 0; i < n; i += 1) {
>     vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
>     vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
>     vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
>     __riscv_vse32_v_i32m1(out, c, avl);
>   }
> 
> And in that case I'd expect the conditional to be optimised away, as its
> result is ignored (along with the intrinsics) and does not affect actual
> code executed except for the different execution path, i.e.:
> 
>       beq     a4,zero,.L7
>       vsetvli zero,a6,e32,m1,tu,ma
>       li      a5,0
> .L5:
>       vle32.v v1,0(a0)
>       vle32.v v1,0(a1)
>       vle32.v v1,0(a2)
>       vse32.v v1,0(a3)
>       addi    a5,a5,1
>       bne     a4,a5,.L5
> .L7:
>       ret
> 

Good catch! I think we have a missed optimization here, and I agree that the code
you show is the correct and optimal codegen for this case.
What we generate now is close to optimal, but not optimal enough.

And this optimization should not be done by the VSETVL PASS.

After VSETVL PASS fusion, both the e16mf2 and the e32m1 user vsetvl intrinsics are
fused into e32m1, tu. They end up exactly the same, so it is pointless to keep
them in separate blocks (they should be in the same single block).

The reason we miss this optimization is that we expand the user vsetvl intrinsics
__riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 into two different RTL expressions.
The passes that run before VSETVL don't know they are equivalent, so they keep
them in separate blocks.

If you change the code as follows:
  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e32m1(avl);

I am sure the codegen will then be as you describe above (a single vsetvl e32m1, tu
in a single block).
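
For reference, a minimal sketch of that rewritten testcase (the function name foo
and the signature are placeholders reconstructed from the source quoted above, not
the actual testcase):

#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>

/* Both branches request the same e32m1 configuration, so the two user
   vsetvl intrinsics expand to identical RTL and the passes that run
   before VSETVL can merge the branches into a single block.  */
void
foo (int cond, size_t avl, size_t n,
     const int32_t *in1, const int32_t *in2, const int32_t *in3,
     int32_t *out)
{
  size_t vl;
  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e32m1(avl);
  (void) vl;  /* The VL result is ignored, as in the original testcase.  */

  for (size_t i = 0; i < n; i += 1)
    {
      vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
      vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
      vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
      __riscv_vse32_v_i32m1(out, c, avl);
    }
}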

To optimize it, one alternative approach is to expand all user vsetvl intrinsics
into the same RTL expression, as long as they have the same SEW/LMUL ratio.


That is, expand

__riscv_vsetvl_e64m1
__riscv_vsetvl_e32m1
__riscv_vsetvl_e16mf2
__riscv_vsetvl_e8mf8

into the same RTL expression, since (given the same ratio) their VL outputs are the same.

I don't see this causing any problems here.

But intrinsics with different ratios, like e32m1 and e32mf2, should remain
different RTL expressions.
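
Roughly, the idea would apply like this (just an illustration; cond, avl and vl
are taken from the testcase above):

  /* e32m1 (SEW=32, LMUL=1) and e16mf2 (SEW=16, LMUL=1/2) share the ratio
     SEW/LMUL = 32, so VLMAX is the same and both calls return the same VL
     for a given AVL.  Under this proposal they would expand to the same
     RTL expression, and the earlier passes could then merge the branches.  */
  size_t vl;
  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);   /* ratio 32 */
  else
    vl = __riscv_vsetvl_e16mf2(avl);  /* ratio 32, same VL result */

  /* By contrast, e32m1 (ratio 32) and e32mf2 (SEW=32, LMUL=1/2, ratio 64)
     have different ratios, hence different VLMAX, so they would keep
     distinct RTL expressions.  */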

I am not sure whether kito agrees with this idea.


Another alternative approach is to enhance the bb_reorder PASS.
The VSETVL PASS runs before the bb_reorder PASS, and the current bb_reorder PASS
is unable to fuse these two e32m1, tu vsetvls into the same block because we have
already split them into "real" vsetvls, i.e. RTL patterns with side effects.

The "real" vsetvl patterns that generate assembly must have side effects,
since vsetvl changes the global VL/VTYPE status and also sets a general
register.

No matter which approach we take to optimize this, I won't do it in GCC 14, since
stage 1 is about to close.  We have a few more features (which are much more
important) that we are planning and working to support in GCC 14.
I am confident that the current VSETVL PASS in RVV GCC is already optimal and
fancy enough.

After stage 1 closes, we won't do any more optimization work; we will only run full
coverage testing (for example, running the whole GCC testsuite with different LMULs
and different -march settings) and fix bugs.
