> If we encounter a uarch where the other sequence is better, then I think
> we can do something like query costs or the like and select between the
> approaches -- but no need to do that now.

> So OK for the trunk.


Thanks, the patch will be committed soon.




------------------ Original ------------------
From: "Jeff Law" <gcc-patches@gcc.gnu.org>
Date: Sat, Aug 12, 2023 07:02 AM
To: "Lehua Ding" <lehua.d...@rivai.ai>; "gcc-patches" <gcc-patches@gcc.gnu.org>
Cc: "juzhe.zhong" <juzhe.zh...@rivai.ai>; "kito.cheng" <kito.ch...@gmail.com>; "rdapp.gcc" <rdapp....@gmail.com>; "palmer" <pal...@rivosinc.com>
Subject: Re: [PATCH] RISC-V: Revert the convert from vmv.s.x to vmv.v.i



On 8/11/23 03:01, Lehua Ding wrote:
> Hi,
> 
> This patch reverts the conversion from vmv.s.x to vmv.v.i and adds a new
> pattern to optimize the special case where the scalar operand is zero.
> 
> Currently, a broadcast pattern whose scalar operand is an immediate is
> converted from vmv.s.x to vmv.v.i, and the mask operand is converted
> from 00..01 to 11..11. After discussing the advantages and
> disadvantages of this conversion with Juzhe offline, we chose not to do
> this transform.
> 
> Before:
> 
>     Advantages: The vsetvli info required by vmv.s.x has better
>     compatibility, since vmv.s.x only requires SEW to match and VL to be
>     zero or one. That means there are more opportunities to combine with
>     other vsetvli infos in the vsetvl pass.
> 
>     Disadvantages: For a non-zero scalar immediate, one more `li rd, imm`
>     instruction is needed.
> 
> After:
> 
>     Advantages: No `li rd, imm` instruction is needed, since vmv.v.i
>     supports an immediate operand.
> 
>     Disadvantages: The converse of the advantage above: worse
>     compatibility leads to more vsetvli instructions being needed.
> 
> Consider the C code below and the asm after autovectorization:
> there is an extra insn (vsetivli zero,1,e32,m1,ta,ma) after
> vmv.s.x is converted to vmv.v.i.
> 
> ```
> int foo1(int* restrict a, int* restrict b, int *restrict c, int n) {
>     int sum = 0;
>     for (int i = 0; i < n; i++)
>       sum += a[i] * b[i];
> 
>     return sum;
> }
> ```
> 
> asm (Before):
> 
> ```
> foo1:
>         ble     a3,zero,.L7
>         vsetvli a2,zero,e32,m1,ta,ma
>         vmv.v.i v1,0
> .L6:
>         vsetvli a5,a3,e32,m1,tu,ma
>         slli    a4,a5,2
>         sub     a3,a3,a5
>         vle32.v v2,0(a0)
>         vle32.v v3,0(a1)
>         add     a0,a0,a4
>         add     a1,a1,a4
>         vmacc.vv        v1,v3,v2
>         bne     a3,zero,.L6
>         vsetvli a2,zero,e32,m1,ta,ma
>         vmv.s.x v2,zero
>         vredsum.vs      v1,v1,v2
>         vmv.x.s a0,v1
>         ret
> .L7:
>         li      a0,0
>         ret
> ```
> 
> asm (After):
> 
> ```
> foo1:
>         ble     a3,zero,.L4
>         vsetvli a2,zero,e32,m1,ta,ma
>         vmv.v.i v1,0
> .L3:
>         vsetvli a5,a3,e32,m1,tu,ma
>         slli    a4,a5,2
>         sub     a3,a3,a5
>         vle32.v v2,0(a0)
>         vle32.v v3,0(a1)
>         add     a0,a0,a4
>         add     a1,a1,a4
>         vmacc.vv        v1,v3,v2
>         bne     a3,zero,.L3
>         vsetivli        zero,1,e32,m1,ta,ma
>         vmv.v.i v2,0
>         vsetvli a2,zero,e32,m1,ta,ma
>         vredsum.vs      v1,v1,v2
>         vmv.x.s a0,v1
>         ret
> .L4:
>         li      a0,0
>         ret
> ```
> 
> Best,
> Lehua
> 
> Co-Authored-By: Ju-Zhe Zhong <juzhe.zh...@rivai.ai>
> 
> gcc/ChangeLog:
> 
> 	* config/riscv/predicates.md (vector_const_0_operand): New.
> 	* config/riscv/vector.md (*pred_broadcast<mode>_zero): Ditto.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/riscv/rvv/base/scalar_move-5.c: Update.
> 	* gcc.target/riscv/rvv/base/scalar_move-6.c: Ditto.
If we encounter a uarch where the other sequence is better, then I think 
we can do something like query costs or the like and select between the 
approaches -- but no need to do that now.

So OK for the trunk.
jeff
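
For readers less familiar with GCC machine descriptions, the new predicate named in the ChangeLog above could be sketched roughly like this. This is a hypothetical illustration of the approach, not the committed code — the actual definitions are in the patch itself:

```
;; Hypothetical sketch, not the committed code: a predicate that
;; matches only an all-zeros vector constant, so a dedicated
;; *pred_broadcast<mode>_zero pattern can keep the vmv.s.x form
;; (with its more compatible vsetvli requirements) for the
;; scalar-is-zero case instead of converting it to vmv.v.i.
(define_predicate "vector_const_0_operand"
  (and (match_code "const_vector")
       (match_test "op == CONST0_RTX (GET_MODE (op))")))
```

A pattern guarded by such a predicate can emit `vmv.s.x vd, zero` directly, avoiding both the `li rd, imm` instruction and the extra vsetivli shown in the "After" asm above.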
