Committed, thanks Jeff.

Pan
-----Original Message-----
From: Gcc-patches <gcc-patches-bounces+pan2.li=intel....@gcc.gnu.org> On Behalf Of Jeff Law via Gcc-patches
Sent: Friday, June 2, 2023 2:52 AM
To: juzhe.zh...@rivai.ai; gcc-patches@gcc.gnu.org
Cc: kito.ch...@gmail.com; kito.ch...@sifive.com; pal...@dabbelt.com; pal...@rivosinc.com; rdapp....@gmail.com
Subject: Re: [PATCH] RISC-V: Add vwadd.wv/vwsub.wv auto-vectorization lowering optimization

On 5/31/23 21:48, juzhe.zh...@rivai.ai wrote:
> From: Juzhe-Zhong <juzhe.zh...@rivai.ai>
>
> 1. This patch optimizes the codegen of the following auto-vectorized code:
>
> void foo (int32_t * __restrict a, int64_t * __restrict b,
>           int64_t * __restrict c, int n)
> {
>   for (int i = 0; i < n; i++)
>     c[i] = (int64_t) a[i] + b[i];
> }
>
> It combines the instruction sequence:
>
> ...
> vsext.vf2
> vadd.vv
> ...
>
> into:
>
> ...
> vwadd.wv
> ...
>
> For the PLUS operation, GCC prefers the following RTL operand order
> when combining:
>
> (plus: (sign_extend:..)
>        (reg:))
>
> instead of
>
> (plus: (reg:..)
>        (sign_extend:))
>
> which is different from the MINUS pattern.

Right.  Canonicalization rules will have the sign_extend as the first
operand when the opcode is commutative.
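To make that concrete, here is a schematic sketch of the two canonical
shapes combine looks for (the modes and pseudo register numbers are
invented for illustration, not taken from a real RTL dump):

  ;; PLUS is commutative, so canonicalization puts the more complex
  ;; operand (the sign_extend) in the first position:
  (plus:VNx2DI (sign_extend:VNx2DI (reg:VNx2SI 134))
               (reg:VNx2DI 135))

  ;; MINUS is not commutative, so the operands keep their source order,
  ;; leaving the sign_extend in the second position (wide - narrow):
  (minus:VNx2DI (reg:VNx2DI 135)
                (sign_extend:VNx2DI (reg:VNx2SI 134)))

That asymmetry is why the shared plus_minus pattern gets split into
dedicated add and sub patterns in the patch.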
> I split the vwadd/vwsub patterns and added dedicated patterns for
> each of them.
>
> 2. This patch not only optimizes the case mentioned in (1); it also
> enhances vwadd.vv/vwsub.vv optimization for more complicated
> PLUS/MINUS code.  Consider the following:
>
> __attribute__ ((noipa)) void
> vwadd_int16_t_int8_t (int16_t *__restrict dst, int16_t *__restrict dst2,
>                       int16_t *__restrict dst3, int8_t *__restrict a,
>                       int8_t *__restrict b, int8_t *__restrict a2,
>                       int8_t *__restrict b2, int n)
> {
>   for (int i = 0; i < n; i++)
>     {
>       dst[i] = (int16_t) a[i] + (int16_t) b[i];
>       dst2[i] = (int16_t) a2[i] + (int16_t) b[i];
>       dst3[i] = (int16_t) a2[i] + (int16_t) a[i];
>     }
> }
>
> Before this patch:
>
> ...
> vsetvli zero,a6,e8,mf2,ta,ma
> vle8.v v2,0(a3)
> vle8.v v1,0(a4)
> vsetvli t1,zero,e16,m1,ta,ma
> vsext.vf2 v3,v2
> vsext.vf2 v2,v1
> vadd.vv v1,v2,v3
> vsetvli zero,a6,e16,m1,ta,ma
> vse16.v v1,0(a0)
> vle8.v v4,0(a5)
> vsetvli t1,zero,e16,m1,ta,ma
> vsext.vf2 v1,v4
> vadd.vv v2,v1,v2
> ...
>
> After this patch:
>
> ...
> vsetvli zero,a6,e8,mf2,ta,ma
> vle8.v v3,0(a4)
> vle8.v v1,0(a3)
> vsetvli t4,zero,e8,mf2,ta,ma
> vwadd.vv v2,v1,v3
> vsetvli zero,a6,e16,m1,ta,ma
> vse16.v v2,0(a0)
> vle8.v v2,0(a5)
> vsetvli t4,zero,e8,mf2,ta,ma
> vwadd.vv v4,v3,v2
> vsetvli zero,a6,e16,m1,ta,ma
> vse16.v v4,0(a1)
> vsetvli t4,zero,e8,mf2,ta,ma
> sub a7,a7,a6
> vwadd.vv v3,v2,v1
> vsetvli zero,a6,e16,m1,ta,ma
> vse16.v v3,0(a2)
> ...
>
> The reason current upstream GCC cannot fully optimize such code with
> vwadd is that the combine pass needs an intermediate RTL IR, namely a
> pattern that extends only one of the operands (vwadd.wv); based on
> that intermediate IR, combine can then extend the other operand and
> generate vwadd.vv.
>
> So vwadd.wv/vwsub.wv definitely helps vwadd.vv/vwsub.vv code
> optimization.
>
> gcc/ChangeLog:
>
>         * config/riscv/riscv-vector-builtins-bases.cc: Change
>         vwadd.wv/vwsub.wv intrinsic API expander.
>         * config/riscv/vector.md
>         (@pred_single_widen_<plus_minus:optab><any_extend:su><mode>):
>         Remove it.
>         (@pred_single_widen_sub<any_extend:su><mode>): New pattern.
>         (@pred_single_widen_add<any_extend:su><mode>): New pattern.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/autovec/widen/widen-5.c: New test.
>         * gcc.target/riscv/rvv/autovec/widen/widen-6.c: New test.
>         * gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c: New test.
>         * gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c: New test.
>         * gcc.target/riscv/rvv/autovec/widen/widen_run-5.c: New test.
>         * gcc.target/riscv/rvv/autovec/widen/widen_run-6.c: New test.

OK
jeff
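As a footnote on the combine argument quoted above, a schematic view of
the two-step progression (the RTL is invented to show the shapes, with
symbolic operands and placeholder WIDE/NARROW modes; real combine dumps
will differ):

  ;; Start: tmp1 = sign_extend (a);  tmp2 = sign_extend (b);
  ;;        dst  = tmp1 + tmp2
  ;;
  ;; Step 1: combine folds one extend into the add, which only matches
  ;; if a single-widen (.wv) pattern exists:
  (plus:WIDE (sign_extend:WIDE (reg:NARROW a))
             (reg:WIDE tmp2))
  ;;
  ;; Step 2: starting from that intermediate, combine folds the other
  ;; extend as well, matching the fully widening (.vv) pattern:
  (plus:WIDE (sign_extend:WIDE (reg:NARROW a))
             (sign_extend:WIDE (reg:NARROW b)))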
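For the subtract side, a minimal example of my own (not taken from the
patch's testsuite) that should exercise the new vwsub.wv pattern, since
vwsub.wv computes a wide source minus a sign-extended narrow source;
the compile flags are an assumption, as the patch does not state them:

  #include <stdint.h>

  /* Expected to lower to vwsub.wv: wide value minus sign-extended
     narrow value.  Try e.g. -march=rv64gcv -O3 with the patch applied
     (flags assumed, not stated in the patch).  */
  void
  foo_sub (int32_t *__restrict a, int64_t *__restrict b,
           int64_t *__restrict c, int n)
  {
    for (int i = 0; i < n; i++)
      c[i] = b[i] - (int64_t) a[i];
  }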