On 7/11/23 00:38, juzhe.zh...@rivai.ai wrote:
From: Ju-Zhe Zhong <juzhe.zh...@rivai.ai>

This patch recognizes a specific permutation pattern to which the compress
approach can be applied.

Consider the following case:
#include <stdint.h>
typedef int8_t vnx64i __attribute__ ((vector_size (64)));
#define MASK_64                                                                \
   1, 2, 3, 5, 7, 9, 10, 11, 12, 14, 15, 17, 19, 21, 22, 23, 26, 28, 30, 31,    \
     37, 38, 41, 46, 47, 53, 54, 55, 60, 61, 62, 63, 76, 77, 78, 79, 80, 81,    \
     82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,    \
     100, 101, 102, 103, 104, 105, 106, 107
void __attribute__ ((noinline, noclone))
test_1 (int8_t *x, int8_t *y, int8_t *out)
{
   vnx64i v1 = *(vnx64i*)x;
   vnx64i v2 = *(vnx64i*)y;
   vnx64i v3 = __builtin_shufflevector (v1, v2, MASK_64);
   *(vnx64i*)out = v3;
}

https://godbolt.org/z/P33nev6cW
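
To make the shape of MASK_64 concrete, here is a rough scalar sketch of one way such a selector could be classified. It is not the code added to riscv-v.cc. The function name compress_slide_amount is hypothetical, and the conditions are an assumption read off the example: a strictly increasing run of indices into the first input, followed by a consecutive run of indices into the second input.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only; not taken from the patch.
   sel holds n selector indices for a two-input shuffle (0 .. 2n-1).
   Returns the slide-up amount for the second input if sel has the
   assumed compress shape, or -1 otherwise.  */
int
compress_slide_amount (const int *sel, int n)
{
  if (sel == NULL || n <= 0)
    return -1;

  /* Every index must address one of the two n-element inputs.  */
  for (int i = 0; i < n; i++)
    if (sel[i] < 0 || sel[i] >= 2 * n)
      return -1;

  /* Find the first index that refers to the second input.  */
  int point = -1;
  for (int i = 0; i < n; i++)
    if (sel[i] >= n)
      {
        point = i;
        break;
      }
  if (point < 0)
    return -1;

  /* Indices before that point must be strictly increasing.  */
  for (int i = 1; i < point; i++)
    if (sel[i] <= sel[i - 1])
      return -1;

  /* Indices from that point on must be consecutive.  */
  for (int i = point + 1; i < n; i++)
    if (sel[i] != sel[i - 1] + 1)
      return -1;

  /* Slide the second input up so that its last selected element
     lands in the last position of the result.  */
  return n - 1 - (sel[n - 1] - n);
}
```

For MASK_64 with n = 64 the last index is 107, so the sketch yields 64 - 1 - (107 - 64) = 20, which agrees with the immediate of vslideup.vi in the sequence after this patch.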

Before this patch:
         lui     a4,%hi(.LANCHOR0)
         addi    a4,a4,%lo(.LANCHOR0)
         vl4re8.v        v4,0(a4)
         li      a4,64
         vsetvli a5,zero,e8,m4,ta,mu
         vl4re8.v        v20,0(a0)
         vl4re8.v        v16,0(a1)
         vmv.v.x v12,a4
         vrgather.vv     v8,v20,v4
         vmsgeu.vv       v0,v4,v12
         vsub.vv v4,v4,v12
         vrgather.vv     v8,v16,v4,v0.t
         vs4r.v  v8,0(a2)
         ret

After this patch:
        lui     a4,%hi(.LANCHOR0)
        addi    a4,a4,%lo(.LANCHOR0)
        vsetvli a5,zero,e8,m4,ta,ma
        vl4re8.v        v12,0(a1)
        vl4re8.v        v8,0(a0)
        vlm.v   v0,0(a4)
        vslideup.vi     v4,v12,20
        vcompress.vm    v4,v8,v0
        vs4r.v  v4,0(a2)
        ret
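
As a sanity check on why the slide-up plus compress pair can stand in for the shuffle, here is a minimal scalar model. It is only a sketch: the function names are hypothetical, the mask is modeled as one byte per element rather than the packed bits loaded by vlm.v, and it assumes that destination elements beyond the packed ones keep the values written by the slide-up step.

```c
#include <stdint.h>
#include <string.h>

enum { N = 64 };

/* Scalar model of the two-input shuffle: indices below N pick from a,
   the remaining indices pick from b.  */
void
model_shuffle (const int8_t *a, const int8_t *b, const int *sel, int8_t *out)
{
  for (int i = 0; i < N; i++)
    out[i] = sel[i] < N ? a[sel[i]] : b[sel[i] - N];
}

/* Scalar model of slide-up followed by compress (illustration only).
   mask[i] != 0 means element i of a is selected.  */
void
model_slideup_compress (const int8_t *a, const int8_t *b,
                        const uint8_t *mask, int slide, int8_t *out)
{
  /* Slide-up step: out[i] = b[i - slide] for i >= slide.  The elements
     below slide are zeroed here only to keep the model deterministic.  */
  memset (out, 0, N);
  for (int i = slide; i < N; i++)
    out[i] = b[i - slide];

  /* Compress step: pack the selected elements of a at the front of out.
     Assumption: the remaining elements of out are left unchanged.  */
  int k = 0;
  for (int i = 0; i < N; i++)
    if (mask[i])
      out[k++] = a[i];
}
```

With the mask built from the MASK_64 indices below 64 and a slide amount of 20, both functions should produce the same 64 bytes for any inputs.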

gcc/ChangeLog:

         * config/riscv/riscv-protos.h (enum insn_type): Add vcompress
         optimization.
         * config/riscv/riscv-v.cc (emit_vlmax_compress_insn): Ditto.
         (shuffle_compress_patterns): Ditto.
         (expand_vec_perm_const_1): Ditto.

gcc/testsuite/ChangeLog:

         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress-1.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress-2.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress-3.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress-4.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress-5.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress-6.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-1.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-2.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-3.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-4.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-5.c: New test.
         * gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-6.c: New test.

I had to look at this a few times, but I think that's because my thinking has been polluted by another vector architecture's handling of compressed vectors. What you're doing looks quite reasonable.

OK for the trunk.

jeff
