https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112431

--- Comment #4 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Pan Li <pa...@gcc.gnu.org>:

https://gcc.gnu.org/g:bdad036da32f72b84a96070518e7d75c21706dc2

commit r14-5960-gbdad036da32f72b84a96070518e7d75c21706dc2
Author: Juzhe-Zhong <juzhe.zh...@rivai.ai>
Date:   Wed Nov 29 16:34:10 2023 +0800

    RISC-V: Support highpart register overlap for vwcvt

    Since Richard recently added support for register filters, we are able
    to support highpart register overlap for widening RVV instructions.

    This patch supports it for the vwcvt intrinsics.

    I use real user application code as the starting point for vwcvt:
    https://github.com/riscv/riscv-v-spec/issues/929
    https://godbolt.org/z/xoeGnzd8q

    This is real application code that uses LMUL = 8 with unrolling to gain
    optimal performance for a specific library.

    You can see in the codegen that GCC generates optimal code for it, since
    we already support lowpart register overlap for narrowing instructions
    (dest EEW < source EEW).

    With this patch, we start to support highpart register overlap for
    widening instructions (dest EEW > source EEW).
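
As a rough illustration of the rule this relies on: the RVV specification allows a widening destination register group to overlap its source only when the source has EMUL of at least 1 and sits in the highest-numbered part of the destination group. The helper below is a minimal sketch of that check in plain C. The name widening_overlap_ok and its parameters are hypothetical and are not part of GCC or of this patch; register groups are modeled as a base register number plus an integer EMUL, so fractional EMUL is not covered.

```c
#include <stdbool.h>

/* Hypothetical helper (not GCC code): returns true if a widening
   instruction may use destination group [dest, dest + dest_emul) together
   with source group [src, src + src_emul).  EMUL values are assumed to be
   integers (1, 2, 4 or 8), with dest_emul == 2 * src_emul for vwcvt.  */
static bool
widening_overlap_ok (int dest, int dest_emul, int src, int src_emul)
{
  int dest_end = dest + dest_emul; /* One past the last dest register.  */
  int src_end = src + src_emul;    /* One past the last source register.  */

  /* Disjoint register groups are always allowed.  */
  if (src_end <= dest || dest_end <= src)
    return true;

  /* Overlapping groups: the source must have EMUL >= 1 and occupy the
     highest-numbered registers of the destination group.  */
  return src_emul >= 1 && src >= dest && src_end == dest_end;
}
```

Under this sketch, an m8 destination starting at v8 with an m4 source at v12 is accepted (highpart overlap), while the same destination with a source at v8 is rejected (lowpart overlap), and fully disjoint groups are accepted.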

    The same intrinsic code as above, adapted for vwcvt:

    https://godbolt.org/z/1TMPE5Wfr

    size_t
    foo (char const *buf, size_t len)
    {
      size_t sum = 0;
      size_t vl = __riscv_vsetvlmax_e8m8 ();
      size_t step = vl * 4;
      const char *it = buf, *end = buf + len;
      for (; it + step <= end;)
        {
          vint8m4_t v0 = __riscv_vle8_v_i8m4 ((void *) it, vl);
          it += vl;
          vint8m4_t v1 = __riscv_vle8_v_i8m4 ((void *) it, vl);
          it += vl;
          vint8m4_t v2 = __riscv_vle8_v_i8m4 ((void *) it, vl);
          it += vl;
          vint8m4_t v3 = __riscv_vle8_v_i8m4 ((void *) it, vl);
          it += vl;

          asm volatile("nop" ::: "memory");
          vint16m8_t vw0 = __riscv_vwcvt_x_x_v_i16m8 (v0, vl);
          vint16m8_t vw1 = __riscv_vwcvt_x_x_v_i16m8 (v1, vl);
          vint16m8_t vw2 = __riscv_vwcvt_x_x_v_i16m8 (v2, vl);
          vint16m8_t vw3 = __riscv_vwcvt_x_x_v_i16m8 (v3, vl);

          asm volatile("nop" ::: "memory");
          size_t sum0 = __riscv_vmv_x_s_i16m8_i16 (vw0);
          size_t sum1 = __riscv_vmv_x_s_i16m8_i16 (vw1);
          size_t sum2 = __riscv_vmv_x_s_i16m8_i16 (vw2);
          size_t sum3 = __riscv_vmv_x_s_i16m8_i16 (vw3);

          sum += sumation (sum0, sum1, sum2, sum3);
        }
      return sum;
    }

    Before this patch:

    ...
    csrr    t0,vlenb
    ...
            vwcvt.x.x.v     v16,v8
            vwcvt.x.x.v     v8,v28
            vs8r.v  v16,0(sp)               ---> spill
            vwcvt.x.x.v     v16,v24
            vwcvt.x.x.v     v24,v4
            nop
            vsetvli zero,zero,e16,m8,ta,ma
            vmv.x.s a2,v16
            vl8re16.v       v16,0(sp)      --->  reload
    ...
    csrr    t0,vlenb
    ...

    You can see heavy spilling and reloading inside the loop body.

    After this patch:

    ...
            vwcvt.x.x.v     v8,v12
            vwcvt.x.x.v     v16,v20
            vwcvt.x.x.v     v24,v28
            vwcvt.x.x.v     v0,v4
    ...

    The codegen is optimal after this patch.

    Tested on zvl128b with no regressions.

    I am going to test zve64d/zvl256b/zvl512b/zvl1024b.

    OK for trunk if there are no regressions in the testing above?

    Co-authored-by: kito-cheng <kito.ch...@sifive.com>
    Co-authored-by: kito-cheng <kito.ch...@gmail.com>

            PR target/112431

    gcc/ChangeLog:

            * config/riscv/constraints.md (TARGET_VECTOR ? V_REGS : NO_REGS):
            New register filters.
            * config/riscv/riscv.md (no,W21,W42,W84,W41,W81,W82): Ditto.
            (no,yes): Ditto.
            * config/riscv/vector.md: Support highpart register overlap for
            vwcvt.

    gcc/testsuite/ChangeLog:

            * gcc.target/riscv/rvv/base/pr112431-1.c: New test.
            * gcc.target/riscv/rvv/base/pr112431-2.c: New test.
            * gcc.target/riscv/rvv/base/pr112431-3.c: New test.
