https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119911

            Bug ID: 119911
           Summary: [RVV] Suboptimal code generation when extracting
                    the 0-th element of a vector multiple times
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

I observed the issue on GCC 14.2, and it is still visible with trunk on godbolt,
which is 16.0.0 20250423 (experimental).

Summary: when we have multiple `vmv.x.s` instructions (move the 0th vector
element into a scalar register), GCC always emits a shift-left followed by a
shift-right to mask the result down to its lower bits (8 or 16, depending on
the element width). However, when there are multiple `vmv.x.s` instances, it
would be profitable to materialize the mask in a register once (it is a
compile-time constant) and mask each result with a bitwise AND.

Clang performs this optimization.

Consider this simple function:

---test.cpp---
#include <riscv_vector.h>
#include <cstdint>

uint64_t sum_of_first_three(vuint16m1_t x) {
    const uint64_t mask = 0xffff;
    const auto vl = __riscv_vsetvlmax_e16m1();
    return uint64_t(__riscv_vmv_x_s_u16m1_u16(x))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 1, vl)))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 2, vl)));
}
---eof---

When compiled with `-O3 -march=rv64gcv`, the assembly is:

---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s a5,v8
        vmv.x.s a4,v10
        vmv.x.s a0,v9
        slli    a4,a4,48
        slli    a5,a5,48
        srli    a4,a4,48
        srli    a5,a5,48
        slli    a0,a0,48
        add     a5,a5,a4
        srli    a0,a0,48
        add     a0,a5,a0
        ret
---
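
For illustration, here is a hand-written sketch of the AND-based sequence I
have in mind (this is not actual compiler output; the register allocation and
the `li` pseudo-instruction used to build the 0xffff constant are only an
assumption):

---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s a5,v8
        vmv.x.s a4,v10
        vmv.x.s a0,v9
        li      a3,65535        # materialize the 0xffff mask once
        and     a5,a5,a3        # zero-extend each 16-bit element
        and     a4,a4,a3
        and     a0,a0,a3
        add     a5,a5,a4
        add     a0,a5,a0
        ret
---

With three extractions the six slli/srli instructions become three and
instructions plus a one-time mask setup (li may expand to lui+addi), and the
saving grows with every additional extraction.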

godbolt link: https://godbolt.org/z/hPrM8vz4v
