https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119911
Bug ID: 119911
Summary: [RVV] Suboptimal code generation for multiple extracting 0-th elements of vector
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wojciech_mula at poczta dot onet.pl
Target Milestone: ---

I observed the issue on GCC 14.2, but it is still visible on the godbolt trunk, which is 16.0.0 20250423 (experimental).

Summary: when there are multiple `vmv.x.s` instructions (move the 0-th vector element into a scalar register), GCC always emits a shift-left followed by a shift-right to mask the result down to its lower bits (e.g. 8 or 16). However, when there are several `vmv.x.s` instances, it would be profitable to materialize the mask in a register once (it is a compile-time constant) and use a bit-and for the masking; see the sketch at the end of this report. Clang performs this optimization.

Consider this simple function:

---test.cpp---
#include <riscv_vector.h>
#include <cstdint>

uint64_t sum_of_first_three(vuint16m1_t x) {
    const uint64_t mask = 0xffff;
    const auto vl = __riscv_vsetvlmax_e16m1();

    return uint64_t(__riscv_vmv_x_s_u16m1_u16(x))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 1, vl)))
         + uint64_t(__riscv_vmv_x_s_u16m1_u16(__riscv_vslidedown(x, 2, vl)));
}
---eof---

When compiled with `-O3 -march=rv64gcv`, the assembly is:

---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli         a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s         a5,v8
        vmv.x.s         a4,v10
        vmv.x.s         a0,v9
        slli            a4,a4,48
        slli            a5,a5,48
        srli            a4,a4,48
        srli            a5,a5,48
        slli            a0,a0,48
        add             a5,a5,a4
        srli            a0,a0,48
        add             a0,a5,a0
        ret
---

godbolt link: https://godbolt.org/z/hPrM8vz4v
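
For comparison, the mask-based sequence could look roughly like the following. This is a hand-written sketch for illustration only (not Clang's actual output; register choices are arbitrary): the 0xffff constant is materialized once and each extracted element is masked with a single `and` instead of an slli/srli pair.

---
sum_of_first_three(__rvv_uint16m1_t):
        vsetvli         a5,zero,e16,m1,ta,ma
        vslidedown.vi   v10,v8,1
        vslidedown.vi   v9,v8,2
        vmv.x.s         a5,v8
        vmv.x.s         a4,v10
        vmv.x.s         a0,v9
        lui             a3,16           # a3 = 0x10000
        addi            a3,a3,-1        # a3 = 0xffff, built once
        and             a5,a5,a3        # zero-extend each element with one and
        and             a4,a4,a3
        and             a0,a0,a3
        add             a5,a5,a4
        add             a0,a5,a0
        ret
---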