[Bug target/114180] New: RISC-V: missing vsetvl changes tail policy and causes wrong codegen

2024-02-29 Thread camel-cdr at protonmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114180

Bug ID: 114180
   Summary: RISC-V: missing vsetvl changes tail policy and causes
wrong codegen
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: camel-cdr at protonmail dot com
  Target Milestone: ---

There is a codegen bug for RVV intrinsics in gcc 13.2.0, tested on native
hardware and with cross-compilers.
It's fixed on trunk, but I'm not sure what your policies on back-porting
fixes are.

I wasn't able to find an existing bug report that matches the bug, but I think
this patch might be the one that fixed it in trunk:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643934.html

Reproduction:

$ cat test.c
#include <riscv_vector.h>

int validate_ascii(const char *buf, size_t len) {
  size_t vlmax = __riscv_vsetvlmax_e8m8();
  vint8m8_t mask = __riscv_vmv_v_x_i8m8(0, vlmax);
  for (size_t vl; len > 0; len -= vl, buf += vl) {
    vl = __riscv_vsetvl_e8m8(len);
    vint8m8_t v = __riscv_vle8_v_i8m8((int8_t*)buf, vl);
    mask = __riscv_vor_vv_i8m8_tu(mask, mask, v, vl);
  }
  return __riscv_vfirst_m_b1(__riscv_vmslt_vx_i8m8_b1(mask, 0, vlmax), vlmax) < 0;
}
$ gcc-13 -march=rv64gcv -O2 -S test.c
$ cat test.s
validate_ascii:
        vsetvli  a4,zero,e8,m8,ta,ma
        vmv.v.i  v24,0
        beq      a1,zero,.L2
.L3:
        vsetvli  a5,a1,e8,m8,ta,ma   # <--- should be tu,ma
        sub      a1,a1,a5
        vle8.v   v8,0(a0)
        add      a0,a0,a5
        vor.vv   v24,v24,v8
        bne      a1,zero,.L3
        vsetvli  a4,zero,e8,m8,ta,ma
.L2:
        vmslt.vi v8,v24,0
        vfirst.m a0,v8
        srli     a0,a0,63
        ret

(output slightly cleaned up, and annotated)

See also, for online reproduction: https://godbolt.org/z/jsbT4dErs
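
To make the failure mode concrete: the _tu intrinsic requires that elements
past vl keep their previously accumulated values, because the final
vmslt/vfirst pass reads the full vlmax-wide register, while the ta vsetvli
gcc emits allows the hardware to clobber exactly those elements. A minimal
scalar model of the two policies (an illustration, not code from the report;
VLMAX is an assumed element count):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VLMAX 16 /* assumed number of elements in the m8 register group */

/* tail-undisturbed (tu): dst[vl..VLMAX-1] keep their previous values */
static void vor_tu(int8_t dst[VLMAX], const int8_t b[VLMAX], size_t vl) {
    for (size_t i = 0; i < vl; i++)
        dst[i] |= b[i];
}

/* tail-agnostic (ta): dst[vl..VLMAX-1] may take any value; filling them
 * with all-ones is allowed and would plant sign bits in the tail of
 * `mask`, making the final vmslt/vfirst report a false negative byte */
static void vor_ta(int8_t dst[VLMAX], const int8_t b[VLMAX], size_t vl) {
    for (size_t i = 0; i < vl; i++)
        dst[i] |= b[i];
    memset(dst + vl, 0xff, VLMAX - vl);
}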

[Bug target/114194] New: ICE when using std::unique_ptr with xtheadvector

2024-03-01 Thread camel-cdr at protonmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114194

Bug ID: 114194
   Summary: ICE when using std::unique_ptr with xtheadvector
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: camel-cdr at protonmail dot com
  Target Milestone: ---

Using std::unique_ptr with xtheadvector enabled causes an ICE:

#include <memory>
#include <cstddef>
/* the unique_ptr template argument was lost in the archive; int is a stand-in */
extern void use(std::unique_ptr<int> &x);
void test(size_t n) {
    std::unique_ptr<int> x;
    use(x);
}

See also: https://godbolt.org/z/6nbhxKdfd

I've managed to reduce the problem to the following set of templates, but I
have no idea how this could cause the ICE.

struct S1 { int x; };
struct S2 { constexpr S2() { } template<typename T> S2(T&); };
struct S3 : S1, S2 { constexpr S3() : S1() { } S3(S3&); };
void f(S3 &) { S3 x; f(x); }


See also: https://godbolt.org/z/5YxM6jd3s

It's extremely brittle: the ICE goes away if you remove the constexpr, the
reference, or any other part I could think of.

[Bug target/114686] New: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension

2024-04-10 Thread camel-cdr at protonmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686

Bug ID: 114686
   Summary: Feature request: Dynamic LMUL should be the default
for the RISC-V Vector extension
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: camel-cdr at protonmail dot com
  Target Milestone: ---

Currently, the default value for -mrvv-max-lmul is "m1"; it should be
"dynamic" instead, for the following reasons:

All currently available RVV implementations benefit from using the largest
LMUL when possible (C906, C908, C920, ara, bobcat; see also this comment about
the SiFive cores:
https://gcc.gnu.org/pipermail/gcc-patches/2024-February/644676.html)

Some benefit to the degree that you basically always waste 50% of the
performance by using LMUL=1 instead of LMUL=2 or above, as you can see here for
the C908: https://camel-cdr.github.io/rvv-bench-results/canmv_k230/index.html

I don't see any reason why this wouldn't be the case for the vast majority of
implementations; especially high-performance ones would benefit from having
more work to saturate the execution units with, since a larger LMUL works much
like loop unrolling.
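
To illustrate the analogy, here is the same strip-mined loop at LMUL=1 and
LMUL=4 (a sketch using the standard RVV intrinsics, not code from any of the
bugs): vsetvl can hand the m4 version a vl up to four times larger, so every
instruction covers four registers' worth of elements, much like a 4x-unrolled
loop.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

void copy_m1(int8_t *dst, const int8_t *src, size_t n) {
    for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
        vl = __riscv_vsetvl_e8m1(n);        /* vl <= VLEN/8 */
        vint8m1_t v = __riscv_vle8_v_i8m1(src, vl);
        __riscv_vse8_v_i8m1(dst, v, vl);
    }
}

void copy_m4(int8_t *dst, const int8_t *src, size_t n) {
    for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
        vl = __riscv_vsetvl_e8m4(n);        /* vl <= 4*VLEN/8 */
        vint8m4_t v = __riscv_vle8_v_i8m4(src, vl);
        __riscv_vse8_v_i8m4(dst, v, vl);
    }
}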

Also consider that using a lower LMUL than possible makes mask instructions
more expensive, because they execute more frequently. For any LMUL/SEW
combination the mask fits into a single LMUL=1 vector register and can thus
(usually) execute in the same number of cycles regardless of LMUL. So in a loop
with LMUL=4 the mask operations cost a quarter as much per element as with
LMUL=1, because they occur a quarter as often.
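
For example (again a sketch, not from the report): in this e8m4 loop the
comparison produces a single vbool2_t mask covering a four-register group, and
vcpop consumes it once per that whole group, so per element the mask
instructions execute a quarter as often as in an e8m1 version of the same loop.

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

size_t count_negative(const int8_t *buf, size_t len) {
    size_t count = 0;
    for (size_t vl; len > 0; len -= vl, buf += vl) {
        vl = __riscv_vsetvl_e8m4(len);
        vint8m4_t v = __riscv_vle8_v_i8m4(buf, vl);
        /* one mask op per four registers' worth of elements */
        vbool2_t m = __riscv_vmslt_vx_i8m4_b2(v, 0, vl);
        count += __riscv_vcpop_m_b2(m, vl);
    }
    return count;
}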


Notes:

The vrgather.vv instruction should be exempt from this, because an LMUL=8
vrgather.vv is far more powerful than eight LMUL=1 vrgather.vv instructions,
and thus disproportionately complex to implement. When you don't need to cross
lanes, it's possible to unroll LMUL=1 vrgathers manually instead of choosing a
higher LMUL, as sketched below.
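
A sketch of that manual unrolling (my illustration; it assumes every index
stays within its own LMUL=1 register, i.e. no lane crossing):

#include <riscv_vector.h>
#include <stddef.h>

/* permute an LMUL=4 register group with four LMUL=1 vrgathers instead of
 * one LMUL=4 vrgather; vl1 is the e8m1 vector length, idx holds the
 * per-register element indices */
static vint8m4_t gather_inlane_m4(vint8m4_t v, vuint8m1_t idx, size_t vl1) {
    vint8m1_t g0 = __riscv_vrgather_vv_i8m1(__riscv_vget_v_i8m4_i8m1(v, 0), idx, vl1);
    vint8m1_t g1 = __riscv_vrgather_vv_i8m1(__riscv_vget_v_i8m4_i8m1(v, 1), idx, vl1);
    vint8m1_t g2 = __riscv_vrgather_vv_i8m1(__riscv_vget_v_i8m4_i8m1(v, 2), idx, vl1);
    vint8m1_t g3 = __riscv_vrgather_vv_i8m1(__riscv_vget_v_i8m4_i8m1(v, 3), idx, vl1);
    vint8m4_t r = __riscv_vundefined_i8m4();
    r = __riscv_vset_v_i8m1_i8m4(r, 0, g0);
    r = __riscv_vset_v_i8m1_i8m4(r, 1, g1);
    r = __riscv_vset_v_i8m1_i8m4(r, 2, g2);
    r = __riscv_vset_v_i8m1_i8m4(r, 3, g3);
    return r;
}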

Here are throughput measurements on some existing implementations:
         VLEN  e8m1  e8m2  e8m4  e8m8
c906     128   4     16    64    256
c908     128   4     16    64.9  261.1
c920     128   0.5   2.4   8.0   32.0
bobcat*  256   68    132   260   516
x280*    512   65    129   257   513

*bobcat: note that it was explicitly stated that they didn't optimize the
 permutation instructions
*x280: the numbers are from llvm-mca, but I was told they match reality. There
   is also supposed to be a vrgather fast path for vl<=256. I don't think there
   was much incentive to make this fast, as the x280 mostly targets AI.

vcompress.vm doesn't scale linearly with LMUL on the XuanTie chips either, but
a better implementation is conceivable, because the work can be better
distributed/subdivided. GCC currently doesn't seem to generate vcompress.vm via
auto-vectorization anyway: https://godbolt.org/z/Mb5Kba865
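
For reference, a minimal example of the kind of stream-compaction loop that
could in principle map to vcompress.vm (an illustrative sketch, not
necessarily the exact code behind the godbolt link):

#include <stddef.h>
#include <stdint.h>

/* keep only the non-negative bytes; a natural vcompress.vm candidate,
 * but gcc's auto-vectorizer currently doesn't emit vcompress.vm here */
size_t compact(int8_t *dst, const int8_t *src, size_t n) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        if (src[i] >= 0)
            dst[j++] = src[i];
    return j;
}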

[Bug target/119373] RISC-V: missed unrolling opportunity

2025-04-23 Thread camel-cdr at protonmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373

--- Comment #9 from camel-cdr  ---
Sorry, I missed that you attached the relevant C code.
Here is a side-by-side with and without -mrvv-max-lmul=dynamic:
https://godbolt.org/z/MToxx813v

Using LLVM-MCA as a quick-and-dirty performance model shows that this reduces
the cycle count by about 45%.

[Bug target/119373] RISC-V: missed unrolling opportunity

2025-04-23 Thread camel-cdr at protonmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373

camel-cdr  changed:

   What|Removed |Added

 CC||camel-cdr at protonmail dot com

--- Comment #8 from camel-cdr  ---
I think the premise that you want the loop explicitly unrolled is wrong; as
Robin wrote:

> Regarding unrolling: We cannot/do not unroll those length-controlled VLA loops.
> What we can do, trivially, is increase LMUL.

I don't have access to the source, but gcc should produce the expected codegen
with -mrvv-max-lmul=dynamic. I don't know why that isn't the default and why
gcc instead doesn't produce LMUL>1 by default, but that is another problem.