[Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V

rguenth at gcc dot gnu.org via Gcc-bugs Fri, 13 Sep 2024 00:51:10 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573


--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, it does seem to be a correctness issue as well: I see multiple execute
FAILs in the gcc.dg/vect testsuite when removing the check and running with
-march=rv64gcv.

So I would have expected, similar to how we handle
LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P, that .SELECT_VL usage is disabled by
stmt level analysis when it cannot be used (possibly by vect_record_loop_len
which _should_ have all the required information?).

That said, analyzing a FAIL would help, for example
gcc.dg/vect/vect-vfa-slp.c, which looks very simple.  It seems that we fail
to multiply the .SELECT_VL result by the SLP group size there?

.L8:
        vsetvli a5,a4,e16,m1,ta,ma
        vle16.v v1,0(a0)
        slli    a3,a5,2
        sub     a4,a4,a5
        add     a0,a0,a3
        vadd.vv v1,v1,v2
        vse16.v v1,0(a1)
        add     a1,a1,a3
        bne     a4,zero,.L8
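
The mismatch in the loop above can be reproduced with a scalar model.  This
is only a sketch under assumptions: select_vl is a hypothetical stand-in for
vsetvli/.SELECT_VL that returns min(AVL, VLMAX), VLMAX is taken as 8 (e16,
m1, VLEN=128), and the loop-invariant v2 is modelled as a two-element
pattern.  The first function follows the emitted code (vl elements are
processed, but the pointers advance by 2*vl elements); the second scales the
requested length by the group size of two.

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* assumption: e16, m1, VLEN=128 */

/* Hypothetical stand-in for vsetvli / .SELECT_VL: min(avl, VLMAX). */
static size_t select_vl(size_t avl)
{
    return avl < VLMAX ? avl : VLMAX;
}

/* Mirrors .L8: n counts scalar iterations, each covering two int16
   elements.  Only vl elements are touched per pass, yet the pointers
   move by 2*vl elements (slli a3,a5,2 is vl*4 bytes). */
static void loop_as_emitted(int16_t *dst, const int16_t *src,
                            const int16_t c[2], size_t n)
{
    size_t remaining = n;
    while (remaining != 0) {
        size_t vl = select_vl(remaining);        /* vsetvli a5,a4,... */
        for (size_t i = 0; i < vl; i++)          /* vle16 / vadd / vse16 */
            dst[i] = (int16_t)(src[i] + c[i % 2]);
        src += 2 * vl;                           /* add a0,a0,a3 */
        dst += 2 * vl;                           /* add a1,a1,a3 */
        remaining -= vl;                         /* sub a4,a4,a5 */
    }
}

/* Same loop with the length request scaled by the group size.  The
   c[i % 2] indexing relies on every vl being even, which holds for
   this min() model because both 2*n and VLMAX are even. */
static void loop_scaled(int16_t *dst, const int16_t *src,
                        const int16_t c[2], size_t n)
{
    size_t remaining = 2 * n;                    /* counted in elements */
    while (remaining != 0) {
        size_t vl = select_vl(remaining);
        for (size_t i = 0; i < vl; i++)
            dst[i] = (int16_t)(src[i] + c[i % 2]);
        src += vl;
        dst += vl;
        remaining -= vl;
    }
}
```

With n = 4 the first variant updates only elements 0..3 of the 8 it should
cover, while the second updates all 8.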

So to preserve previous behavior, instead of checking for !slp we could
verify that each SLP instance only has single-lane operations (note though
that the stores and loads will still be represented as multi-lane, but
load-/store-lanes might work).  But as SLP instances can in principle
fork/merge, controlling this individually would make more sense.

I don't know what the constraints are for vsetvli, but for the above case
I think we'd want to feed it 2*a4 since we know we uniformly need two
elements per iteration per vector.  I'd hope that when feeding vsetvli,
for example, 12 as length, it will never actually set the length to an
odd number.  Similarly with three elements per iteration: when we feed it,
say, 9, how can we make sure it will set the length to a multiple of three?
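
The multiple-of-three concern can be made concrete with a small model.  As
far as the RVV 1.0 rules for vl are concerned, vl equals AVL when AVL <=
VLMAX, may be any value in [ceil(AVL/2), VLMAX] when AVL < 2*VLMAX, and is
VLMAX otherwise, so a request of 9 with VLMAX = 8 can come back as 8.  The
sketch below assumes the simplest permitted choice, min(AVL, VLMAX); both
helper names are hypothetical, and rounding down to whole groups is shown
only as one conceivable remedy, not something this comment proposes.

```c
#include <stddef.h>

/* Hypothetical model of the vl an implementation may return:
   min(avl, vlmax).  Real hardware may pick other values when
   vlmax < avl < 2*vlmax. */
static size_t model_vl(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}

/* Round a returned vl down to a whole number of SLP groups, so that
   no group is split across two loop iterations. */
static size_t whole_groups_vl(size_t vl, size_t group)
{
    return vl - vl % group;
}
```

For a group size of three, model_vl(9, 8) yields 8, which is not a multiple
of three; whole_groups_vl(8, 3) trims it to 6.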

It looks like you side-stepped all this by disabling .SELECT_VL with SLP.
As said, it should still work for uniform single-lane cases and for loads
when using load/store-lanes exclusively.  I can look into implementing that
as a check where we currently have the !slp check, though, as said, handling
the cases we could support more carefully would be nice, like the above case
with two elements.  As said, I don't know the constraints RVV places on
implementations here and the spec isn't exactly helpful either.
