[Bug target/118057] RISC-V: Can't vectorize load and store with zvl128b

2024-12-16 Thread rdapp.gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057

--- Comment #4 from rdapp.gcc at gmail dot com ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057
>
> --- Comment #3 from JuzheZhong  ---
> (In reply to Robin Dapp from comment #2)
>> I think depending on the performance of strided loads/stores this can be
>> profitable to vectorize.  Looks like we need loop versioning to account for
>> the possible aliasing but once this is out of the way we could be OK.
>> 
>> I have a local patch that uses strided stores here (in the limited example)
>> but that's GCC 16 material.
>
> I believe strided/indexed loads/stores are pretty expensive in most of the
> hardware. For example, we have tested 625 X264 reference.
>
> Clang uses indexed load/store to vectorize pixel_satd_8x4 whereas GCC is SLP
> vectorizing with small-length unit-stride load/store.
>
> In K1:
> gcc-14 real 24m2629, clang-20 real 30m51.174s.
>
> Big performance drop from gcc-14 to clang-20.
>
> Compile options: -march=rv64gcv_zvl256b -mrvv-vector-bits=zvl
> -mrvv-max-lmul=m2.

Yes, I agree that costing is not particularly easy here, in particular given
the fragmentation of the microarchitectures and their very different
performance characteristics.

On the other hand, we have a local patch that speeds up x264 SATD
significantly on our uarch with the help of strided loads.

My impression is that we surely don't want to use strided loads universally
and need to pay attention to make reasonable costing decisions, but there are
cases where they help.
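
For illustration, a minimal C sketch of the kind of access pattern I have in
mind (an assumed example, not the PR's testcase; the function name and types
are made up):

void
copy_column (unsigned char *dst, const unsigned char *src, long stride, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i * stride];  /* Strided-load candidate (vlse8).  */
}

As dst and src may alias we'd need versioning/a runtime alias check before
using a strided load for src and a unit-stride store for dst.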

[Bug tree-optimization/115340] Loop/SLP vectorization possible inefficiency

2025-01-08 Thread rdapp.gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340

--- Comment #6 from rdapp.gcc at gmail dot com ---
>> Another thought I had as we already know that SLP handles this more 
>> gracefully:
>> Would it make sense to "just" defer to BB vectorization and have loop
>> vectorization not do anything, provided we could detect the pattern with
>> certainty?  That would still be special casing the situation but potentially
>> less intrusive than "Hail Mary" unrolling.
>
> Yes, I would expect costing to ensure we don't loop vectorize it, but then
> we don't (and can't easily IMO) compare loop vectorization to
> basic-block vectorization after unrolling cost-wise, so ...

Sure, I see the predicament.

Just to make sure I understand:  If we performed virtual unrolling (and
re-rolling) in the vectorizer, couldn't we "natively" recognize and vectorize
this pattern, with no special handling necessary, if we e.g. attempted VF=16?

So the attempt to recognize this during early unroll would be a stopgap until
that's fully working and in place (which might take longer than one GCC cycle)?
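
For reference, a hypothetical loop shape such a VF=16 attempt could target
(an assumed example, not the PR's testcase; the name and constants are made
up):

void
expand4 (float *restrict dst, const float *restrict src)
{
  /* Trip count 4; each of the four store DRs steps by four floats.
     Taken together over the four iterations they cover dst[0..15]
     contiguously, so with (virtual) unrolling to VF=16 this would be
     one contiguous group of 16 stores.  */
  for (int i = 0; i < 4; i++)
    {
      dst[4 * i + 0] = src[i];
      dst[4 * i + 1] = src[i] * 2.f;
      dst[4 * i + 2] = src[i] * 3.f;
      dst[4 * i + 3] = src[i] * 4.f;
    }
}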

[Bug tree-optimization/115340] Loop/SLP vectorization possible inefficiency

2025-01-08 Thread rdapp.gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340

--- Comment #4 from rdapp.gcc at gmail dot com ---
> That said - if DR analysis could, say, "force" a particular VF where it
> knows that gaps are closed we might "virtually" unroll this and thus
> detect it as a group of contiguous 16 stores.  Now we'd need to do the
> same virtual unrolling for all other stmts of course.
>
> I think it would be easier if we'd somehow detect this situation beforehand
> and actually perform the unrolling - we might want to do it with a
> if (.LOOP_VECTORIZED (...)) versioning scheme though.  I do wonder how
> common such loops are though.
>
> It might be also possible to override cost considerations of early
> unrolling with -O3 (aka when vectorization is enabled) and when the
> number of iterations matches the gap of related DRs (but as said, it
> looks like a very special thing to do).
>
> That said - I do plan to change the vectorizer from iterating over modes
> to iterating over VFs which means we could perform the unrolling implied
> by the VF on the vectorizer IL (SLP) and (re-)perform group discovery
> afterwards.
>
> For a more general loop we'd essentially apply blocking with the desired
> VF, unroll that blocking loop and apply BB vectorization.
>
> So to make the point - I don't like how handling this special case within
> the current vectorizer framework pays off with the cost this will have
> (I'm not sure it's really feasible to add even).  Instead this looks
> like in need of a vectorization enablement pre-transform to me.

OK, sounds reasonable.  And yeah, I wouldn't claim this kind of loop is common;
it's obviously an x264 thing.  Perhaps it shows up in other codecs but I
haven't really checked.

Another thought I had as we already know that SLP handles this more gracefully:
Would it make sense to "just" defer to BB vectorization and have loop
vectorization not do anything, provided we could detect the pattern with
certainty?  That would still be special casing the situation but potentially
less intrusive than "Hail Mary" unrolling.
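
To illustrate what BB vectorization would see, here is a hypothetical fully
unrolled variant of such a loop (again an assumed shape, not the PR's code) -
16 contiguous stores in straight-line code that SLP can handle with
unit-stride vector stores:

void
expand4_unrolled (float *restrict dst, const float *restrict src)
{
  dst[0]  = src[0];        dst[1]  = src[0] * 2.f;
  dst[2]  = src[0] * 3.f;  dst[3]  = src[0] * 4.f;
  dst[4]  = src[1];        dst[5]  = src[1] * 2.f;
  dst[6]  = src[1] * 3.f;  dst[7]  = src[1] * 4.f;
  dst[8]  = src[2];        dst[9]  = src[2] * 2.f;
  dst[10] = src[2] * 3.f;  dst[11] = src[2] * 4.f;
  dst[12] = src[3];        dst[13] = src[3] * 2.f;
  dst[14] = src[3] * 3.f;  dst[15] = src[3] * 4.f;
}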

[Bug target/119373] RISC-V: missed unrolling opportunity

2025-04-24 Thread rdapp.gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373

--- Comment #10 from rdapp.gcc at gmail dot com ---
> Since the primary underlying scalar mode in the loop is DF, the autodetected
> vector mode returned by preferred_simd_mode is RVVM1DF. In comparison, AArch64
> picks VNx2DF, which allows the vectorisation factor to be 8. By choosing
> RVVMF8QI, RISC-V is restricted to VF = 4.

Generally we pick the largest type/mode that fits and supports all necessary
operations.  As we don't support unpacking one vector into two, RVVMF8QI is
the most "natural" mode we can pick.

> AArch64 saves 3 instructions per loop iteration thanks to its scaled 
> addressing
> mode. But that is not enough to explain that the same basic block accounts for
> 79% more dynamic instructions on RISC-V compared to AArch64.

How much is left after accounting for addressing modes?

> I manually set the unrolling factor to 2 in the RISC-V backend and re-ran the
> 510.parest_r benchmark. This resulted in an 11.5% decrease in DIC and a 4.5%
> increase in code size, which I deem a fair tradeoff in -Ofast. The produced
> assembly (see attachment) has a long epilogue but that should not be an issue
> for any nontrivial trip count.
>
> Do you think this is a desirable optimisation to pursue?

It depends; unrolling has several aspects to it.  Part of the icount decrease 
surely is due to decreased scalar-loop overhead, which won't matter on larger 
cores.

aarch64 has implicit 2x unrolling when extending via unpack_lo/hi, and as we 
don't have those, LMUL2 is what comes to mind in riscv land.  Whether that's 
faster or not depends on the uarch.
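
A minimal extending loop to make that concrete (an assumed example, the name
is made up):

void
widen_copy (unsigned short *restrict dst, const unsigned char *restrict src,
            int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}

aarch64 widens a full input vector of bytes via unpack_lo/unpack_hi into two
vectors of halfwords, i.e. it implicitly processes two result vectors per
iteration; the rough RVV equivalent would be extending into an LMUL2
destination from an LMUL1 source.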

On a real core I would expect the gather and the reduction to take most of the 
cycles.  If we assume an ideal OOO uarch that's fully pipelined, though, I 
can't see a lot of benefit in using LMUL2 here.  By the time we reach the 
gather in the second iteration the indices should already be available, so it 
could, ideally, issue just one cycle after the gather from the first 
iteration, effectively adding only a single cycle.  This assumption might be 
too idealistic but that's the general idea.

With a very slow ordered reduction (possibly not fully pipelined either) there 
might be a point in loading more data upfront.  Suppose the second one cannot 
start before the first one finishes - then we'd want it to operate on more data 
right away.

If it turns out that e.g. LMUL2 is profitable for extending loops as in this 
example (i.e. the same regime as aarch64's/x86's unpack) we could tie that to 
a dynamic LMUL setting (or even make it the default).  Right now I think we 
still have too little real-world data to decide properly.