https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373

--- Comment #10 from rdapp.gcc at gmail dot com ---
> Since the primary underlying scalar mode in the loop is DF, the autodetected
> vector mode returned by preferred_simd_mode is RVVM1DF. In comparison, AArch64
> picks VNx2DF, which allows the vectorisation factor to be 8. By choosing
> RVVMF8QI, RISC-V is restricted to VF = 4.

Generally we pick the largest type/mode that fits and supports all necessary 
operations.  As we don't support unpacking one vector into two, RVVMF8QI is the 
most "natural" mode we can pick.

> AArch64 saves 3 instructions per loop iteration thanks to its scaled 
> addressing
> mode. But that is not enough to explain that the same basic block accounts for
> 79% more dynamic instructions on RISC-V compared to AArch64.

How much is left after accounting for addressing modes?
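
For reference, this is the kind of per-element address computation that 
scaled addressing folds away (purely illustrative sequences, not taken from 
the attachment):

  /* One gathered element: x = base[idx], with idx a 64-bit index.  */
  double x = base[idx];

  /* aarch64 can fold the scaling into the load:
         ldr   d0, [x1, x2, lsl #3]
     riscv needs the shift and add as separate instructions:
         slli  t0, a2, 3
         add   t0, a1, t0
         fld   fa0, 0(t0)                                           */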

> I manually set the unrolling factor to 2 in the RISC-V backend and re-ran the
> 510.parest_r benchmark. This resulted in an 11.5% decrease in DIC and a 4.5%
> increase in code size, which I deem a fair tradeoff in -Ofast. The produced
> assembly (see attachment) has a long epilogue but that should not be an issue
> for any nontrivial trip count.
>
> Do you think this is a desirable optimisation to pursue?

It depends; unrolling has several aspects to it.  Part of the icount decrease 
is surely due to decreased scalar loop overhead, which won't matter on larger 
cores.

aarch64 gets implicit 2x unrolling when extending via unpack_lo/hi, and as we 
don't have those, LMUL2 is what comes to mind in riscv land.  Whether that's 
faster or not depends on the uarch.

On a real core I would expect the gather and the reduction to take most of the 
cycles.  If we assume an ideal OOO uarch that's fully pipelined, though, I 
can't see a lot of benefit in using LMUL2 here.  By the time we reach the 
gather in the second iteration its indices should already be available, so it 
could, ideally, issue just one cycle after the gather from the first 
iteration, effectively taking only a single additional cycle.  This assumption 
might be too idealistic, but that's the general idea.

With a very slow ordered reduction (possibly not fully pipelined either) there 
might be a point in loading more data upfront.  Suppose the second reduction 
cannot start before the first one finishes - then we'd want each one to 
operate on more data right away.
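
To make the LMUL2 idea concrete, here is a minimal sketch of such a 
gather-plus-ordered-reduction loop using the RVV C intrinsics.  The kernel 
shape (32-bit indices gathered into a double dot product) is an assumption 
for illustration, not the exact parest loop:

  #include <riscv_vector.h>
  #include <stdint.h>
  #include <stddef.h>

  /* s += val[i] * src[col[i]], ordered, with e64 data at LMUL2
     paired with e32 indices at LMUL1.  */
  double
  dot_gather_m2 (const double *val, const uint32_t *col,
                 const double *src, size_t n)
  {
    vfloat64m1_t acc = __riscv_vfmv_s_f_f64m1 (0.0, 1);
    for (size_t vl; n > 0; n -= vl, val += vl, col += vl)
      {
        vl = __riscv_vsetvl_e32m1 (n);
        vuint32m1_t ci = __riscv_vle32_v_u32m1 (col, vl);
        /* Widen the indices and scale them to byte offsets; indexed
           loads take byte offsets and there is no scaled addressing
           to fold the shift into.  */
        vuint64m2_t off
          = __riscv_vsll_vx_u64m2 (__riscv_vzext_vf2_u64m2 (ci, vl),
                                   3, vl);
        vfloat64m2_t x = __riscv_vluxei64_v_f64m2 (src, off, vl);
        vfloat64m2_t v = __riscv_vle64_v_f64m2 (val, vl);
        vfloat64m2_t p = __riscv_vfmul_vv_f64m2 (v, x, vl);
        /* One ordered reduction per LMUL2 vector: the slow vfredosum
           sees twice the data per issue.  */
        acc = __riscv_vfredosum_vs_f64m2_f64m1 (p, acc, vl);
      }
    return __riscv_vfmv_f_s_f64m1_f64 (acc);
  }

The e32/m1 to e64/m2 pairing plays the role of aarch64's unpack_lo/hi here: 
one index vector feeds a single twice-as-wide gather instead of two 
half-width ones.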

If it turns out that e.g. LMUL2 is profitable for extending loops as in this 
example (i.e. the same regime as aarch64's/x86's unpack), we could tie that to 
a dynamic LMUL setting (or even make it the default).  Right now I think we 
still have too little real-world data to decide properly.
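
For gathering that data, recent GCC already has a knob to experiment with 
(assuming a GCC 14-style toolchain; the option spelling has changed across 
versions):

  gcc -Ofast -march=rv64gcv -mrvv-max-lmul=m2 foo.c
  gcc -Ofast -march=rv64gcv -mrvv-max-lmul=dynamic foo.c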
