Since the primary underlying scalar mode in the loop is DF, the autodetected
vector mode returned by preferred_simd_mode is RVVM1DF. In comparison, AArch64
picks VNx2DF, which allows a vectorisation factor of 8. By choosing RVVMF8QI
for the loop's narrow elements, RISC-V is restricted to VF = 4.
Generally we pick the largest type/mode that fits and supports all necessary
operations. As we don't support unpacking one vector into two, RVVMF8QI is the
most "natural" mode we can pick.
AArch64 saves 3 instructions per loop iteration thanks to its scaled
addressing mode. But that alone is not enough to explain why the same basic
block accounts for 79% more dynamic instructions on RISC-V than on AArch64.
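
For illustration, here are the typical sequences for a single scaled double
load (an assumption about generic codegen, not the exact parest code):

  double load_scaled(const double *a, long i)
  {
      /* AArch64 folds the *8 scaling into the load itself:
       *     ldr  d0, [x0, x1, lsl #3]
       * Base RISC-V needs the shift and add as separate instructions:
       *     slli a1, a1, 3
       *     add  a0, a0, a1
       *     fld  fa0, 0(a0)
       */
      return a[i];
  }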
How much is left after accounting for addressing modes?
I manually set the unrolling factor to 2 in the RISC-V backend and re-ran the
510.parest_r benchmark. This resulted in an 11.5% decrease in the dynamic
instruction count (DIC) and a 4.5% increase in code size, which I deem a fair
tradeoff at -Ofast. The produced assembly (see attachment) has a long
epilogue, but that should not be an issue for any nontrivial trip count.
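
Conceptually the 2x-unrolled loop looks like the following C sketch
(hypothetical helper names; each chunk_sum call stands in for one vector
iteration's work, and the two independent accumulators are only legal because
-Ofast allows reassociation):

  #include <stddef.h>

  /* Hypothetical stand-in for one vector iteration's work. */
  static double chunk_sum(const double *p, size_t vl)
  {
      double s = 0.0;
      for (size_t j = 0; j < vl; j++)
          s += p[j];
      return s;
  }

  double unrolled_sum(const double *a, size_t n, size_t vl)
  {
      double s0 = 0.0, s1 = 0.0;        /* two independent accumulators */
      size_t i = 0;
      for (; i + 2 * vl <= n; i += 2 * vl) {
          s0 += chunk_sum(&a[i], vl);       /* first vector's worth  */
          s1 += chunk_sum(&a[i + vl], vl);  /* second vector's worth */
      }
      double s = s0 + s1;
      for (; i < n; i++)                /* the epilogue mentioned above */
          s += a[i];
      return s;
  }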
Do you think this is a desirable optimisation to pursue?
It depends; unrolling has several aspects to it. Part of the icount decrease
surely is due to decreased scalar loop overhead, which won't matter on larger
cores.
aarch64 has implicit 2x unrolling when extending via unpack_lo/hi, and as we
don't have those, LMUL2 is what comes to mind in riscv land. Whether that's
faster or not depends on the uarch.
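
For reference, a hedged intrinsics sketch of the gather-plus-reduction loop
at LMUL2 - every vector instruction now covers twice as many elements as at
LMUL1. This is a hand-written illustration, not what the vectorizer emits:

  #include <riscv_vector.h>
  #include <stddef.h>

  double dot_gather_m2(const double *a, const double *b,
                       const unsigned char *idx, size_t n)
  {
      double s = 0.0;
      for (size_t i = 0; i < n;) {
          size_t vl = __riscv_vsetvl_e64m2(n - i);                /* e64, m2 */
          vuint8mf4_t  vi = __riscv_vle8_v_u8mf4(&idx[i], vl);    /* QI indices */
          vuint64m2_t vix = __riscv_vzext_vf8_u64m2(vi, vl);      /* widen to DI */
          vuint64m2_t off = __riscv_vsll_vx_u64m2(vix, 3, vl);    /* byte offsets */
          vfloat64m2_t va = __riscv_vluxei64_v_f64m2(a, off, vl); /* the gather */
          vfloat64m2_t vb = __riscv_vle64_v_f64m2(&b[i], vl);
          vfloat64m2_t vp = __riscv_vfmul_vv_f64m2(va, vb, vl);
          vfloat64m1_t vs = __riscv_vfmv_s_f_f64m1(s, 1);
          /* ordered FP reduction, feeding s back in each iteration */
          vs = __riscv_vfredosum_vs_f64m2_f64m1(vp, vs, vl);
          s = __riscv_vfmv_f_s_f64m1_f64(vs);
          i += vl;
      }
      return s;
  }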
On a real core I would expect the gather and the reduction to take most of the
cycles. If we assume an ideal OOO uarch that's fully pipelined, though, I
can't see much benefit in using LMUL2 here. By the time we reach the gather in
the second iteration the indices should already be available, so it could,
ideally, issue just one cycle after the gather from the first iteration,
effectively costing only a single extra cycle. This assumption might be too
idealistic, but that's the general idea.
With a very slow ordered reduction (possibly not fully pipelined either) there
might be a point in loading more data upfront. Suppose the second reduction
cannot start before the first one finishes - then we'd want each reduction to
operate on more data right away.
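
To put made-up numbers on it: if an ordered DF reduction over VL elements
takes, say, 8 cycles and is not pipelined, two back-to-back LMUL1 iterations
serialize to 16 cycles of reduction alone, while a single LMUL2 reduction over
2*VL elements wins whenever it finishes in fewer than those 16 cycles.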
If it turns out that e.g. LMUL2 is profitable for extending loops like the one
in this example (i.e. the same regime as aarch64's/x86's unpack) we could tie
that to a dynamic LMUL setting (or even make it the default). Right now I
think we still have too little real-world data to decide properly.
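
(For concreteness, the knob I mean is GCC's -mrvv-max-lmul - assuming the
current option naming; it was previously spelled --param=riscv-autovec-lmul -
i.e. something like:

  gcc -Ofast -march=rv64gcv -mrvv-max-lmul=dynamic ...)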
--
Regards
Robin