Since the primary underlying scalar mode in the loop is DF, the autodetected
vector mode returned by preferred_simd_mode is RVVM1DF. In comparison, AArch64
picks VNx2DF, which allows a vectorisation factor of 8. By choosing RVVMF8QI
for the loop's narrow elements, RISC-V is restricted to VF = 4.
Generally we pick the largest type/mode that fits and supports all necessary
operations. As we don't support unpacking one vector into two, RVVMF8QI is the
most "natural" mode we can pick.
AArch64 saves 3 instructions per loop iteration thanks to its scaled
addressing mode. But that alone is not enough to explain why the same basic
block accounts for 79% more dynamic instructions on RISC-V than on AArch64.
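
For illustration, here are the typical sequences for a single scaled double
load (an assumption about generic codegen, not the exact parest code):

  double load_scaled(const double *a, long i)
  {
      /* AArch64 folds the *8 scaling into the load itself:
       *     ldr  d0, [x0, x1, lsl #3]
       * Base RISC-V needs the shift and add as separate instructions:
       *     slli a1, a1, 3
       *     add  a0, a0, a1
       *     fld  fa0, 0(a0)
       */
      return a[i];
  }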
How much is left after accounting for addressing modes?
I manually set the unrolling factor to 2 in the RISC-V backend and re-ran the
510.parest_r benchmark. This resulted in an 11.5% decrease in the dynamic
instruction count (DIC) and a 4.5% increase in code size, which I deem a fair
tradeoff at -Ofast. The produced assembly (see attachment) has a long
epilogue, but that should not be an issue for any nontrivial trip count.
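
Conceptually the 2x-unrolled loop looks like the following C sketch
(hypothetical helper names; each chunk_sum call stands in for one vector
iteration's work, and the two independent accumulators are only legal because
-Ofast allows reassociation):

  #include <stddef.h>

  /* Hypothetical stand-in for one vector iteration's work. */
  static double chunk_sum(const double *p, size_t vl)
  {
      double s = 0.0;
      for (size_t j = 0; j < vl; j++)
          s += p[j];
      return s;
  }

  double unrolled_sum(const double *a, size_t n, size_t vl)
  {
      double s0 = 0.0, s1 = 0.0;        /* two independent accumulators */
      size_t i = 0;
      for (; i + 2 * vl <= n; i += 2 * vl) {
          s0 += chunk_sum(&a[i], vl);       /* first vector's worth  */
          s1 += chunk_sum(&a[i + vl], vl);  /* second vector's worth */
      }
      double s = s0 + s1;
      for (; i < n; i++)                /* the epilogue mentioned above */
          s += a[i];
      return s;
  }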
Do you think this is a desirable optimisation to pursue?
It depends; unrolling has several aspects to it. Part of the icount decrease
surely is due to decreased scalar loop overhead, which won't matter on larger
cores.
aarch64 has implicit 2x unrolling when extending via unpack_lo/hi, and as we
don't have those, LMUL2 is what comes to mind in riscv land. Whether that's
faster or not depends on the uarch.
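
For reference, a hedged intrinsics sketch of the gather-plus-reduction loop
at LMUL2 - every vector instruction now covers twice as many elements as at
LMUL1. This is a hand-written illustration, not what the vectorizer emits:

  #include <riscv_vector.h>
  #include <stddef.h>

  double dot_gather_m2(const double *a, const double *b,
                       const unsigned char *idx, size_t n)
  {
      double s = 0.0;
      for (size_t i = 0; i < n;) {
          size_t vl = __riscv_vsetvl_e64m2(n - i);                /* e64, m2 */
          vuint8mf4_t  vi = __riscv_vle8_v_u8mf4(&idx[i], vl);    /* QI indices */
          vuint64m2_t vix = __riscv_vzext_vf8_u64m2(vi, vl);      /* widen to DI */
          vuint64m2_t off = __riscv_vsll_vx_u64m2(vix, 3, vl);    /* byte offsets */
          vfloat64m2_t va = __riscv_vluxei64_v_f64m2(a, off, vl); /* the gather */
          vfloat64m2_t vb = __riscv_vle64_v_f64m2(&b[i], vl);
          vfloat64m2_t vp = __riscv_vfmul_vv_f64m2(va, vb, vl);
          vfloat64m1_t vs = __riscv_vfmv_s_f_f64m1(s, 1);
          /* ordered FP reduction, feeding s back in each iteration */
          vs = __riscv_vfredosum_vs_f64m2_f64m1(vp, vs, vl);
          s = __riscv_vfmv_f_s_f64m1_f64(vs);
          i += vl;
      }
      return s;
  }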
On a real core I would expect the gather and the reduction to take most of the
cycles. If we assume an ideal OOO uarch that's fully pipelined, though, I
can't see much benefit in using LMUL2 here. By the time we reach the gather in
the second iteration the indices should already be available, so it could,
ideally, issue just one cycle after the gather from the first iteration,
effectively costing only a single extra cycle. This assumption might be too
idealistic, but that's the general idea.
With a very slow ordered reduction (possibly not fully pipelined either) there
might be a point in loading more data upfront. Suppose the second reduction
cannot start before the first one finishes - then we'd want each reduction to
operate on more data right away.
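
To put made-up numbers on it: if an ordered DF reduction over VL elements
takes, say, 8 cycles and is not pipelined, two back-to-back LMUL1 iterations
serialize to 16 cycles of reduction alone, while a single LMUL2 reduction over
2*VL elements wins whenever it finishes in fewer than those 16 cycles.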
If it turns out that e.g. LMUL2 is profitable for extending loops like the one
in this example (i.e. the same regime as aarch64's/x86's unpack) we could tie
that to a dynamic LMUL setting (or even make it the default). Right now I
think we still have too little real-world data to decide properly.
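
(For concreteness, the knob I mean is GCC's -mrvv-max-lmul - assuming the
current option naming; it was previously spelled --param=riscv-autovec-lmul -
i.e. something like:

  gcc -Ofast -march=rv64gcv -mrvv-max-lmul=dynamic ...)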
--
Regards
Robin