Re: [Bug target/119373] RISC-V: missed unrolling opportunity

2025-04-24 Thread Robin Dapp via Gcc-bugs

Since the primary underlying scalar mode in the loop is DF, the autodetected
vector mode returned by preferred_simd_mode is RVVM1DF. In comparison, AArch64
picks VNx2DF, which allows the vectorisation factor to be 8. By choosing
RVVMF8QI, RISC-V is restricted to VF = 4.


Generally we pick the largest type/mode that fits and supports all necessary 
operations.  As we don't support unpacking one vector into two, RVVMF8QI is the 
most "natural" mode we can pick.


AArch64 saves 3 instructions per loop iteration thanks to its scaled addressing 
mode.  But that is not enough to explain why the same basic block accounts for 
79% more dynamic instructions on RISC-V than on AArch64.


How much is left after accounting for addressing modes?


I manually set the unrolling factor to 2 in the RISC-V backend and re-ran the
510.parest_r benchmark.  This resulted in an 11.5% decrease in dynamic
instruction count and a 4.5% increase in code size, which I deem a fair
tradeoff at -Ofast.  The produced assembly (see attachment) has a long
epilogue, but that should not be an issue for any nontrivial trip count.

Do you think this is a desirable optimisation to pursue?


It depends; unrolling has several aspects to it.  Part of the icount decrease 
surely is due to decreased scalar loop overhead, which won't matter on larger 
cores.


aarch64 has implicit 2x unrolling when extending via unpack_lo/hi, and as we 
don't have those, LMUL2 is what comes to mind in riscv land.  Whether that's 
faster or not depends on the uarch.
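
To sketch the two widening styles (using the SVE ACLE and the RVV intrinsics 
merely as illustrations of the shapes, not as what GCC literally emits): on 
aarch64 one 32-bit vector is unpacked into two 64-bit vectors, i.e. an 
implicit 2x unroll, while on RISC-V a single widening convert yields one 
register group of twice the size, i.e. LMUL2.

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
/* One 32-bit input becomes two 64-bit results: the widening itself
   implies a 2x unroll.  */
svint64_t widen_lo (svint32_t v) { return svunpklo_s64 (v); }
svint64_t widen_hi (svint32_t v) { return svunpkhi_s64 (v); }
#elif defined(__riscv_vector)
#include <riscv_vector.h>
/* The same widening keeps a single result, but in a register group of
   twice the size (LMUL1 input -> LMUL2 result).  */
vint64m2_t widen (vint32m1_t v, size_t vl)
{
  return __riscv_vsext_vf2_i64m2 (v, vl);
}
#endif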


On a real core I would expect the gather and the reduction to take most of the 
cycles.  If we assume an ideal OOO uarch that's fully pipelined, I can't see a 
lot of benefit in using LMUL2 here, though.  By the time we reach the gather in 
the second iteration its indices should already be available, so it could, 
ideally, issue just one cycle after the gather from the first iteration, 
effectively taking only a single cycle.  This assumption might be too 
idealistic, but that's the general idea.


With a very slow ordered reduction (possibly not fully pipelined either) there 
might be a point in loading more data upfront.  Suppose the second reduction 
cannot start before the first one finishes: then we'd want each reduction to 
operate on more data right away.
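
For reference, the shape of the two choices with the RVV intrinsics (a sketch; 
the vectorizer of course emits this itself): an ordered reduction accumulates 
strictly in element order, so feeding it a wider LMUL2 group lets one reduction 
instruction consume twice the data per issue.

#include <riscv_vector.h>

/* LMUL1: one vfredosum reduces VLEN/64 doubles into acc.  */
vfloat64m1_t
red_m1 (vfloat64m1_t acc, vfloat64m1_t v, size_t vl)
{
  return __riscv_vfredosum_vs_f64m1_f64m1 (v, acc, vl);
}

/* LMUL2: the same instruction form reduces twice as many doubles.  */
vfloat64m1_t
red_m2 (vfloat64m1_t acc, vfloat64m2_t v, size_t vl)
{
  return __riscv_vfredosum_vs_f64m2_f64m1 (v, acc, vl);
}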


If it turns out that e.g. LMUL2 is profitable for extending loops as in this 
example (i.e. the same regime as aarch64's/x86's unpack), we could tie that to 
a dynamic LMUL setting (or even make it the default).  Right now I think we 
still have too little real-world data to decide properly.


--
Regards
Robin



Re: [Bug target/120067] New: RISC-V: x264 sub4x4_dct high icount

2025-05-02 Thread Robin Dapp via Gcc-bugs

This is reduced from 525.x264_r's 4th hottest block:
https://godbolt.org/z/KdWv1er6f

AArch64 assembly is clean and efficient (35 insns) while RISC-V's is long and
messy (114 insns).

The most obvious issue is that it keeps spilling and reloading the same data
from the stack.  Also, I do not understand why we need those vslidedown
instructions.

A quick look at the expand dump (see attachment) shows that the latter come
from VIEW_CONVERT_EXPR.

I will keep looking into this.


If it's just about "short and non-messy", just drop the -mrvv-vector-bits=zvl.  
Without it our code is similar to aarch64's.


There is indeed a problem with =zvl that we have been discussing in the 
patchwork sync call for a while already.  The issue is that we're handling VLS 
modes as VLA modes in certain situations, resulting in unfortunate codegen.  
The vector extracts (slidedowns) are one example of that.


The whole function is not very easy to vectorize efficiently, though.  QEMU 
icount will give a wrong impression here because segmented loads and stores 
are usually rather slow.  IMHO a good implementation will need to be longer 
and messier.
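
For context, a simplified sketch of the kind of kernel at issue (illustrative 
only, not the exact reduced testcase from the godbolt link): a 4x4 pixel 
difference over two strided inputs followed by a per-row butterfly on small 
fixed-size arrays.  The short, fixed trip counts are what pull in VLS modes, 
the VIEW_CONVERT_EXPRs between them, and strided/segment accesses for the rows.

#include <stdint.h>

void
sub4x4_sketch (int16_t dct[16], const uint8_t *pix1, int stride1,
               const uint8_t *pix2, int stride2)
{
  int16_t d[16];

  /* 4x4 residual: four pixels from each of four strided rows.  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      d[4 * i + j] = (int16_t) (pix1[i * stride1 + j]
                                - pix2[i * stride2 + j]);

  /* Per-row butterfly (illustrative only).  */
  for (int i = 0; i < 4; i++)
    {
      int s03 = d[4 * i + 0] + d[4 * i + 3];
      int s12 = d[4 * i + 1] + d[4 * i + 2];
      int d03 = d[4 * i + 0] - d[4 * i + 3];
      int d12 = d[4 * i + 1] - d[4 * i + 2];
      dct[4 * i + 0] = (int16_t) (s03 + s12);
      dct[4 * i + 2] = (int16_t) (s03 - s12);
      dct[4 * i + 1] = (int16_t) (2 * d03 + d12);
      dct[4 * i + 3] = (int16_t) (d03 - 2 * d12);
    }
}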


--
Regards
Robin