This is reduced from 525.x264_r's 4th hottest block:
https://godbolt.org/z/KdWv1er6f
AArch64 assembly is clean and efficient (35 insns) while RISC-V's is long and
messy (114 insns).
The most obvious issue is that it keeps spilling and reloading the same data
from the stack. Also, I do not understand why we need those vslidedown
instructions.
A quick look at the expand dump (see attachment) shows that the vslidedowns
come from VIEW_CONVERT_EXPR<vector(4) unsigned short>.
I will keep looking into this.
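
For illustration only (this is not the reduced test case from the link above),
here is a minimal C sketch of a 4-lane sub-vector extract from an 8-lane
unsigned short vector, written with the GNU C vector extensions. The names
v8hi, v4hi and extract4 are hypothetical. Whether GCC represents such an
extract as a VIEW_CONVERT_EXPR or in some other form depends on the source and
on the optimization passes, so treat this as a sketch of the operation rather
than of the exact GIMPLE in the dump.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* GNU C vector extension types; the type names are hypothetical. */
typedef uint16_t v8hi __attribute__((vector_size(16)));
typedef uint16_t v4hi __attribute__((vector_size(8)));

/* Hypothetical helper: copy four consecutive lanes of V, starting at lane
   FIRST_LANE, into a 4-lane vector.  */
static v4hi extract4(v8hi v, unsigned first_lane)
{
    v4hi r;

    /* Only lanes 0..7 exist, so the window must start at lane 4 or below. */
    assert(first_lane <= 4);
    memcpy(&r, (const unsigned char *)&v + first_lane * sizeof(uint16_t),
           sizeof r);
    return r;
}
```

An extract with first_lane == 0 is just the low part of the register, while a
nonzero first_lane requires moving elements down to lane 0, which is the job a
slidedown does on RVV.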
If it's just about "short and non-messy", just drop -mrvv-vector-bits=zvl.
Without it our code is similar to aarch64's.
There is indeed a problem with =zvl that we have been discussing in the
patchwork sync call for a while already. The issue is that we're handling VLS
modes as VLA modes in certain situations, which results in unfortunate codegen.
The vector extracts (slidedowns) are one example of it.
The whole function is not very easy to vectorize efficiently, though. qemu
icount will give a wrong impression here because segmented loads and stores
are usually rather slow, and an instruction count does not reflect that. IMHO
a good implementation will need to be longer and messier.
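
For readers less familiar with segmented accesses, here is a scalar C model of
what a two-field segmented load and store of 16-bit elements do (the semantics
of RVV's vlseg2e16.v and vsseg2e16.v). The function names are hypothetical and
the code models behaviour only, not cost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a two-field segmented load: memory holds the fields
   interleaved (a0 b0 a1 b1 ...), and each field lands in its own array.  */
static void seg2_load_u16(const uint16_t *src, size_t n,
                          uint16_t *field0, uint16_t *field1)
{
    assert(src && field0 && field1);
    for (size_t i = 0; i < n; i++) {
        field0[i] = src[2 * i];
        field1[i] = src[2 * i + 1];
    }
}

/* Scalar model of the matching segmented store: re-interleave the two
   fields into memory.  */
static void seg2_store_u16(uint16_t *dst, size_t n,
                           const uint16_t *field0, const uint16_t *field1)
{
    assert(dst && field0 && field1);
    for (size_t i = 0; i < n; i++) {
        dst[2 * i] = field0[i];
        dst[2 * i + 1] = field1[i];
    }
}
```

In the vector ISA each of these loops is a single instruction, so an
instruction count charges it the same as any other instruction regardless of
how long the de-interleaving takes on a given core.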
--
Regards
Robin