This is reduced from 525.x264_r's 4th hottest block:
https://godbolt.org/z/KdWv1er6f

AArch64 assembly is clean and efficient (35 insns) while RISC-V's is long and
messy (114 insns).

The most obvious issue is that it keeps spilling and reloading the same data
to and from the stack. I also do not understand why we need those vslidedown
instructions.

A quick look at the expand dump (see attachment) shows that the vslidedowns
come from VIEW_CONVERT_EXPR<vector(4) unsigned short>.

I will keep looking into this.

If it's just about "short and non-messy", just drop -mrvv-vector-bits=zvl. Without it our code is similar to aarch64's.

There is indeed a problem with =zvl that we have been discussing in the patchwork sync call for a while already. The issue is that we're handling VLS modes as VLA modes in certain situations, which results in unfortunate codegen. The vector extracts (slidedowns) are one example of it.
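
To illustrate the kind of vector extract meant here, a minimal sketch in C using GCC vector extensions follows. It is not the reduced testcase from the godbolt link: the type names v8hi and v4hi and the function high_half are made up for illustration, and whether a given compiler version lowers it to a vslidedown depends on the options used.

```c
#include <stdint.h>

/* Hypothetical fixed-length vector types (GCC/Clang vector extensions). */
typedef uint16_t v8hi __attribute__((vector_size(16)));  /* 8 x unsigned short */
typedef uint16_t v4hi __attribute__((vector_size(8)));   /* 4 x unsigned short */

/* Extract the upper four lanes of an eight-lane vector.
   On a vector target, taking lanes 4..7 typically means shifting them
   down to lanes 0..3 first, which is the job of a slidedown.  */
static v4hi high_half(v8hi v)
{
    v4hi hi = { v[4], v[5], v[6], v[7] };
    return hi;
}
```

Compiling such a function with and without -mrvv-vector-bits=zvl is one way to compare the two code shapes in isolation.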

The whole function is not very easy to vectorize efficiently, though. QEMU's icount gives a misleading impression here: it only counts instructions, and segmented loads and stores are usually rather slow. IMHO a good implementation will need to be longer and messier.

--
Regards
Robin
