[Bug target/120067] RISC-V: x264 sub4x4_dct high icount

rdapp.gcc at gmail dot com via Gcc-bugs Fri, 02 May 2025 11:38:49 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120067


--- Comment #2 from rdapp.gcc at gmail dot com ---
> This is reduced from 525.x264_r's 4th hottest block:
> https://godbolt.org/z/KdWv1er6f
>
> AArch64 assembly is clean and efficient (35 insns) while RISC-V's is long and
> messy (114 insns).
>
> The most obvious issue is that it keeps spilling and reloading the same data
> from the stack. Also I do not understand why we need those vslidedown.
>
> A rapid look at the expand dump (see attachment) shows that the latter come
> from VIEW_CONVERT_EXPR<vector(4) unsigned short>.
>
> I will keep looking into this.

If the goal is just "short and non-messy" code, drop -mrvv-vector-bits=zvl.
Without it, our code is similar to aarch64's.

There is indeed a problem with =zvl that we have been discussing in the
patchwork sync call for a while already.  The issue is that we're handling VLS
(fixed-length) modes as VLA (length-agnostic) modes in certain situations,
which results in unfortunate codegen.  The vector extracts (slidedowns) are
one example of it.

The whole function is not very easy to vectorize efficiently, though.  qemu
icount will give a wrong impression here: it only counts instructions, and
segmented loads and stores are usually rather slow.  IMHO a good
implementation will need to be longer and messier.
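
For reference, a minimal scalar sketch of the kernel in question, written from
memory of x264's reference C code and not taken from the reduced testcase
linked above.  The names sub4x4_dct_sketch and pixel_sub_4x4, the typedefs and
the explicit stride parameters are assumptions made for illustration.  Pass 1
reads d row by row but writes tmp column by column, i.e. there is a 4x4
transpose in the middle, which is presumably where strided/segmented accesses
or shuffles come from once this is vectorized.

```c
#include <stdint.h>

typedef uint8_t pixel;    /* assumption: 8-bit pixels */
typedef int16_t dctcoef;  /* assumption: 16-bit coefficients */

/* Residual of a 4x4 block, row-major: d = pix1 - pix2.
   Strides are in pixels (hypothetical parameters). */
static void pixel_sub_4x4(dctcoef d[16], const pixel *pix1, int stride1,
                          const pixel *pix2, int stride2)
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y * 4 + x] = (dctcoef)(pix1[y * stride1 + x]
                                     - pix2[y * stride2 + x]);
}

static void sub4x4_dct_sketch(dctcoef dct[16],
                              const pixel *pix1, int stride1,
                              const pixel *pix2, int stride2)
{
    dctcoef d[16];
    dctcoef tmp[16];

    pixel_sub_4x4(d, pix1, stride1, pix2, stride2);

    /* Pass 1: butterfly each row of d, store the result transposed. */
    for (int i = 0; i < 4; i++) {
        int s03 = d[i * 4 + 0] + d[i * 4 + 3];
        int s12 = d[i * 4 + 1] + d[i * 4 + 2];
        int d03 = d[i * 4 + 0] - d[i * 4 + 3];
        int d12 = d[i * 4 + 1] - d[i * 4 + 2];

        tmp[0 * 4 + i] = (dctcoef)(s03 + s12);
        tmp[1 * 4 + i] = (dctcoef)(2 * d03 + d12);
        tmp[2 * 4 + i] = (dctcoef)(s03 - s12);
        tmp[3 * 4 + i] = (dctcoef)(d03 - 2 * d12);
    }

    /* Pass 2: same butterfly on each row of tmp, stored in row order. */
    for (int i = 0; i < 4; i++) {
        int s03 = tmp[i * 4 + 0] + tmp[i * 4 + 3];
        int s12 = tmp[i * 4 + 1] + tmp[i * 4 + 2];
        int d03 = tmp[i * 4 + 0] - tmp[i * 4 + 3];
        int d12 = tmp[i * 4 + 1] - tmp[i * 4 + 2];

        dct[i * 4 + 0] = (dctcoef)(s03 + s12);
        dct[i * 4 + 1] = (dctcoef)(2 * d03 + d12);
        dct[i * 4 + 2] = (dctcoef)(s03 - s12);
        dct[i * 4 + 3] = (dctcoef)(d03 - 2 * d12);
    }
}
```

With a constant residual only the DC coefficient is non-zero, which makes for
a quick sanity check of the sketch.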
