Re: [Bug target/120067] New: RISC-V: x264 sub4x4_dct high icount

Robin Dapp via Gcc-bugs Fri, 02 May 2025 11:38:16 -0700

This is reduced from 525.x264_r's 4th hottest block:
https://godbolt.org/z/KdWv1er6f


AArch64 assembly is clean and efficient (35 insns) while RISC-V's is long and
messy (114 insns).

The most obvious issue is that it keeps spilling and reloading the same data
from the stack. Also I do not understand why we need those vslidedown.

A rapid look at the expand dump (see attachment) shows that the latter come
from VIEW_CONVERT_EXPR<vector(4) unsigned short>.

I will keep looking into this.

If it's just about "short and non-messy", just drop -mrvv-vector-bits=zvl. Without it our code is similar to aarch64's.

There is indeed a problem with =zvl that we have been discussing in the patchwork sync call for a while already. The issue is that we're handling VLS modes as VLA modes in certain situations, resulting in unfortunate codegen. The vector extracts (slidedowns) are one example of it.

The whole function is not very easy to vectorize efficiently, though. qemu icount will give a wrong impression here because usually segmented loads and stores are rather slow. IMHO a good implementation will need to be longer and messier.
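
For reference, a minimal scalar sketch of the shape of such a 4x4 DCT, assuming the usual two-pass butterfly structure of x264's sub4x4_dct. It is not the reduced testcase from the godbolt link, and the name sub4x4_dct_ref, the int16_t coefficient type and the two stride constants are assumptions made for illustration. The point is only the access pattern: each pass reads a row contiguously but the first pass writes its results with a stride of 4, i.e. it transposes, which is roughly where strided or segmented accesses (or shuffles) come from.

```c
#include <stdint.h>

/* Assumed strides for the source and prediction buffers (illustrative). */
enum { FENC_STRIDE = 16, FDEC_STRIDE = 32 };

/* Hypothetical scalar reference: 4x4 residual followed by a two-pass
 * integer DCT.  Not the code from the bug report's testcase. */
void sub4x4_dct_ref(int16_t dct[16], const uint8_t *pix1, const uint8_t *pix2)
{
    int16_t d[16];
    int16_t tmp[16];

    /* Residual: difference of two 4x4 pixel blocks with different strides. */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y * 4 + x] = (int16_t)(pix1[y * FENC_STRIDE + x]
                                     - pix2[y * FDEC_STRIDE + x]);

    /* Pass 1: read row i contiguously, write column i (stride 4).
     * This transposition is the part that is awkward to vectorize. */
    for (int i = 0; i < 4; i++) {
        int s03 = d[i * 4 + 0] + d[i * 4 + 3];
        int s12 = d[i * 4 + 1] + d[i * 4 + 2];
        int d03 = d[i * 4 + 0] - d[i * 4 + 3];
        int d12 = d[i * 4 + 1] - d[i * 4 + 2];

        tmp[0 * 4 + i] = (int16_t)(s03 + s12);
        tmp[1 * 4 + i] = (int16_t)(2 * d03 + d12);
        tmp[2 * 4 + i] = (int16_t)(s03 - s12);
        tmp[3 * 4 + i] = (int16_t)(d03 - 2 * d12);
    }

    /* Pass 2: same butterfly on the transposed data, contiguous writes. */
    for (int i = 0; i < 4; i++) {
        int s03 = tmp[i * 4 + 0] + tmp[i * 4 + 3];
        int s12 = tmp[i * 4 + 1] + tmp[i * 4 + 2];
        int d03 = tmp[i * 4 + 0] - tmp[i * 4 + 3];
        int d12 = tmp[i * 4 + 1] - tmp[i * 4 + 2];

        dct[i * 4 + 0] = (int16_t)(s03 + s12);
        dct[i * 4 + 1] = (int16_t)(2 * d03 + d12);
        dct[i * 4 + 2] = (int16_t)(s03 - s12);
        dct[i * 4 + 3] = (int16_t)(d03 - 2 * d12);
    }
}
```

With only 16 elements per block and a transpose in the middle, there is little work to amortize the data movement over, so a low instruction count alone says little about how fast the result is on real hardware.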


--
Regards
Robin
