Since the primary underlying scalar mode in the loop is DF, the autodetected
vector mode returned by preferred_simd_mode is RVVM1DF. In comparison, AArch64
picks VNx2DF, which allows the vectorisation factor to be 8. By choosing
RVVMF8QI, RISC-V is restricted to VF = 4.
Generally we pick the lar
This is reduced from 525.x264_r's 4th hottest block:
https://godbolt.org/z/KdWv1er6f
The AArch64 assembly is clean and efficient (35 insns), while the RISC-V
assembly is long and messy (114 insns).
The most obvious issue is that the RISC-V code keeps spilling and reloading the same data
from the stack. Also I do not underst