Hi Zhili & Rémi Denis-Courmont,
Thank you very much, Zhao Zhili, for the helpful pointers! I’ll review the
latest RVV-related patches and pull requests on code.ffmpeg.org
<https://code.ffmpeg.org/> right away and study your excellent summary on
assembly optimization.
I also sincerely appreciate Rémi Denis-Courmont’s detailed feedback. In
response to the points you raised, I’d like to share a bit about our current
efforts:
* RISE Multimedia Group: We’ll reach out to our colleagues to check whether
  there are any ongoing initiatives within that community, to avoid
  duplication and explore potential collaboration.
* Segmented load/store performance: We’ve encountered similar bottlenecks in
  our video decoding optimizations. To address this, we’re proposing new
  vector instructions tailored for media workloads to the RISC-V
  International standards body. At the same time, we’re working closely with
  RISC-V CPU microarchitecture teams to improve the hardware efficiency of
  these memory operations.
* Scalable VLEN (vector register length) challenges: I fully agree with your
  observation. RVV’s scalability is meant to provide flexibility: ideally, a
  single optimized implementation should adapt gracefully across different
  VLEN configurations. In practice, however, video codecs like HEVC
  predominantly use fixed-size blocks (e.g., 4×4, 8×8). As a result, an
  algorithm optimized for VLEN=128 may not perform better, and may even
  regress, on a VLEN=256 system, despite the latter having higher theoretical
  compute throughput. This forces us to develop separate optimizations per
  VLEN, which undermines the original intent of RVV’s scalability. We believe
  there’s significant room for discussion and innovation here, both in
  software strategies and in hardware design.
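To make the fixed-block argument concrete, here is a minimal sketch in plain C. It assumes the standard RVV relation VLMAX = VLEN / SEW × LMUL (integer LMUL only) and a simplified vsetvli model where vl = min(AVL, VLMAX). The helper names (vlmax, granted_vl, lane_utilization_percent) are hypothetical and chosen for illustration only; the sketch counts lanes and says nothing about cycle counts on any real core.

```c
#include <assert.h>

/* VLMAX for an integer LMUL: (VLEN / SEW) * LMUL elements per register group. */
static unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul)
{
    assert(sew_bits != 0 && vlen_bits >= sew_bits && lmul != 0);
    return vlen_bits / sew_bits * lmul;
}

/* Simplified vsetvli model (an assumption of this sketch): vl = min(AVL, VLMAX). */
static unsigned granted_vl(unsigned avl, unsigned vlen_bits,
                           unsigned sew_bits, unsigned lmul)
{
    unsigned max = vlmax(vlen_bits, sew_bits, lmul);
    return avl < max ? avl : max;
}

/* Share of the available lanes that a request of AVL elements occupies. */
static unsigned lane_utilization_percent(unsigned avl, unsigned vlen_bits,
                                         unsigned sew_bits, unsigned lmul)
{
    unsigned max = vlmax(vlen_bits, sew_bits, lmul);
    return 100u * granted_vl(avl, vlen_bits, sew_bits, lmul) / max;
}
```

For one row of an 8×8 block of 16-bit coefficients (AVL = 8, SEW = 16, LMUL = 1), lane_utilization_percent gives 100 at VLEN=128, 50 at VLEN=256 and 25 at VLEN=512. If instruction cost follows VLMAX rather than vl, the wider machines do the same useful work per instruction while paying for more lanes.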
We recognize that the RISC-V vector ecosystem is still evolving rapidly.
Nevertheless, we’re confident that through close hardware-software co-design,
RVV can become highly competitive in video coding workloads over time. RISC-V’s
open nature makes it especially well-suited for such collaborative
improvements—and we warmly welcome any performance insights, suggestions, or
discussions from the community.
Thank you again for your valuable input and support!
Best regards,
Yunfei Zhou
Alibaba DAMO Academy
--
From: Rémi Denis-Courmont via ffmpeg-devel
Sent: Friday, 14 November 2025, 22:22
To: FFmpeg development discussions and patches
Cc: "[email protected]"; "Rémi Denis-Courmont"
Subject: [FFmpeg-devel] Re: [Question]Inquiry Regarding RISC-V RVV Optimization for
HEVC Decoding in FFmpeg
Nihao,
On 14 November 2025 03:52:51 GMT+02:00, yunfei_zhou--- via ffmpeg-devel
wrote:
>Before proceeding, we would like to understand whether there are any existing
>or ongoing efforts in this area to avoid duplication and, ideally, align or
>collaborate with current initiatives.
Existing code you can find in the official Git repo. Ongoing efforts are
unknown to us. You had probably better ask the RISE multimedia group than
FFmpeg-devel. I suppose you or one of your colleagues should have access. (I
don't think anyone here has.)
> * Available documentation or resources that could help us better understand the
>   existing codebase and optimization strategies.
To be honest, in my experience, while it is obviously possible to optimise
video decoding with RVV, the current implementations are not competitive (with
e.g. Armv8 AdvSIMD), mainly because of two aspects:
1) Segmented loads and stores are slow. Because video decoding often involves
transposition, we would really need segmented unit-strided accesses to run as
fast or almost as fast as single-segment unit-strided accesses of the same
size. Likewise, we need segmented register-strided accesses to be almost as fast
as single-segment register-strided accesses.
2) Because RVV is scalable, and video decoding uses a lot of fixed-size and/or
small vectors, we need instruction execution cost to scale according to VL or
next_power_of_two(VL). Currently it seems to scale according to VLMAX, which
means larger vectors make optimisations worse rather than better.
(This is based on benchmarks for your C910 and C908 cores, and SpacemiT's X60.
I don't have access to any other hardware at the moment.)
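As an illustration of why segmented accesses matter for transposition, here is a scalar C model of what a strided segment load with four 8-bit fields (vlsseg4e8.v) does architecturally. This is only a sketch of the data movement, not FFmpeg code and not RVV intrinsics; the function name model_vlsseg4e8 and the MAX_VL limit are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NF     4 /* fields per segment, as in vlsseg4e8.v */
#define MAX_VL 4 /* element capacity of each modelled register (assumption) */

/*
 * Scalar model of a strided segment load of NF 8-bit fields:
 * segment i starts at base + i * stride, and field f of segment i
 * lands in element i of destination register f.
 * With stride == NF this is the unit-strided form (vlseg4e8.v).
 */
static void model_vlsseg4e8(uint8_t regs[NF][MAX_VL], const uint8_t *base,
                            ptrdiff_t stride, size_t vl)
{
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++)
        for (size_t f = 0; f < NF; f++)
            regs[f][i] = base[(ptrdiff_t)i * stride + (ptrdiff_t)f];
}
```

Pointing base at the top-left sample of a 4x4 block and passing the picture line size as stride leaves column f of the block in register f, so the load itself performs the transpose. That is why the speed of these accesses relative to plain unit-strided ones weighs so heavily here.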
Point being, the available hardware seems a little bit immature, so we don't
really have settled optimisation strategies.
Br,
___
ffmpeg-devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]