Re: [PATCH][RFC] RISC-V: Allow FP strided broadcast from memory [PR121451]

Jeff Law Sat, 18 Oct 2025 01:36:19 -0700



On 10/3/25 8:03 AM, Robin Dapp wrote:


Is this the full BB?  Just wondering because I'm seeing 7 vlse64 in the new
block, 7 vfmv.f in the old block but only 6 flds.  This would imply that one
of the flds (that loads ft0) was outside the block before but now we're always
reloading it inside as opposed to just broadcasting?  Not saying that will
account for the icount difference but it is at least not a 1-1 translation like
"fld + vfmv.f = vlse64".

No. It's a QEMU translation block, so it can start and end at fairly
arbitrary locations. I'd be reasonably confident the nops+vsetvl at
the end represents the start of a loop. So we're likely in a loop
preheader of some kind (at unknown loop depth) falling into a deeper
loop nest. If we were to chase further I'd look at lower PCs rather
than higher PCs for the "missing" fld.

Which again looks exactly like one would expect from this optimization. I
haven't verified with 100% certainty, but I'm pretty sure the vectors in
question are full 8x64bit doubles based on finding what I'm fairly sure is
the vsetvl controlling these instructions.

I can only conclude that the optimization is behaving per design and
that our uarch isn't handling this idiom performantly in the FP domain.


Couldn't another interpretation be:
"An optimized zero strided load is faster than a regular strided load but still
slower than vfmv.f (+fld)"?

You could interpret it that way, though it's hard to see how that could
really account for the performance difference observed.
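
For readers following along, here is a rough sketch of why the two
sequences discussed above are expected to be interchangeable in terms
of the value produced. It is only a plain Python model, not RISC-V
code: the helper names are made up for illustration, memory is indexed
by element rather than by byte, and it says nothing about how either
form performs on real hardware.

```python
# Hypothetical model of the two broadcast idioms discussed in this thread.
# Memory is a Python list indexed by element (real strides are in bytes).

def strided_load(mem, base, stride, vl):
    """Model a strided vector load: element i comes from mem[base + i*stride]."""
    return [mem[base + i * stride] for i in range(vl)]

def scalar_load_then_splat(mem, base, vl):
    """Model the older sequence: one scalar load, then a broadcast."""
    scalar = mem[base]      # stands in for the fld
    return [scalar] * vl    # stands in for the vfmv broadcast

mem = [1.5, 2.5, 3.5, 4.5]
VL = 8  # assumed: eight 64-bit doubles per vector

old_form = scalar_load_then_splat(mem, 1, VL)
new_form = strided_load(mem, 1, 0, VL)  # stride 0 re-reads the same element

assert old_form == new_form
```

With a stride of zero every lane reads the same address, so the strided
load yields the same vector as the load plus broadcast. Any difference
between the two would then have to come from how the hardware executes
them, which is the question being debated here.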


Jeff
