On 10/3/25 8:03 AM, Robin Dapp wrote:
> Is this the full BB? Just wondering because I'm seeing 7 vlse64 in the new
> block, 7 vfmv.f in the old block but only 6 flds. This would imply that one
> of the flds (the one that loads ft0) was outside the block before, but now
> we're always reloading it inside as opposed to just broadcasting? Not saying
> that will account for the icount difference, but it is at least not a 1-1
> translation like "fld + vfmv.f = vlse64".

No. It's a QEMU translation block, so it can start and end at fairly
arbitrary locations. I'd be reasonably confident the nops+vsetvl at
the end represents the start of a loop. So we're likely in a loop
preheader of some kind (at unknown loop depth) falling into a deeper
loop nest. If we were to chase further I'd look at lower PCs rather
than higher PCs for the "missing" fld.
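
For reference, the "fld + vfmv.f = vlse64" equivalence under discussion can be modeled in a few lines of plain Python. This is only a sketch: the function names are made up for illustration, "vfmv.f" is taken here to mean the scalar-to-vector broadcast, and the read counts describe a naive element-by-element model, not how any real implementation handles a zero stride.

```python
def fld_then_splat(mem, index, vl):
    """Model of the old sequence: one scalar load, then a broadcast."""
    scalar = mem[index]              # stands in for fld
    return [scalar] * vl, 1          # stands in for the broadcast; 1 memory read


def strided_load(mem, index, vl, stride=0):
    """Model of a strided vector load; stride=0 re-reads the same element."""
    out = []
    reads = 0
    for i in range(vl):
        out.append(mem[index + i * stride])
        reads += 1                   # naive model: one read per element
    return out, reads


# Hypothetical memory contents, chosen only for the example.
mem = [1.5, 2.5, 3.5]
splat_vec, splat_reads = fld_then_splat(mem, 1, 8)
strided_vec, strided_reads = strided_load(mem, 1, 8)
```

Both paths produce the same 8-element vector; they differ only in how many element reads the naive model charges, which is the kind of gap an optimized zero-stride path would be expected to close.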

>> Which again looks exactly like one would expect from this optimization. I
>> haven't verified with 100% certainty, but I'm pretty sure the vectors in
>> question are full 8x64bit doubles based on finding what I'm fairly sure is
>> the vsetvl controlling these instructions.
>>
>> I can only conclude that the optimization is behaving per design and
>> that our uarch isn't handling this idiom performantly in the FP domain.
>
> Couldn't another interpretation be:
> "An optimized zero strided load is faster than a regular strided load but still
> slower than vfmv.f (+fld)"?

You could interpret it that way, though it's hard to see how that could
really account for the performance difference observed.
Jeff