On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
I experimented with this patch, which makes it possible to remove a
vfmv when a floating-point operand can instead be loaded directly from
memory with a zero-stride vlse.
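For reference, a minimal sketch of the kind of source pattern involved
(names are illustrative, not taken from any benchmark): a loop scaled
by a loop-invariant scalar, where the vectorizer has to broadcast the
scalar into a vector register.

```c
#include <stddef.h>

/* Multiplying an array by a loop-invariant scalar makes the
   vectorizer broadcast `s` into a vector register.  Without the
   patch the broadcast is a scalar load plus a splat:
       fld      fa0, 0(a1)
       vfmv.v.f v1, fa0
   With the patch it becomes a single zero-stride strided load:
       vlse64.v v1, (a1), zero  */
void scale(double *restrict dst, const double *restrict src,
           double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * s;
}
```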
In terms of benchmarks, I measured the following reductions in icount:
* 503.bwaves: -4.0%
* 538.imagick: -3.3%
* 549.fotonik3d: -0.34%
However, the icount for 507.cactuBSSN increased by 0.43%. In addition,
measurements on the BPI board show that the patch actually increases
execution times by 5 to 11%.
This may still be beneficial for some uarchs but would have to be
tunable, wouldn't it?
Is it worth proceeding with this?
So I looked a bit deeper at the instruction mix data for bwaves; I was
kind of hoping to see something odd happening that would explain the
performance behavior, but no such luck.
If we were running into something weird, like a failure to hoist a memory
reference out of a loop in the vlse64 version, we'd see significant
discrepancies in how the icounts change.
In the original code we have approx 14b fld instructions and 11b
vfmv.v.f instructions. After the change we have roughly 11b vlse64
instructions, 3b fld instructions and virtually no vfmv.v.f. It's
almost an exact match for what one would expect.
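The accounting above can be modeled trivially: if every fld + vfmv.v.f
broadcast pair collapses into one vlse64.v, the ~11b vfmv.v.f predicts
~11b vlse64 and 14b - 11b = 3b surviving flds, which is what the
counters show (a toy model, counts in billions, rounded as above):

```c
/* Model of the transformation's effect on dynamic instruction counts:
   each fld + vfmv.v.f broadcast pair becomes a single vlse64.v, so
   the new vlse64 count equals the old vfmv count, and the only flds
   left are those that were not feeding a broadcast.  */
struct icounts { double fld, vfmv, vlse; };

static struct icounts apply_zero_stride_opt(struct icounts before)
{
    struct icounts after;
    after.vlse = before.vfmv;              /* one vlse64 per pair    */
    after.fld  = before.fld - before.vfmv; /* broadcast flds vanish  */
    after.vfmv = 0;                        /* splat is eliminated    */
    return after;
}
```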
All the meaningful changes happen in one qemu translation block
ORIG:
0x00000000000192c2 60,057,002,880 27.4641% mat_times_vec_
mv t4,s4
add a1,a5,s10
add a6,a5,s11
snez a4,a4
ld a3,16(sp)
mv t3,s3
fld fa2,0(a1)
add a1,a5,s7
fld fa1,0(a6)
neg a4,a4
mv t1,s2
mv a6,s0
fld fa4,0(a1)
vmv.v.x v0,a4
mv a0,t2
mv a1,t0
mv a4,s5
sh3add a7,a2,a5
ld a2,8(sp)
fld fa0,0(a7)
vmsne.vi v0,v0,0
mv a7,s1
vfmv.v.f v15,ft0
vfmv.v.f v13,fa1
sh3add a2,a2,a5
add a5,a5,s6
vfmv.v.f v12,fa2
fld fa3,0(a2)
fld fa5,0(a5)
vfmv.v.f v10,fa4
mv a2,s5
vfmv.v.f v14,fa0
vfmv.v.f v11,fa3
vfmv.v.f v9,fa5
nop
nop
vsetvli a5,a3,e64,m1,ta,ma
NEW:
0x00000000000192b8 56,810,678,400 27.0298% mat_times_vec_
mv t4,s4
mv t3,s3
mv t1,s2
add a3,a5,s10
sh3add a4,s5,a5
sh3add a1,s7,a5
add a2,a5,s11
vlse64.v v9,(t6),zero
mv a7,s1
mv a6,s0
mv a0,t2
vlse64.v v13,(a3),zero
ld a3,24(sp)
sd a3,8(sp)
vlse64.v v12,(a4),zero
ld a4,16(sp)
add a4,a4,a5
add a5,a5,a3
ld a3,32(sp)
vlse64.v v10,(a5),zero
vlse64.v v11,(a4),zero
addi a4,t5,-1
snez a4,a4
neg a4,a4
vmv.v.x v0,a4
mv a4,s6
vmsne.vi v0,v0,0
vlse64.v v15,(a1),zero
mv a1,t0
vlse64.v v14,(a2),zero
mv a2,s6
nop
nop
nop
vsetvli a5,a3,e64,m1,ta,ma
This again looks exactly like what one would expect from this optimization.
I haven't verified with 100% certainty, but I'm pretty sure the vectors
in question are full 8x64-bit doubles, based on finding what I'm fairly
sure is the vsetvl controlling these instructions.
I can only conclude that the optimization is behaving per design and
that our uarch simply isn't handling this idiom well in the FP domain.
So what I would suggest is adding another tuning flag so that we can
distinguish between the FP and integer cases and make this change
conditional on the uarch asking for this behavior.
Given we haven't yet seen a design where this is profitable, just make
it false across the board for all the upstreamed uarchs, except at -Os
where we likely want it on. Obviously it's disappointing, but I wouldn't
want to lose the work, as I do think this performance quirk we're seeing
will be fixed in future designs.
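A sketch of what such a per-uarch tunable could look like, loosely in
the spirit of GCC's riscv_tune_param tables; the structure, field, and
uarch names here are illustrative assumptions, not the actual GCC code:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-uarch tuning knob: each upstreamed uarch leaves
   this false until a design is shown to handle the FP zero-stride
   load idiom well.  A size-oriented entry stands in for -Os, where
   fewer instructions should win regardless.  */
struct tune_param {
    const char *uarch;
    bool use_fp_zero_stride_load;  /* hypothetical flag */
};

static const struct tune_param tune_tables[] = {
    { "generic", false },
    { "size",    true  },  /* -Os: trade the quirk for smaller code */
};

/* Look up the flag for a given uarch, defaulting to off.  */
static bool fp_zero_stride_load_p(const char *uarch)
{
    for (size_t i = 0; i < sizeof tune_tables / sizeof *tune_tables; i++)
        if (strcmp(tune_tables[i].uarch, uarch) == 0)
            return tune_tables[i].use_fp_zero_stride_load;
    return false;
}
```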
Other thoughts?
Jeff