> If we were running into something weird like a failure to hoist a memory
> reference out of a loop in the vlse64 version we'd see significant
> discrepancies in how the icounts change.
>
> In the original code we have approx 14b fld instructions and 11b
> vfmv.v.f instructions. After the change we have roughly 11b vlse64
> instructions, 3b fld instructions and virtually no vfmv.v.f. It's
> almost an exact match for what one would expect.
>
> All the meaningful changes happen in one qemu translation block
>
> ORIG:
>> 0x00000000000192c2 60057002880 27.4641% mat_times_vec_
>> mv t4,s4
>> add a1,a5,s10
>> add a6,a5,s11
>> snez a4,a4
>> ld a3,16(sp)
>> mv t3,s3
>> fld fa2,0(a1)
>> add a1,a5,s7
>> fld fa1,0(a6)
>> neg a4,a4
>> mv t1,s2
>> mv a6,s0
>> fld fa4,0(a1)
>> vmv.v.x v0,a4
>> mv a0,t2
>> mv a1,t0
>> mv a4,s5
>> sh3add a7,a2,a5
>> ld a2,8(sp)
>> fld fa0,0(a7)
>> vmsne.vi v0,v0,0
>> mv a7,s1
>> vfmv.v.f v15,ft0
>> vfmv.v.f v13,fa1
>> sh3add a2,a2,a5
>> add a5,a5,s6
>> vfmv.v.f v12,fa2
>> fld fa3,0(a2)
>> fld fa5,0(a5)
>> vfmv.v.f v10,fa4
>> mv a2,s5
>> vfmv.v.f v14,fa0
>> vfmv.v.f v11,fa3
>> vfmv.v.f v9,fa5
>> nop
>> nop
>> vsetvli a5,a3,e64,m1,ta,ma
>
> NEW:
>> 0x00000000000192b8 56,810,678,400 27.0298% mat_times_vec_
>> mv t4,s4
>> mv t3,s3
>> mv t1,s2
>> add a3,a5,s10
>> sh3add a4,s5,a5
>> sh3add a1,s7,a5
>> add a2,a5,s11
>> vlse64.v v9,(t6),zero
>> mv a7,s1
>> mv a6,s0
>> mv a0,t2
>> vlse64.v v13,(a3),zero
>> ld a3,24(sp)
>> sd a3,8(sp)
>> vlse64.v v12,(a4),zero
>> ld a4,16(sp)
>> add a4,a4,a5
>> add a5,a5,a3
>> ld a3,32(sp)
>> vlse64.v v10,(a5),zero
>> vlse64.v v11,(a4),zero
>> addi a4,t5,-1
>> snez a4,a4
>> neg a4,a4
>> vmv.v.x v0,a4
>> mv a4,s6
>> vmsne.vi v0,v0,0
>> vlse64.v v15,(a1),zero
>> mv a1,t0
>> vlse64.v v14,(a2),zero
>> mv a2,s6
>> nop
>> nop
>> nop
>> vsetvli a5,a3,e64,m1,ta,ma
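
To make the two idioms concrete for anyone following along: the old block
materializes each loop-invariant scalar with an fld followed by a vfmv.v.f
broadcast, while the new block broadcasts it straight from memory with a
zero-strided vlse64.v. A minimal sketch in the v1.0 RVV C intrinsics (the
kernel and function names are made up for illustration, this is not the
actual mat_times_vec_ source):

#include <riscv_vector.h>
#include <stddef.h>

/* Old idiom: load the scalar into an FP register, then broadcast it
   into a vector register (fld + vfmv.v.f). */
void scale_fld_vfmv(double *dst, const double *src,
                    const double *scale, size_t n)
{
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e64m1(n - i);
        vfloat64m1_t vs = __riscv_vfmv_v_f_f64m1(*scale, vl);   /* fld + vfmv.v.f */
        vfloat64m1_t vx = __riscv_vle64_v_f64m1(src + i, vl);
        __riscv_vse64_v_f64m1(dst + i, __riscv_vfmul_vv_f64m1(vx, vs, vl), vl);
        i += vl;
    }
}

/* New idiom: broadcast directly from memory with a zero-strided load
   (vlse64.v vd,(rs1),zero). */
void scale_vlse_zero(double *dst, const double *src,
                     const double *scale, size_t n)
{
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e64m1(n - i);
        vfloat64m1_t vs = __riscv_vlse64_v_f64m1(scale, 0, vl);  /* byte stride 0 */
        vfloat64m1_t vx = __riscv_vle64_v_f64m1(src + i, vl);
        __riscv_vse64_v_f64m1(dst + i, __riscv_vfmul_vv_f64m1(vx, vs, vl), vl);
        i += vl;
    }
}

Both variants splat the same double into every element of the vector; the
only difference is whether the value goes through a scalar FP register
first or comes straight from memory.
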
Is this the full BB? Just wondering because I'm seeing 7 vlse64 in the new
block and 7 vfmv.v.f in the old block, but only 6 flds. That would imply
that one of the flds (the one that loads ft0) was outside the block before,
whereas now we always reload the value inside the block instead of just
broadcasting it? Not saying that will account for the icount difference,
but it is at least not a 1-1 translation like "fld + vfmv.v.f = vlse64".

> Which again looks exactly like one would expect from this optimization. I
> haven't verified with 100% certainty, but I'm pretty sure the vectors in
> question are full 8x64bit doubles based on finding what I'm fairly sure is
> the vsetvl controlling these instructions.
>
> I can only conclude that the optimization is behaving per design and
> that our uarch isn't handling this idiom performantly in the FP domain.

Couldn't another interpretation be: "An optimized zero-strided load is
faster than a regular strided load but still slower than vfmv.v.f (+ fld)"?

--
Regards
 Robin
