On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
I experimented with this patch, which makes it possible to remove a
vfmv when a floating-point operand can instead be loaded directly from
memory with a zero-stride vlse.
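For reference, a minimal sketch of the kind of source pattern involved
(names are illustrative, not taken from any benchmark): a loop scaled
by a loop-invariant scalar, where the vectorizer has to broadcast the
scalar into a vector register.

```c
#include <stddef.h>

/* Multiplying an array by a loop-invariant scalar makes the
   vectorizer broadcast `s` into a vector register.  Without the
   patch the broadcast is a scalar load plus a splat:
       fld      fa0, 0(a1)
       vfmv.v.f v1, fa0
   With the patch it becomes a single zero-stride strided load:
       vlse64.v v1, (a1), zero  */
void scale(double *restrict dst, const double *restrict src,
           double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * s;
}
```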
In terms of benchmarks, I measured the following reductions in icount:
* 503.bwaves: -4.0%
* 538.imagick: -3.3%
* 549.fotonik3d: -0.34%
However, the icount for 507.cactuBSSN increased by 0.43%. In addition,
measurements on the BPI board show that the patch actually increases
execution times by 5 to 11%.
This may still be beneficial for some uarchs but would have to be
tunable, wouldn't it?
Is it worth proceeding with this?
So I looked a bit deeper at the instruction mix data for bwaves; I was
kind of hoping to see something odd happening that would explain the
performance behavior, but no such luck.
If we were running into something weird, like a failure to hoist a memory
reference out of a loop in the vlse64 version, we'd see significant
discrepancies in how the icounts change.
In the original code we have approx 14b fld instructions and 11b
vfmv.v.f instructions. After the change we have roughly 11b vlse64
instructions, 3b fld instructions and virtually no vfmv.v.f. It's
almost an exact match for what one would expect.
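The accounting above can be modeled trivially: if every fld + vfmv.v.f
broadcast pair collapses into one vlse64.v, the ~11b vfmv.v.f predicts
~11b vlse64 and 14b - 11b = 3b surviving flds, which is what the
counters show (a toy model, counts in billions, rounded as above):

```c
/* Model of the transformation's effect on dynamic instruction counts:
   each fld + vfmv.v.f broadcast pair becomes a single vlse64.v, so
   the new vlse64 count equals the old vfmv count, and the only flds
   left are those that were not feeding a broadcast.  */
struct icounts { double fld, vfmv, vlse; };

static struct icounts apply_zero_stride_opt(struct icounts before)
{
    struct icounts after;
    after.vlse = before.vfmv;              /* one vlse64 per pair    */
    after.fld  = before.fld - before.vfmv; /* broadcast flds vanish  */
    after.vfmv = 0;                        /* splat is eliminated    */
    return after;
}
```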
All the meaningful changes happen in one qemu translation block
ORIG:
0x00000000000192c2 60,057,002,880 27.4641% mat_times_vec_
mv t4,s4
add a1,a5,s10
add a6,a5,s11
snez a4,a4
ld a3,16(sp)
mv t3,s3
fld fa2,0(a1)
add a1,a5,s7
fld fa1,0(a6)
neg a4,a4
mv t1,s2
mv a6,s0
fld fa4,0(a1)
vmv.v.x v0,a4
mv a0,t2
mv a1,t0
mv a4,s5
sh3add a7,a2,a5
ld a2,8(sp)
fld fa0,0(a7)
vmsne.vi v0,v0,0
mv a7,s1
vfmv.v.f v15,ft0
vfmv.v.f v13,fa1
sh3add a2,a2,a5
add a5,a5,s6
vfmv.v.f v12,fa2
fld fa3,0(a2)
fld fa5,0(a5)
vfmv.v.f v10,fa4
mv a2,s5
vfmv.v.f v14,fa0
vfmv.v.f v11,fa3
vfmv.v.f v9,fa5
nop
nop
vsetvli a5,a3,e64,m1,ta,ma
NEW:
0x00000000000192b8 56,810,678,400 27.0298% mat_times_vec_
mv t4,s4
mv t3,s3
mv t1,s2
add a3,a5,s10
sh3add a4,s5,a5
sh3add a1,s7,a5
add a2,a5,s11
vlse64.v v9,(t6),zero
mv a7,s1
mv a6,s0
mv a0,t2
vlse64.v v13,(a3),zero
ld a3,24(sp)
sd a3,8(sp)
vlse64.v v12,(a4),zero
ld a4,16(sp)
add a4,a4,a5
add a5,a5,a3
ld a3,32(sp)
vlse64.v v10,(a5),zero
vlse64.v v11,(a4),zero
addi a4,t5,-1
snez a4,a4
neg a4,a4
vmv.v.x v0,a4
mv a4,s6
vmsne.vi v0,v0,0
vlse64.v v15,(a1),zero
mv a1,t0
vlse64.v v14,(a2),zero
mv a2,s6
nop
nop
nop
vsetvli a5,a3,e64,m1,ta,ma
This again looks exactly like what one would expect from this optimization.
I haven't verified with 100% certainty, but I'm pretty sure the vectors
in question are full 8x64-bit doubles, based on finding what I'm fairly
sure is the vsetvl controlling these instructions.
I can only conclude that the optimization is behaving per design and
that our uarch simply isn't handling this idiom well in the FP domain.
So what I would suggest is adding another tuning flag so that we can
distinguish between the FP and integer cases and make this change
conditional on the uarch asking for this behavior.
Given we haven't yet seen a design where this is profitable, just make
it false across the board for all the upstreamed uarchs, except at -Os
where we likely want it on. Obviously it's disappointing, but I wouldn't
want to lose the work, as I do think this performance quirk we're seeing
will be fixed in future designs.
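A sketch of what such a per-uarch tunable could look like, loosely in
the spirit of GCC's riscv_tune_param tables; the structure, field, and
uarch names here are illustrative assumptions, not the actual GCC code:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-uarch tuning knob: each upstreamed uarch leaves
   this false until a design is shown to handle the FP zero-stride
   load idiom well.  A size-oriented entry stands in for -Os, where
   fewer instructions should win regardless.  */
struct tune_param {
    const char *uarch;
    bool use_fp_zero_stride_load;  /* hypothetical flag */
};

static const struct tune_param tune_tables[] = {
    { "generic", false },
    { "size",    true  },  /* -Os: trade the quirk for smaller code */
};

/* Look up the flag for a given uarch, defaulting to off.  */
static bool fp_zero_stride_load_p(const char *uarch)
{
    for (size_t i = 0; i < sizeof tune_tables / sizeof *tune_tables; i++)
        if (strcmp(tune_tables[i].uarch, uarch) == 0)
            return tune_tables[i].use_fp_zero_stride_load;
    return false;
}
```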
Other thoughts?
Jeff