> If we were running into something weird like a failure to hoist a memory 
> reference out of a loop in the vlse64 version we'd see significant 
> discrepancies in how the icounts change.
>
> In the original code we have approx 14b fld instructions and 11b 
> vfmv.v.f instructions.  After the change we have roughly 11b vlse64 
> instructions, 3b fld instructions and virtually no vfmv.v.f.  It's 
> almost an exact match for what one would expect.
>
> All the meaningful changes happen in one qemu translation block
>
> ORIG:
>>   0x00000000000192c2   60057002880 27.4641% mat_times_vec_
>>       mv                      t4,s4
>>       add                     a1,a5,s10
>>       add                     a6,a5,s11
>>       snez                    a4,a4
>>       ld                      a3,16(sp)
>>       mv                      t3,s3
>>       fld                     fa2,0(a1)
>>       add                     a1,a5,s7
>>       fld                     fa1,0(a6)
>>       neg                     a4,a4
>>       mv                      t1,s2
>>       mv                      a6,s0
>>       fld                     fa4,0(a1)
>>       vmv.v.x                 v0,a4
>>       mv                      a0,t2
>>       mv                      a1,t0
>>       mv                      a4,s5
>>       sh3add                  a7,a2,a5
>>       ld                      a2,8(sp)
>>       fld                     fa0,0(a7)
>>       vmsne.vi                v0,v0,0
>>       mv                      a7,s1
>>       vfmv.v.f                v15,ft0
>>       vfmv.v.f                v13,fa1
>>       sh3add                  a2,a2,a5
>>       add                     a5,a5,s6
>>       vfmv.v.f                v12,fa2
>>       fld                     fa3,0(a2)
>>       fld                     fa5,0(a5)
>>       vfmv.v.f                v10,fa4
>>       mv                      a2,s5
>>       vfmv.v.f                v14,fa0
>>       vfmv.v.f                v11,fa3
>>       vfmv.v.f                v9,fa5
>>       nop                     
>>       nop                  
>>       vsetvli                 a5,a3,e64,m1,ta,ma
>
> NEW:
>>   0x00000000000192b8   56,810,678,400 27.0298% mat_times_vec_
>>       mv                      t4,s4
>>       mv                      t3,s3
>>       mv                      t1,s2
>>       add                     a3,a5,s10
>>       sh3add                  a4,s5,a5
>>       sh3add                  a1,s7,a5
>>       add                     a2,a5,s11
>>       vlse64.v                v9,(t6),zero
>>       mv                      a7,s1
>>       mv                      a6,s0
>>       mv                      a0,t2
>>       vlse64.v                v13,(a3),zero
>>       ld                      a3,24(sp)
>>       sd                      a3,8(sp)
>>       vlse64.v                v12,(a4),zero
>>       ld                      a4,16(sp)
>>       add                     a4,a4,a5
>>       add                     a5,a5,a3
>>       ld                      a3,32(sp)
>>       vlse64.v                v10,(a5),zero
>>       vlse64.v                v11,(a4),zero
>>       addi                    a4,t5,-1
>>       snez                    a4,a4
>>       neg                     a4,a4
>>       vmv.v.x                 v0,a4
>>       mv                      a4,s6
>>       vmsne.vi                v0,v0,0
>>       vlse64.v                v15,(a1),zero
>>       mv                      a1,t0
>>       vlse64.v                v14,(a2),zero
>>       mv                      a2,s6
>>       nop                     
>>       nop                     
>>       nop                     
>>       vsetvli                 a5,a3,e64,m1,ta,ma

Is this the full BB?  I ask because I see 7 vlse64.v in the new block and
7 vfmv.v.f in the old block, but only 6 flds.  That would imply that one of
the flds (the one that loads ft0) was outside the block before, whereas now
we always reload the value inside the block instead of just broadcasting it.
I'm not saying that accounts for the icount difference, but it is at least
not a 1-1 translation like "fld + vfmv.v.f = vlse64.v".
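
For anyone who wants to double-check the counts, here is a rough sketch.  The
mnemonic lists are transcribed by hand from the two quoted blocks (operands
dropped), and treating the per-block figure as "static instructions in the
block times number of executions" is my assumption about how the tool
reports it, not something I have verified.

```python
from collections import Counter

# Mnemonics in order of appearance in the quoted ORIG block (hand-transcribed).
ORIG = """mv add add snez ld mv fld add fld neg mv mv fld vmv.v.x mv mv mv
sh3add ld fld vmsne.vi mv vfmv.v.f vfmv.v.f sh3add add vfmv.v.f fld fld
vfmv.v.f mv vfmv.v.f vfmv.v.f vfmv.v.f nop nop vsetvli""".split()

# Mnemonics in order of appearance in the quoted NEW block (hand-transcribed).
NEW = """mv mv mv add sh3add sh3add add vlse64.v mv mv mv vlse64.v ld sd
vlse64.v ld add add ld vlse64.v vlse64.v addi snez neg vmv.v.x mv vmsne.vi
vlse64.v mv vlse64.v mv nop nop nop vsetvli""".split()

# Per-block icounts as quoted above.
ORIG_ICOUNT = 60_057_002_880
NEW_ICOUNT = 56_810_678_400


def summarize(mnemonics, icount):
    """Return (per-mnemonic counts, estimated block executions, remainder).

    Assumption: icount == len(mnemonics) * executions for a translation block.
    A non-zero remainder would suggest the listing is not the whole block.
    """
    counts = Counter(mnemonics)
    executions, remainder = divmod(icount, len(mnemonics))
    return counts, executions, remainder


if __name__ == "__main__":
    for name, block, icount in (("ORIG", ORIG, ORIG_ICOUNT),
                                ("NEW", NEW, NEW_ICOUNT)):
        counts, executions, remainder = summarize(block, icount)
        print(name, "instructions:", len(block),
              "fld:", counts["fld"],
              "vfmv.v.f:", counts["vfmv.v.f"],
              "vlse64.v:", counts["vlse64.v"],
              "executions:", executions, "remainder:", remainder)
```

If the assumption holds, both icounts divide evenly by the number of listed
instructions and give the same execution count, which would suggest the
listings are complete and that the missing fld really sits outside the block.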

> Which again looks exactly like one would expect from this optimization. I 
> haven't verified with 100% certainty, but I'm pretty sure the vectors in 
> question are full 8x64bit doubles based on finding what I'm fairly sure is 
> the vsetvl controlling these instructions.
>
> I can only conclude that the optimization is behaving per design and 
> that our uarch isn't handling this idiom performantly in the FP domain.

Couldn't another interpretation be:
"An optimized zero-stride load is faster than a regular strided load but still
slower than fld + vfmv.v.f"?

-- 
Regards
 Robin
