On 9/23/25 13:39, Jeff Law wrote:
>
> On 9/23/25 1:45 PM, Paul-Antoine Arras wrote:
>> I experimented with this patch, which allows removing a vfmv when a 
>> floating-point op can be loaded directly from memory with a zero-stride 
>> vlse.
>>
>> In terms of benchmarks, I measured the following reductions in icount:
>> * 503.bwaves: -4.0%
>> * 538.imagick: -3.3%
>> * 549.fotonik3d: -0.34%
>>
>> However, the icount for 507.cactuBSSN increased by 0.43%. In addition, 
>> measurements on the BPI board show that the patch actually increases 
>> execution times by 5 to 11%.
>>
>> This may still be beneficial for some uarchs but would have to be 
>> tunable, wouldn't it?
>> Is it worth proceeding with this?
> It's probably worth investigating.  Do you happen to have A/B binaries 
> handy still?  I could throw them onto our design.

FWIW they will perform poorly on our design, much like integer zero-stride
loads used for broadcasts.
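
For anyone following along, here is a minimal sketch of the kind of loop
where the choice arises. The function name and register numbers are made up
for illustration, and the sequences in the trailing comment are hand-written,
not output from the patch; a compiler may well pick something else entirely
(e.g. a vector-scalar multiply).

```c
#include <stddef.h>

/* Hypothetical kernel: the scalar *s lives in memory and has to be
   broadcast to every vector element if the loop is vectorized. */
void scale_by_mem_scalar(double *restrict y, const double *restrict a,
                         const double *restrict s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] * *s;
}

/* Two illustrative ways to materialize the broadcast (assumed, not
 * actual compiler output):
 *
 *   scalar load + splat:   fld      fa5, 0(a2)
 *                          vfmv.v.f v1, fa5
 *
 *   zero-stride load:      vlse64.v v1, (a2), zero
 */
```

The second form saves an instruction, which would lower icount, but whether
it is faster depends on how the uarch handles a stride of zero.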

> Austin and I tested the BPI for the zero-strided load idiom, but just on 
> the integer side and it looked like it likely supported optimizing those 
> into a single load + an internal broadcast across the vector.  So it's a 
> bit of a surprise to see it not performing well at all for FP.
>
> Note there is an entry in the riscv_tune_param structure controlling the 
> zero-stride idiom.  So you could test that quite easily and assuming the 
> port had things defined properly it would just work.

Thx,
-Vineet
