> I built A/B binaries for bwaves and just ran input #1 on design.  The results 
> roughly match yours.  About a 5% regression in performance with a 5% 
> improvement in icount.
>
> We do have recognition of the zero stride load idiom in our design and 
> it works for integer sources.  The fact that FP performs so poorly is 
> quite a surprise.  Though this top line behavior does match what we're 
> seeing on the BPI as well.
>
> I'm getting some data with perf record to see if there's perhaps 
> something goofy going on that can be easily spotted.   What doesn't make 
> much sense here is our LSU shouldn't really care about the underlying 
> data types.

If it's something at the compiler level then, as I wrote in my first response,
I suspect a failure to hoist something, or LCM being inhibited by the mem.
When treating a mem as a regular broadcast we force it to a register in the
pred_broadcast expander, while we don't do the same in the
pred_strided_broadcast expander.  Either way, that should be visible in qemu
collect data.
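
To illustrate the kind of missed hoist I have in mind, here is a minimal
hand-written sketch.  It is not taken from bwaves, and the function names are
made up for illustration.  The first loop re-reads the loop-invariant scalar
from memory on every iteration, which is roughly what happens if the broadcast
operand stays a mem inside the loop.  The second loads it once into a register
before the loop.

```c
#include <stddef.h>

/* Hypothetical kernel: *scale is loop-invariant but is read from memory
   in every iteration, as when the broadcast source remains a mem.  */
void scale_in_loop(double *restrict dst, const double *src,
                   const double *scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *scale;
}

/* Same computation with the scalar load hoisted out of the loop by hand,
   so the loop body only uses a register operand.  */
void scale_hoisted(double *restrict dst, const double *src,
                   const double *scale, size_t n)
{
    const double s = *scale;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * s;
}
```

Both functions compute the same result; comparing the loop bodies in the
generated code for the two variants is one way to check whether the load is
actually being hoisted.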

-- 
Regards
 Robin
