> I built A/B binaries for bwaves and just ran input #1 on our design. The results
> roughly match yours. About a 5% regression in performance with a 5%
> improvement in icount.
>
> We do have recognition of the zero stride load idiom in our design and
> it works for integer sources. The fact that FP performs so poorly is
> quite a surprise. Though this top line behavior does match what we're
> seeing on the BPI as well.
>
> I'm getting some data with perf record to see if there's perhaps
> something goofy going on that can be easily spotted. What doesn't make
> much sense here is our LSU shouldn't really care about the underlying
> data types.
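
To make the pattern concrete, here is a minimal scalar C sketch of the two shapes such a loop can take. It is only an illustration under my own assumptions: the function names and the kernel are hypothetical and not taken from bwaves, and `volatile` is used purely to force the per-iteration reload that a broadcast left as a memory operand inside the loop would imply.

```c
#include <stddef.h>

/* Hypothetical kernel: the loop-invariant scalar *s stays a memory
   operand.  The volatile qualifier forces a fresh read on every
   iteration, standing in for a broadcast-from-memory that is not
   hoisted out of the loop.  */
void scale_reload(double *out, const double *in,
                  const volatile double *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *s;
}

/* Same kernel with the scalar forced into a register (a local) once,
   before the loop.  This is the shape we would hope to get when the
   load is hoisted.  */
void scale_hoisted(double *out, const double *in,
                   const double *s, size_t n)
{
    const double k = *s;  /* single load, outside the loop */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}
```

Both variants compute the same result; the only difference is how often the scalar is read from memory, which is the kind of difference that would show up in the instruction counts rather than in the output.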

If it's something at the compiler level then, as written in my first
response, I suspect a failure to hoist something or LCM being inhibited
by the mem. When treating a mem as a regular broadcast we force it into
a register in the pred_broadcast expander, while we don't do the same in
the pred_strided_broadcast expander.

Either way, that should be visible in the data collected with qemu.

-- 
Regards
 Robin
