>> Why +1, +2, +3 when the actual data processed is *2, *4, *8?  I'd scale by
>> that, as usually the latency is also similarly affected.
>
> Good point.  I understand "cost" is an abstract concept, so I wasn't sure if
> it should scale with LMUL the same way a hardware instruction's
> latency/throughput does.

Yeah, generally "cost" is abstract and means different things at different 
points.  For vector costing, though, we want to compare scalar to vector, and 
latency/throughput directly matter for that comparison.  Scalar costs are 
normally simple, i.e. everything costs "1".  For a superscalar uarch we would 
multiply the vector costs by e.g. 2 to account for 4 scalar ALUs vs 2 vector 
ALUs.  Once this scaling is out of the way we can more or less directly 
compare latencies.  So if a vector op takes 4 cycles at LMUL1 it might take 8 
cycles at LMUL2.  This also means that in those 8 cycles 2x the number of 
scalar ops could execute.
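To make the arithmetic concrete, here is a hypothetical sketch of the scaling 
described above.  The function name, parameters, and the assumption that 
latency scales linearly with LMUL are mine for illustration; this is not the 
actual cost-model code.

```python
def vector_op_cost(base_latency, lmul, scalar_alus, vector_alus):
    """Illustrative vector cost: scale by LMUL, then by the
    scalar/vector issue-width ratio (assumption: latency grows
    linearly with LMUL, as in the 4-cycle -> 8-cycle example)."""
    lmul_scaled = base_latency * lmul
    # e.g. 4 scalar ALUs vs 2 vector ALUs -> multiply vector cost by 2
    return lmul_scaled * (scalar_alus // vector_alus)

# 4 cycles at LMUL1, doubled by the 4-vs-2 ALU ratio:
print(vector_op_cost(4, 1, 4, 2))  # 8
# 8 cycles at LMUL2, doubled again by the ALU ratio:
print(vector_op_cost(4, 2, 4, 2))  # 16
```

With a flat scalar cost of "1" per op, these scaled vector costs can then be 
compared against the number of scalar ops the vector op replaces.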

We don't have this "scalar" scaling part right now, though, mostly because 
we're unsure which machine to target.  With better hardware availability that 
should change soon.

Do you have a specific uarch in mind?

-- 
Regards
 Robin
