>> Why +1, +2, +3 when the actual data processed is *2, *4, *8?  I'd scale
>> by that, as usually the latency is also similarly affected.
>
> Good point.  I understand "cost" is an abstract concept, so I wasn't sure
> if it should scale with LMUL the same way a hardware instruction's
> latency/throughput does.
Yeah, generally "cost" is abstract and means different things at different
points.  For vector costing, though, we want to compare scalar to vector,
and there latency/throughput directly matter.

Scalar costs are normally simple, i.e. everything costs "1".  With a
superscalar uarch we would multiply the vector costs by e.g. 2 to account
for 4 scalar ALUs vs 2 vector ALUs.  Once this scaling is out of the way
we more or less directly compare latency.  So if a vector op takes 4 cycles
at LMUL1 it might take 8 cycles at LMUL2.  This also means that in those
8 cycles 2x the number of scalar ops could execute.

We don't have this "scalar" scaling part right now, though, mostly because
we're unsure about which machine to target.  With better hardware
availability this should change soon.  Do you have a specific uarch in mind?

--
Regards
 Robin
