> Sifive core has that optimization for part of the cores like x280, but not
> for p470/p670, and seems like Tenstorrent Ascalon also doing that
> optimization as well? (they set that on both LLVM and GCC).

Does having that optimization imply that a zero-stride load is indeed as fast
as or faster than a scalar load plus a broadcast, in terms of latency and
throughput?
IMHO we have three hardware "tiers" (slow; fast but worse than scalar; same as
or better than scalar), but the switch is only binary.  Our design has the
optimization, but scalar + broadcast is still faster, and the same is true
for the Banana Pi.  So even though it is "fast", we'd still disable it.
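
To make the mismatch concrete, here is a minimal sketch in C.  All names are
hypothetical and purely illustrative; they are not existing GCC or LLVM
tunables.  It only shows that a binary "hardware has the optimization" flag
and the question codegen actually cares about disagree for the middle tier:

```c
#include <stdbool.h>

/* Hypothetical three-level classification of zero-stride load performance.
   These names are assumptions for illustration, not real tuning knobs.  */
enum zero_stride_tier
{
  ZERO_STRIDE_SLOW,            /* No hardware optimization at all.  */
  ZERO_STRIDE_FAST,            /* Optimized, but scalar load + broadcast
                                  is still faster.  */
  ZERO_STRIDE_AT_LEAST_SCALAR  /* As fast as or faster than scalar load
                                  + broadcast.  */
};

/* What a binary "has the optimization" switch can express: it only
   separates the slow tier from the other two.  */
static bool
hw_has_zero_stride_opt (enum zero_stride_tier tier)
{
  return tier != ZERO_STRIDE_SLOW;
}

/* What codegen wants to know: should a zero-stride load be emitted
   instead of a scalar load followed by a broadcast?  */
static bool
prefer_zero_stride_load (enum zero_stride_tier tier)
{
  return tier == ZERO_STRIDE_AT_LEAST_SCALAR;
}
```

For the middle tier the two predicates give different answers, which is why
a core in that tier ends up setting the binary switch to "off".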

I'm not sure about Ascalon; their public numbers are from before camel-cdr
updated his benchmark to include zero strides.

-- 
Regards
 Robin
