On Tue, Nov 4, 2025 at 8:57 PM Robin Dapp <[email protected]> wrote:
>
> > Sifive core has that optimization for part of the cores like x280, but not
> > for p470/p670, and seems like Tenstorrent Ascalon also doing that
> > optimization as well? (they set that on both LLVM and GCC).
>
> Does having that optimization imply that it is indeed as fast or faster than a
> scalar load and a broadcast in terms of latency and throughput?
> IMHO we have three hardware "tiers" (slow, fast but worse than scalar, same as
> or better than scalar) but the switch is only binary. Our design has the
> optimization but using scalar + broadcast is still faster and the same is true
> for the Banana Pi. So even though it is "fast" we'd still disable it.

Yeah, that's faster than scalar + broadcast in x280 and all other SiFive Intelligence cores.

> I'm not sure about Ascalon, their public numbers are from before camel-cdr
> updated his benchmark to include zero strides.
>
> --
> Regards
> Robin
>
