https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> (In reply to Hongtao Liu from comment #1)
> > It's done by r12-1958, it's better for dcache, but worse for icache, small
> > benchmark in the commit show broadcast from integer is slightly better than
> > constant pool, maybe we should make it as a u-arch specific tuning.
>
> I see it was benchmarked on Intel CPU which have a shared register file, I
> was specifically wondering of the AMD case where any integer <-> FP/vector
> boundary crossing incurs a latency penalty.
>
> If there's already code generation using broadcast from scalar memory
> a tunable would be nice to have given that makes benchmarking such
> change easy.

I'll write a patch for this. Should be able to upstream it next week.
