https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118734
Bug ID: 118734
Summary: RISC-V: Vector broadcast via strided load.
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rdapp at gcc dot gnu.org
CC: jeffreyalaw at gmail dot com, juzhe.zhong at rivai dot ai, kito.cheng at gmail dot com, palmer at dabbelt dot com, pan2.li at intel dot com, vineetg at rivosinc dot com
Target Milestone: ---
Target: riscv

We currently use the strided-load broadcast idiom (a strided load with zero stride) unconditionally when we need to duplicate an element in memory into a vector register:

    vlse32.v v1,(a0),x0

The alternative is

    lw a1,0(a0)
    vmv.v.x v1,a1

The latter is more explicit and more likely to be implemented efficiently across microarchitectures, while the former requires a special-cased strided load. I'd argue that we want at least a uarch tunable to switch the behavior and, given that the special case might be rare, have the non-strided variant be the default.

A second use case of the idiom is broadcasting 64-bit elements on an rv32 target. A corresponding vector sequence would be longer (load the two halves individually and shift them into place) but likely still better on the majority of uarchs?
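For reference, a minimal C sketch of why the two sequences are interchangeable in terms of their result. It is only a scalar model of the architectural behavior, not GCC code and not a performance model. The vector length `VL` and all function names (`load_le32`, `model_vlse32`, `model_vmv_v_x`, `broadcast32_matches`, `model_broadcast64_rv32`) are hypothetical and chosen for illustration. The 64-bit helper models only the expected register contents on rv32, not any particular instruction sequence.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VL 8 /* hypothetical number of elements; not taken from the report */

/* Read a little-endian 32-bit value, as lw does on RISC-V. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Model of a 32-bit strided load: element i comes from base + i * stride. */
static void model_vlse32(uint32_t vd[VL], const uint8_t *base, ptrdiff_t stride)
{
    for (ptrdiff_t i = 0; i < VL; i++)
        vd[i] = load_le32(base + i * stride);
}

/* Model of vmv.v.x at SEW=32: copy the scalar into every element. */
static void model_vmv_v_x(uint32_t vd[VL], uint32_t x)
{
    for (size_t i = 0; i < VL; i++)
        vd[i] = x;
}

/* Returns 1 if both broadcast sequences yield the same register contents. */
static int broadcast32_matches(const uint8_t *mem)
{
    uint32_t strided[VL], splat[VL];

    model_vlse32(strided, mem, 0);        /* zero-stride strided load */
    model_vmv_v_x(splat, load_le32(mem)); /* scalar load, then splat  */
    return memcmp(strided, splat, sizeof strided) == 0;
}

/* Expected result of a 64-bit broadcast built from two 32-bit halves.
   Assumption: this models the outcome only, not the instructions used. */
static void model_broadcast64_rv32(uint64_t vd[VL], const uint8_t *mem)
{
    uint32_t lo = load_le32(mem);     /* low half at offset 0  */
    uint32_t hi = load_le32(mem + 4); /* high half at offset 4 */

    for (size_t i = 0; i < VL; i++)
        vd[i] = ((uint64_t)hi << 32) | lo;
}
```

With a stride of zero every element reads the same address, so the strided load degenerates into a broadcast. That is the special case a microarchitecture would have to detect in order to make it fast.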