https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118734

            Bug ID: 118734
           Summary: RISC-V: Vector broadcast via strided load.
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: jeffreyalaw at gmail dot com, juzhe.zhong at rivai dot ai,
                    kito.cheng at gmail dot com, palmer at dabbelt dot com,
                    pan2.li at intel dot com, vineetg at rivosinc dot com
  Target Milestone: ---
            Target: riscv

We currently use the strided-load broadcast (with zero stride) idiom
unconditionally when we need to duplicate an element in memory into a vector
register:

     vlse32.v  v1,(a0),x0

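For illustration, a loop of the following shape needs such a broadcast: every
iteration adds the same element *x, so the vectorized loop has to replicate
that element into all lanes of a vector register. This is a sketch. The
function name is hypothetical, and whether GCC actually emits the zero-stride
vlse32.v for it depends on the GCC version and flags (e.g. -march=rv64gcv -O3).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: *x is loop-invariant, so when vectorized for RVV
   the compiler has to broadcast the 32-bit element at x into a vector
   register (via the zero-stride strided load, or via lw + vmv.v.x). */
void add_invariant(int32_t *restrict dst, const int32_t *restrict src,
                   const int32_t *restrict x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + *x;
}
```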
The alternative is

     lw      a1,0(a0)
     vmv.v.x v1,a1

The latter is more explicit and more likely to be implemented efficiently
across microarchitectures, whereas the former is only efficient if the
strided-load path special-cases a zero stride.
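To make the difference concrete, here is a small C model of what each
sequence asks of the memory system. It assumes a hypothetical fixed vector
length VL of 8 elements and a naive strided-load unit that performs one memory
access per element. The model_* functions and the mem_accesses counter are
illustrative only, not GCC or RVV API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VL 8  /* hypothetical vector length in 32-bit elements */

/* Number of memory accesses issued by the models below. */
static unsigned mem_accesses;

/* Models vlse32.v vd,(base),stride on a strided-load unit WITHOUT a
   zero-stride special case: element i is read from base + i * stride,
   so stride == 0 (x0) reads the same address VL times. */
static void model_vlse32(int32_t vd[VL], const void *base, ptrdiff_t stride)
{
    const unsigned char *p = base;
    for (size_t i = 0; i < VL; i++) {
        memcpy(&vd[i], p + (ptrdiff_t)i * stride, sizeof vd[i]);
        mem_accesses++;
    }
}

/* Models lw a1,0(a0) followed by vmv.v.x vd,a1: one scalar load, then a
   register-to-register splat into every lane. */
static void model_lw_vmv(int32_t vd[VL], const int32_t *a0)
{
    int32_t a1 = *a0;   /* lw: the only memory access */
    mem_accesses++;
    for (size_t i = 0; i < VL; i++)
        vd[i] = a1;     /* vmv.v.x */
}
```

Both leave the same value in every lane. Under this model they differ only in
that the strided variant touches memory VL times instead of once, which is the
cost a zero-stride special case in hardware would have to remove.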

I'd argue that we want at least a uarch tunable to switch the behavior and,
given that hardware special-casing a zero stride might be rare, make the
non-strided variant the default.

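A second use case of the idiom is broadcasting 64-bit elements on an rv32
target, where a 64-bit value does not fit into a single scalar register and
vmv.v.x (which sign-extends its 32-bit operand at SEW=64) cannot be used
directly.  A corresponding vector sequence would be longer (load the two
halves individually and shift them into place) but would likely still be
better on the majority of uarchs?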
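One possible shape of that longer sequence, modeled lane by lane in C: two lw
for the halves (little-endian, low word first), vmv.v.x at SEW=64 for each
half (sign-extending the 32-bit scalar), vsll.vx by 32 to move the high half
into place, a vsll.vx/vsrl.vx pair to clear the sign extension of the low
half, and a final vor.vv. This instruction sequence is our own assumption, not
something GCC emits today; the helper names and the 4-lane VL64 are
hypothetical. The strided variant it would replace is a single
vlse64.v v1,(a0),x0.

```c
#include <stdint.h>

#define VL64 4  /* hypothetical number of 64-bit lanes */

/* vmv.v.x with SEW=64 on rv32: the 32-bit scalar is sign-extended. */
static uint64_t sext32(uint32_t x)
{
    return (x & 0x80000000u) ? (0xFFFFFFFF00000000ull | x) : (uint64_t)x;
}

/* Broadcast the 64-bit value at a0 into every lane without a strided load. */
static void model_broadcast64_rv32(uint64_t vd[VL64], const uint32_t *a0)
{
    uint32_t lo = a0[0];   /* lw a1,0(a0) */
    uint32_t hi = a0[1];   /* lw a2,4(a0) */

    for (int i = 0; i < VL64; i++) {
        /* vmv.v.x v1,a2 then vsll.vx v1,v1,t0 with t0 = 32
           (vsll.vi cannot encode 32, its immediate is 5 bits). */
        uint64_t h = sext32(hi) << 32;
        /* vmv.v.x v2,a1 then vsll.vx/vsrl.vx by 32: zero-extend the low half. */
        uint64_t l = (sext32(lo) << 32) >> 32;
        vd[i] = h | l;     /* vor.vv v1,v1,v2 */
    }
}
```

Using a low half with its top bit set (e.g. 0x80000001) exercises the step
that clears the sign extension; without it the high half would be corrupted.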
