https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057

Andrew Waterman <andrew at sifive dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew at sifive dot com

--- Comment #7 from Andrew Waterman <andrew at sifive dot com> ---
It is a more advanced optimization, but these known-constant-stride cases can
sometimes be more efficiently vectorized using masked unit-stride loads and
stores.  (Implementations I've worked on execute the masked variants of these
instructions only slightly less efficiently than the unmasked ones.)  For
example:

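  vsetivli x0, 25, e32, m8, ta, ma   # VL = 25, SEW = 32, LMUL = 8
  li t0, 0x1111111                   # mask bits 0, 4, ..., 24 set
  vmv.s.x v0, t0                     # write mask into element 0 of v0

loop:
  vle32.v v8, (a5), v0.t             # load only the unmasked elements
  vse32.v v8, (a4), v0.t             # store the same elements
  addi a5, a5, 512                   # advance source pointer
  addi a4, a4, 512                   # advance destination pointer
  bgeu a1, a5, loop                  # repeat while a1 >= a5 (unsigned)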
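
A minimal scalar sketch in C of what one masked load/store pair in the loop
body does, assuming 32-bit elements and a mask that fits in a single 32-bit
word (so vl <= 32).  The function names are hypothetical and exist only for
illustration; the point is that mask 0x1111111 with VL = 25 touches the same
elements as a stride of four elements (16 bytes):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked unit-stride load followed by a masked
   unit-stride store: element i is copied only when bit i of the mask
   is set.  Assumes vl <= 32 so that the shift stays in range. */
static void masked_unit_stride_copy(uint32_t *dst, const uint32_t *src,
                                    size_t vl, uint32_t mask)
{
    for (size_t i = 0; i < vl; i++)
        if ((mask >> i) & 1u)
            dst[i] = src[i];
}

/* Reference model of a strided copy: every stride-th element of the
   first vl elements is copied; stride is counted in elements and must
   be nonzero. */
static void strided_copy(uint32_t *dst, const uint32_t *src,
                         size_t vl, size_t stride)
{
    for (size_t i = 0; i < vl; i += stride)
        dst[i] = src[i];
}
```

With vl = 25 and mask 0x1111111, both functions copy elements 0, 4, 8, 12,
16, 20 and 24 and leave the rest of the destination untouched.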
