https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057
Andrew Waterman <andrew at sifive dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew at sifive dot com

--- Comment #7 from Andrew Waterman <andrew at sifive dot com> ---
It is a more advanced optimization, but these known-constant-stride cases can
sometimes be more efficiently vectorized using masked unit-stride loads and
stores.  (Implementations I've worked on execute the masked variants of these
instructions only slightly less efficiently than the unmasked ones.)

For example:

        vsetivli        x0, 25, e32, m8, ta, ma
        li      t0, 0x1111111
        vmv.s.x v0, t0
loop:
        vle32.v v8, (a5), v0.t
        vse32.v v8, (a4), v0.t
        addi    a5, a5, 512
        addi    a4, a4, 512
        bgeu    a1, a5, loop
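
The following is a minimal scalar sketch in C, not part of the original comment, of why a masked unit-stride access can stand in for a constant-stride one. It assumes a stride of 4 elements, which is inferred from the mask 0x1111111 (bits 0, 4, ..., 24 set) together with the vector length of 25. The function names are hypothetical. It models only which elements are touched, not performance and not the pointer increments in the loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a strided copy: copy n elements, `stride` elements apart. */
static void strided_copy(uint32_t *dst, const uint32_t *src,
                         size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        dst[i * stride] = src[i * stride];
}

/* Scalar model of a masked unit-stride load/store pair:
   element i is copied only if bit i of `mask` is set. */
static void masked_unit_stride_copy(uint32_t *dst, const uint32_t *src,
                                    size_t vl, uint32_t mask)
{
    assert(vl <= 32);  /* one 32-bit mask word covers at most 32 elements */
    for (size_t i = 0; i < vl; i++)
        if ((mask >> i) & 1u)
            dst[i] = src[i];
}

/* Hypothetical helper: returns 1 if both models leave identical results
   for vl = 25 and mask = 0x1111111, i.e. 7 elements at an assumed stride of 4. */
static int models_agree(void)
{
    uint32_t src[25], a[25] = {0}, b[25] = {0};
    for (size_t i = 0; i < 25; i++)
        src[i] = (uint32_t)(i + 100);
    strided_copy(a, src, 7, 4);                      /* indices 0, 4, ..., 24 */
    masked_unit_stride_copy(b, src, 25, 0x1111111u); /* same indices via mask */
    for (size_t i = 0; i < 25; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling models_agree() checks that the two approaches write the same seven destination elements and leave the other eighteen untouched.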