https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120639
--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I'm just realizing that without knowing the stride statically, we'd generate a
lot of code as we don't have a way of setting an element size for loads
dynamically.  Although RISC-V offers a dynamic element size, it does not apply
to loads/stores.

So it would look something like:

  for (elsz : [64 32 16 8])
    {
      while (i_width > elsz)
        {
          switch (elsz)
            {
              64: // code for 64-bit loads
              32: // code for 32-bit loads
              ...
            }
        }
    }

Or swap outer and inner loop in order to interleave the loads.

One approach is basically to re-vectorize the loop with decreasing element size
and adjusted datarefs; the other one is to interleave the element sizes for each
individual operation (load, arith, store) until i_width is reached.  Masking
doesn't help here as the intention is to load full elements.

For compile-time known strides we might at least not need all the overhead but
could generate only those element-size loops we really need.  We would still
need virtual vector modes (like e.g. V24QI) that could be handled by the
vectorizable_* functions via lowering.  That would be similar to the
interleaving above and would at least double the number of registers.

I'm just thinking out loud here, of course.  I didn't manage to convince myself
that any of this is compelling :)
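
For illustration only, here is a scalar C model of the "decreasing element size" loop sketched above.  It is not GCC code: the function name, its parameters, and the use of bytes (8/4/2/1) instead of bits are assumptions made for the example, and it uses ">=" so that a width that is an exact multiple of an element size is covered by that size.  Each memcpy of elsz bytes stands in for one full-element load/store, and the innermost loop over groups is the part a strided vector load of that element size would replace.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: gather N groups of WIDTH bytes each from SRC,
   where consecutive groups start STRIDE bytes apart, into a dense DST.
   WIDTH is only known at run time, so every element size gets its own
   loop, widest first.  */
static void
copy_groups_by_elsz (unsigned char *dst, const unsigned char *src,
                     size_t n, size_t stride, size_t width)
{
  static const size_t elszs[] = { 8, 4, 2, 1 };
  size_t off = 0;   /* Bytes of each group already handled.  */

  for (size_t k = 0; k < sizeof elszs / sizeof elszs[0]; k++)
    {
      size_t elsz = elszs[k];

      /* Use this element size for as long as a full element fits.  */
      while (width - off >= elsz)
        {
          /* One "strided load" of ELSZ-byte elements across all groups.  */
          for (size_t i = 0; i < n; i++)
            memcpy (dst + i * width + off, src + i * stride + off, elsz);
          off += elsz;
        }
    }
}
```

With width == 11, for example, this model performs one 8-byte, one 2-byte and one 1-byte pass per group, which is the kind of per-element-size code that would have to be emitted for every size when the width is not known at compile time.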