https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120639

--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I'm just realizing that without knowing the stride statically, we'd generate a
lot of code, because we have no way of setting the element size for loads
dynamically.  RISC-V does offer a dynamic element size (SEW), but it does not
apply to loads/stores, which encode the element width in the instruction.

So it would look something like:

for (elsz : [64 32 16 8])
{
  while (i_width >= elsz)
  {
     switch (elsz)
     {
        case 64:
          // code for 64-bit loads
          break;
        case 32:
          // code for 32-bit loads
          break;
        ...
     }
     i_width -= elsz;
  }
}

Or swap outer and inner loop in order to interleave the loads.  One approach is
basically to re-vectorize the loop with decreasing element size and adjusted
datarefs, the other one is to interleave the element sizes for each individual
operation (load, arith, store) until i_width is reached. 
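
To make the first variant concrete, here is a minimal scalar sketch in C of the
"decreasing element size" loop, not actual vectorizer output.  The names
copy_item and gather_strided are hypothetical, widths are in bytes rather than
bits, and I assume the item width and stride are only known at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy one item of WIDTH bytes using the largest
   element size that still fits, then fall back to smaller ones.
   Returns the number of element-sized accesses performed.  */
static size_t
copy_item (unsigned char *dst, const unsigned char *src, size_t width)
{
  static const size_t elszs[] = { 8, 4, 2, 1 };  /* 64/32/16/8 bits */
  size_t n_accesses = 0;

  assert (dst != NULL && src != NULL);

  for (size_t k = 0; k < sizeof elszs / sizeof elszs[0]; k++)
    {
      size_t elsz = elszs[k];
      while (width >= elsz)
        {
          switch (elsz)
            {
            case 8: { uint64_t t; memcpy (&t, src, 8); memcpy (dst, &t, 8); break; }
            case 4: { uint32_t t; memcpy (&t, src, 4); memcpy (dst, &t, 4); break; }
            case 2: { uint16_t t; memcpy (&t, src, 2); memcpy (dst, &t, 2); break; }
            default: *dst = *src; break;
            }
          src += elsz;
          dst += elsz;
          width -= elsz;
          n_accesses++;
        }
    }
  return n_accesses;
}

/* Hypothetical driver: gather N_ITEMS items of WIDTH bytes that are STRIDE
   bytes apart in SRC and pack them contiguously into DST.  */
static size_t
gather_strided (unsigned char *dst, const unsigned char *src,
                size_t stride, size_t width, size_t n_items)
{
  size_t total = 0;
  for (size_t i = 0; i < n_items; i++)
    total += copy_item (dst + i * width, src + i * stride, width);
  return total;
}
```

With a 7-byte item this takes one 4-byte, one 2-byte and one 1-byte access per
item, which is the code-size problem: every element size needs its own load
sequence even though only some of them run for a given width.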

Masking doesn't help here as the intention is to load full elements.

For compile-time known strides we could at least avoid most of that overhead
and generate only those element-size loops we really need.

We would still need virtual vector modes (like e.g. V24QI) that could be
handled by the vectorizable_* functions via lowering.  That would be similar to
the interleaving above and would at least double the number of registers.
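
As a rough illustration of what such a lowering might compute, the sketch below
splits a byte length into power-of-two pieces, largest first, so that 24 bytes
(V24QI) becomes 16 + 8.  This is only an assumption about how the split could
look; split_pow2 is a hypothetical name and not an existing GCC function.

```c
#include <stddef.h>

/* Hypothetical helper: split NBYTES into power-of-two pieces, largest first.
   Writes at most MAX_PARTS piece sizes to PARTS and returns how many were
   written.  */
static size_t
split_pow2 (size_t nbytes, size_t *parts, size_t max_parts)
{
  size_t n = 0;
  size_t p = 1;

  /* Find the largest power of two not exceeding NBYTES.  */
  while (p <= nbytes / 2)
    p *= 2;

  /* Peel off pieces from largest to smallest.  */
  for (; p > 0 && n < max_parts; p /= 2)
    if (nbytes >= p)
      {
        parts[n++] = p;
        nbytes -= p;
      }
  return n;
}
```

For 24 bytes this yields two pieces, matching the "at least double the number
of registers" estimate above if each piece lives in its own vector register.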

I'm just thinking out loud here, of course.  I didn't manage to convince myself
that any of this is compelling :)
