That would definitely be nice to have for both gather and stride loads
I'm not sure I like the direction that's heading ;)
So the loop I'm targeting is x264's satd:
for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
{
a0 = (pix1[0] - pix2[0])...
a1 = (pix1[1] - pix2[1])...
a2 = (pix1[2] - pix2[2])...
a3 = (pix1[3] - pix2[3])...
where DR_STEP is loop-invariant but not a compile-time constant, so
STMT_VINFO_STRIDED_P = true.
Right now we always set VMAT_STRIDED_SLP when STMT_VINFO_STRIDED_P is set.
For the single-element case (and only for that one AFAICT) we can switch to
VMAT_GATHER_SCATTER. Is the idea to relax that and also allow "strided"
gather/scatter for larger groups, involving composition types in particular?
Or maybe I missed the point.
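To make sure we're talking about the same thing, here is a rough plain-C model
of the two ways I can picture loading the 4x4 block from the satd loop. It is
only a sketch under my own assumptions, not vectorizer code: the names
load_elementwise and load_composed are made up for illustration, and the
"composition type" is modelled as one 32-bit load per row.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP 4 /* elements per row, as in the satd loop */
#define ROWS  4 /* loop iterations */

/* Element-wise: one scalar byte load per element, 16 loads in total.  */
static void
load_elementwise (const uint8_t *pix, ptrdiff_t stride,
                  uint8_t out[ROWS * GROUP])
{
  for (int i = 0; i < ROWS; i++, pix += stride)
    for (int j = 0; j < GROUP; j++)
      out[i * GROUP + j] = pix[j];
}

/* Composition: treat each row's four bytes as a single 32-bit element,
   so the whole block is four loads at a runtime stride.  This is the
   shape a strided load of 32-bit elements would have.  */
static void
load_composed (const uint8_t *pix, ptrdiff_t stride,
               uint8_t out[ROWS * GROUP])
{
  uint32_t lanes[ROWS];
  for (int i = 0; i < ROWS; i++, pix += stride)
    memcpy (&lanes[i], pix, sizeof lanes[i]); /* one load per row */
  memcpy (out, lanes, sizeof lanes);          /* byte order is preserved */
}
```

Both fill the same 16 bytes; the open question is which shape (or a real
gather) ends up cheaper on a given target.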
One complication with that is that generic gather/scatter on riscv is also
pretty slow. I'm not sure if it's as bad as on x86, but it's certainly only
rarely a win.
At least right now I'm having a hard time imagining which strategy will be
faster, and I'd be more comfortable with a costing decision than with a static
switch. But we don't compare costs for different strategies; we just choose
one for a specific mode. Of course, in the end VMAT_STRIDED_SLP usually
performs scalar loads in order to construct a vector, but vector-vector loads
and construction are also possible. Maybe that's better than
gather/scatter/strided. I would need to compare a few cases for real to get
a better feeling for it.
If I didn't miss the point, I could give it a shot. Maybe my complicated
dynamic-dispatch scheme for groups larger than the largest vector unit would
fit in there as well then (PR120639).
--
Regards
Robin