That would definitely be nice to have for both gather and stride loads

I'm not sure I like the direction that's heading ;)

So the loop I'm targeting is x264's satd:

   for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
   {
        a0 = (pix1[0] - pix2[0])...
        a1 = (pix1[1] - pix2[1])...
        a2 = (pix1[2] - pix2[2])...
        a3 = (pix1[3] - pix2[3])...

where DR_STEP is known but non-constant, so STMT_VINFO_STRIDED_P = true.

Right now we always set VMAT_STRIDED_SLP when STMT_VINFO_STRIDED_P.

For the single-element case (and only for that one AFAICT) we can switch to VMAT_GATHER_SCATTER. Is the idea to relax that and also allow "strided" gather/scatter for larger groups, involving composition types in particular? Or maybe I missed the point.
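
To check that I understand the composition-type idea, here is a scalar C model of the two ways one 4x4 pixel block could be loaded. It is only a sketch: the function names are made up, it assumes a byte stride and 8-bit pixels, and it is not what the vectorizer emits. The first variant treats every byte as its own gather lane; the second treats each 4-byte group as one 32-bit lane of a strided load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an element-wise gather: 16 byte lanes, each with
   its own offset derived from the runtime stride.  */
static void
load_rows_gathered (const uint8_t *pix, ptrdiff_t stride, uint8_t out[16])
{
  assert (pix != NULL && out != NULL);
  for (int lane = 0; lane < 16; lane++)
    {
      /* Row = lane / 4, column within the group = lane % 4.  */
      ptrdiff_t offset = (lane / 4) * stride + lane % 4;
      out[lane] = pix[offset];
    }
}

/* Hypothetical model of a strided load with a composition type: each
   group of four bytes is moved as one 32-bit lane, so only 4 lanes.  */
static void
load_rows_composed (const uint8_t *pix, ptrdiff_t stride, uint8_t out[16])
{
  assert (pix != NULL && out != NULL);
  for (int row = 0; row < 4; row++)
    {
      uint32_t group;
      /* memcpy avoids unaligned-access and aliasing problems.  */
      memcpy (&group, pix + row * stride, sizeof group);
      memcpy (out + 4 * row, &group, sizeof group);
    }
}
```

Both variants produce the same 16 bytes; the difference is 16 lanes with per-lane offsets versus 4 wider lanes with a single stride, which is where I would expect the cost difference to come from.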

One complication with that is that generic gather/scatter on riscv is also pretty slow. I'm not sure if it's as bad as on x86, but it's certainly only rarely a win.

At least right now I'm having a hard time imagining which strategy will be faster, and I'd be more comfortable with a costing decision rather than a static switch. And we don't compare costs for different strategies but just choose one for a specific mode. Of course, in the end VMAT_STRIDED_SLP usually performs scalar loads in order to construct a vector, but vector-vector loads and construction are also possible. Maybe that's better than gather/scatter/strided. I would need to compare a few cases for real to get a better feeling for it.

If I didn't miss the point I could give it a shot. Maybe my complicated dynamic-dispatch scheme for groups larger than the largest vector unit would fit in there as well then (PR120639).

--
Regards
Robin
