https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122028

            Bug ID: 122028
           Summary: vect: Known vs variable stride
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---

The following is from x264's sub4x4dct:

#define FENC_STRIDE 16

typedef short int16_t;
typedef unsigned char uint8_t;

void pixel_sub_wxh(int16_t diff[16], int i_size, uint8_t *pix1, int i_pix1,
uint8_t *pix2, int i_pix2 )
{
    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            diff[x + y*4] = pix1[x] - pix2[x];
        pix1 += FENC_STRIDE;
        pix2 += FENC_STRIDE;
    }
}

Ideally we'd like to vectorize this with a VF of 16 (and strided loads).  This
works if the strides are variable (i_pix1, i_pix2) as we can then go the
STRIDED_SLP -> gather/strided route.

        vsetivli        zero,4,e32,mf2,ta,ma
        vlse32.v        v3,0(a2),a3
        vlse32.v        v2,0(a4),a5
        vsetivli        zero,16,e8,mf4,ta,ma
        vwsubu.vv       v1,v3,v2
        vse16.v v1,0(a0)

In the benchmark, i_pix1/i_pix2 are constant in the caller and the function
gets inlined.
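
As a rough sketch of that situation (not the actual x264 source): the
variable-stride form of the function plus a caller that passes constants.
The names pixel_sub_wxh_var and sub4x4_caller are made up for illustration,
and using FENC_STRIDE for both strides is an assumption that mirrors the
reduced testcase above.

```c
#include <stdint.h>

#define FENC_STRIDE 16

/* Variable-stride form: i_pix1/i_pix2 are only known at run time here. */
static inline void pixel_sub_wxh_var(int16_t diff[16],
                                     const uint8_t *pix1, int i_pix1,
                                     const uint8_t *pix2, int i_pix2)
{
    for (int y = 0; y < 4; y++)
    {
        for (int x = 0; x < 4; x++)
            diff[x + y * 4] = pix1[x] - pix2[x];
        pix1 += i_pix1;
        pix2 += i_pix2;
    }
}

/* Hypothetical caller: once the callee is inlined, both strides are the
   compile-time constant 16, i.e. the loop looks like the testcase above. */
void sub4x4_caller(int16_t diff[16], const uint8_t *pix1, const uint8_t *pix2)
{
    pixel_sub_wxh_var(diff, pix1, FENC_STRIDE, pix2, FENC_STRIDE);
}
```

Compiled out of line, pixel_sub_wxh_var corresponds to the variable-stride
case; after inlining into sub4x4_caller the vectorizer sees constant strides.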

Then, with the strides fixed, we immediately hit

  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
      && vect_known_niters_smaller_than_vf (loop_vinfo))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not vectorized: iteration count smaller than "
                         "vectorization factor.\n");
    }

and

      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
        {
          /* Check that the loop processes at least one full vector.  */
          poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
          if (known_lt (scalar_niters, vf))
                ...
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "loop does not have enough iterations "
                                 "to support vectorization.\n");

Of course, the assumption that 4 iterations are not helpful for a VF of 16 is
basically sane.  It doesn't take into account the 4x unrolling that happens
later, though, which would enable strided code-gen.
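
To make the unrolled shape concrete, here is a scalar sketch of what the
flattened 4x4 block amounts to: four 4-byte row loads per input (the analogue
of one strided 32-bit load each) followed by a single 16-lane widening
subtract.  The function name pixel_sub_4x4_flat is hypothetical and this only
illustrates the data movement, not what the vectorizer emits.

```c
#include <stdint.h>
#include <string.h>

#define FENC_STRIDE 16

/* Hypothetical illustration: the whole 4x4 block handled as one 16-lane op. */
void pixel_sub_4x4_flat(int16_t diff[16], const uint8_t *pix1,
                        const uint8_t *pix2)
{
    uint8_t a[16], b[16];

    /* One 4-byte chunk per row, rows FENC_STRIDE bytes apart.  This is the
       scalar analogue of a strided load of four 32-bit elements.  */
    for (int y = 0; y < 4; y++)
    {
        memcpy(a + 4 * y, pix1 + y * FENC_STRIDE, 4);
        memcpy(b + 4 * y, pix2 + y * FENC_STRIDE, 4);
    }

    /* A single widening subtract over all 16 byte lanes.  */
    for (int i = 0; i < 16; i++)
        diff[i] = (int16_t)(a[i] - b[i]);
}
```

With the rows packed like this, the 16 scalar iterations match a VF of 16
exactly, which is why rejecting the loop based on the 4 inner iterations alone
loses out here.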

I haven't checked further but, at least for this particular case, deferring
the "bail" decision until later in the pipeline can be advantageous.

Of course, there is still the difference that we don't set strided_p if the
stride is constant, which makes us take different turns and eventually generate
an interleave, even-odd scheme.  ISTM we should be able to treat the constant
case similarly to the variable case at first and only bifurcate later, once we
no longer give up early.
