[Bug tree-optimization/122028] New: vect: Known vs variable stride

rdapp at gcc dot gnu.org via Gcc-bugs Mon, 22 Sep 2025 03:29:45 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122028


            Bug ID: 122028
           Summary: vect: Known vs variable stride
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---

The following is from x264's sub4x4dct:

#define FENC_STRIDE 16

typedef short int16_t;
typedef unsigned char uint8_t;

void pixel_sub_wxh(int16_t diff[16], int i_size, uint8_t *pix1, int i_pix1,
uint8_t *pix2, int i_pix2 )
{
    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            diff[x + y*4] = pix1[x] - pix2[x];
        pix1 += FENC_STRIDE;
        pix2 += FENC_STRIDE;
    }
}

Ideally we'd like to vectorize this with a VF of 16, using strided loads.  This
works if the strides are variable (i.e. the pointers advance by i_pix1 and
i_pix2 rather than by FENC_STRIDE), as we can then go the
STRIDED_SLP -> gather/strided route:

        vsetivli        zero,4,e32,mf2,ta,ma
        vlse32.v        v3,0(a2),a3
        vlse32.v        v2,0(a4),a5
        vsetivli        zero,16,e8,mf4,ta,ma
        vwsubu.vv       v1,v3,v2
        vse16.v         v1,0(a0)
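
For illustration, here is a rough scalar model in C of what the sequence above
computes.  It is only a sketch: pixel_sub_ref and pixel_sub_strided are
made-up names, and the memcpy calls merely stand in for the 32-bit elements
fetched by each vlse32.v.

```c
#include <stdint.h>
#include <string.h>

#define FENC_STRIDE 16

/* Scalar reference: the loop nest from the report.  */
static void pixel_sub_ref(int16_t diff[16], const uint8_t *pix1,
                          const uint8_t *pix2)
{
    for (int y = 0; y < 4; y++)
    {
        for (int x = 0; x < 4; x++)
            diff[x + y * 4] = pix1[x] - pix2[x];
        pix1 += FENC_STRIDE;
        pix2 += FENC_STRIDE;
    }
}

/* Strided view (hypothetical helper): every 4-byte row is treated as one
   32-bit element, so four strided element loads fill a 16-byte vector.  */
static void pixel_sub_strided(int16_t diff[16], const uint8_t *pix1,
                              int stride1, const uint8_t *pix2, int stride2)
{
    uint8_t a[16], b[16];

    for (int y = 0; y < 4; y++)
    {
        memcpy(a + 4 * y, pix1 + y * stride1, 4); /* one element of vlse32.v */
        memcpy(b + 4 * y, pix2 + y * stride2, 4);
    }

    /* Models vwsubu.vv on 16 e8 lanes: widening subtract to 16 bits.  */
    for (int i = 0; i < 16; i++)
        diff[i] = (int16_t)(a[i] - b[i]);
}
```

Both functions produce the same 16 results; the second one just makes the
"four rows become four 32-bit elements" reinterpretation explicit.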

In the benchmark, i_pix1/i_pix2 are constant in the caller and the function
gets inlined.
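
A minimal sketch of that situation, assuming the variable-stride form of the
function and a caller that passes constant strides.  The names
pixel_sub_wxh_var and sub4x4_caller are hypothetical and only serve to
illustrate the shape of the code, not the exact benchmark source.

```c
#include <stdint.h>

#define FENC_STRIDE 16

/* Variable-stride form: the row strides are ordinary parameters.  */
static inline void pixel_sub_wxh_var(int16_t *diff, int i_size,
                                     const uint8_t *pix1, int i_pix1,
                                     const uint8_t *pix2, int i_pix2)
{
    for (int y = 0; y < i_size; y++)
    {
        for (int x = 0; x < i_size; x++)
            diff[x + y * i_size] = pix1[x] - pix2[x];
        pix1 += i_pix1;
        pix2 += i_pix2;
    }
}

/* Hypothetical caller: once the call is inlined, the size and both
   strides are compile-time constants, which yields the loop nest shown
   at the top of this report.  */
void sub4x4_caller(int16_t diff[16], const uint8_t *pix1,
                   const uint8_t *pix2)
{
    pixel_sub_wxh_var(diff, 4, pix1, FENC_STRIDE, pix2, FENC_STRIDE);
}
```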

Then, with the strides fixed, we immediately hit

  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
      && vect_known_niters_smaller_than_vf (loop_vinfo))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not vectorized: iteration count smaller than "
                         "vectorization factor.\n");
    }

and
      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
        {
          /* Check that the loop processes at least one full vector.  */
          poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
          if (known_lt (scalar_niters, vf))
                ...
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "loop does not have enough iterations "
                                 "to support vectorization.\n");

Of course, the assumption that 4 iterations are not helpful for a VF of 16 is
basically sane.  It doesn't take into account the 4x unrolling that happens
later, though, which would enable strided code-gen.

I haven't checked further but, at least for this particular case, deferring
the "bail" decision to later in the pipeline could be advantageous.

Of course, there is still the difference that we don't set strided_p if the
stride is constant, which makes us take a different path and eventually
generate an interleave, even-odd scheme.  ISTM we should be able to treat the
constant case similarly to the variable case at first and only bifurcate later,
once we no longer give up early.
