https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122028
Bug ID: 122028
Summary: vect: Known vs variable stride
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: rdapp at gcc dot gnu.org
Target Milestone: ---
The following is from x264's sub4x4dct:
#define FENC_STRIDE 16
typedef short int16_t;
typedef unsigned char uint8_t;
void pixel_sub_wxh(int16_t diff[16], int i_size, uint8_t *pix1, int i_pix1,
                   uint8_t *pix2, int i_pix2 )
{
    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            diff[x + y*4] = pix1[x] - pix2[x];
        pix1 += FENC_STRIDE;
        pix2 += FENC_STRIDE;
    }
}
Ideally we'd like to vectorize this with a VF of 16 using strided loads. This
already works if the strides are variable (i.e. pix1 and pix2 advance by
i_pix1 and i_pix2 instead of FENC_STRIDE), as we can then go the
STRIDED_SLP -> gather/strided route and get:
        vsetivli        zero,4,e32,mf2,ta,ma
        vlse32.v        v3,0(a2),a3
        vlse32.v        v2,0(a4),a5
        vsetivli        zero,16,e8,mf4,ta,ma
        vwsubu.vv       v1,v3,v2
        vse16.v         v1,0(a0)
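As a rough illustration of what the sequence above computes, here is a small C model. It is only a sketch: the names pixel_sub_4x4_ref and pixel_sub_4x4_strided are made up for this example, and the scalar reference assumes the variable-stride form of the loop (rows advancing by i_pix1/i_pix2). Each vlse32.v is modelled as four 4-byte row loads at a byte stride, and vwsubu.vv plus vse16.v as 16 widening byte subtractions stored contiguously.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference with run-time strides (assumed variable-stride form).  */
void pixel_sub_4x4_ref(int16_t diff[16], const uint8_t *pix1, int i_pix1,
                       const uint8_t *pix2, int i_pix2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            diff[x + y * 4] = (int16_t)(pix1[x] - pix2[x]);
        pix1 += i_pix1;
        pix2 += i_pix2;
    }
}

/* Model of the vector sequence: two strided loads of four 32-bit
   elements, then one widening subtract over 16 byte lanes.  */
void pixel_sub_4x4_strided(int16_t diff[16], const uint8_t *pix1, int i_pix1,
                           const uint8_t *pix2, int i_pix2)
{
    uint8_t v3[16], v2[16];

    /* vlse32.v: element y is the 4-byte row at base + y * stride.  */
    for (int y = 0; y < 4; y++) {
        memcpy(v3 + 4 * y, pix1 + (ptrdiff_t)y * i_pix1, 4);
        memcpy(v2 + 4 * y, pix2 + (ptrdiff_t)y * i_pix2, 4);
    }

    /* vwsubu.vv + vse16.v: widen each 8-bit lane to 16 bits and subtract.  */
    for (int i = 0; i < 16; i++)
        diff[i] = (int16_t)(v3[i] - v2[i]);
}
```

Both functions should produce the same 16 results for any strides that keep the accesses in bounds, which is why a single strided load per input is enough for the whole 4x4 block.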
In the benchmark, i_pix1/i_pix2 are constant in the caller and the function
gets inlined.
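A minimal sketch of that situation, for reference. The caller name sub4x4_caller is hypothetical and the real x264 call site is not reproduced here; the only point is that both stride arguments are compile-time constants, so after inlining the loop looks like the FENC_STRIDE version above.

```c
#include <stdint.h>

#define FENC_STRIDE 16

/* Variable-stride form: rows advance by the run-time strides.  */
static inline void pixel_sub_wxh(int16_t diff[16], int i_size,
                                 const uint8_t *pix1, int i_pix1,
                                 const uint8_t *pix2, int i_pix2)
{
    (void)i_size; /* unused, as in the reduced test case */
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            diff[x + y * 4] = (int16_t)(pix1[x] - pix2[x]);
        pix1 += i_pix1;
        pix2 += i_pix2;
    }
}

/* Hypothetical caller: passes constant strides, so once pixel_sub_wxh
   is inlined the compiler sees known strides instead of variables.  */
void sub4x4_caller(int16_t diff[16], const uint8_t *pix1, const uint8_t *pix2)
{
    pixel_sub_wxh(diff, 4, pix1, FENC_STRIDE, pix2, FENC_STRIDE);
}
```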
Then, with the strides fixed we immediately hit
  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
      && vect_known_niters_smaller_than_vf (loop_vinfo))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not vectorized: iteration count smaller than "
                         "vectorization factor.\n");
    }
and
  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
    {
      /* Check that the loop processes at least one full vector. */
      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
      if (known_lt (scalar_niters, vf))
        ...
          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                           "loop does not have enough iterations "
                           "to support vectorization.\n");
Of course, the assumption that 4 iterations are not helpful at VF=16 is
basically sane. However, it doesn't take into account the 4x unrolling that
happens later, which would enable strided code-gen.
I haven't checked further but, at least for this particular case, deferring
the "bail" decision until later in the pipeline could be advantageous.
Of course, there is still the difference that we don't set strided_p if the
stride is constant, which makes us take different turns and eventually generate
an interleave/even-odd scheme. ISTM we should be able to treat the constant
case like the variable case at first and only bifurcate later, once we no
longer give up early.