Richard Biener <rguent...@suse.de> writes:
> Code generation for contiguous load vectorization can already deal
> with generalized avoidance of loading from a gap.  The following
> extends detection of the peeling-for-gaps requirement with that,
> gets rid of the old special casing of a half load and makes sure
> that when we do access the gap we have peeling for gaps enabled.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> This is the first patch in a series to improve peeling for gaps,
> and it turned into an improvement of the generated code rather than
> just doing the (delayed from stage3) removal of the "old" half-vector
> codepath.
>
> I'll wait for the pre-CI testing before pushing so that you also
> have time for some comments.
LGTM FWIW (some trivia below).

Out of interest, how far are we off being able to load:

  a[i*8+0]
  a[i*8+1]
  a[i*8+3]
  a[i*8+4]

as two half vectors?  It doesn't look like we're quite there yet,
but I might have misread.  (There's a compilable sketch of this
pattern below, after the quoted patch.)

It would be nice if we could eventually integrate the overrun_p
checks with the vectorizable_load code that those checks are trying
to predict.  E.g. we could run through the vectorizable_load code
during the analysis phase and record overruns, similarly to Kewen's
costing patches.  As it stands, it seems difficult to make sure that
the two checks are exactly in sync, especially when the structure is
so different.

> Richard.
>
> 	PR tree-optimization/115252
> 	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
> 	detecting the number of cases where we can avoid accessing a gap
> 	during code generation.
> 	(vectorizable_load): Remove old half-vector peeling for gap
> 	avoidance which is now redundant.  Add gap-aligned case where
> 	it's OK to access the gap.  Add assert that we have peeling for
> 	gaps enabled when we access a gap.
>
> 	* gcc.dg/vect/slp-gap-1.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
>  gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
>  2 files changed, 46 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> new file mode 100644
> index 00000000000..36463ca22c5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3" } */
> +
> +typedef unsigned char uint8_t;
> +typedef short int16_t;
> +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> +  for (int y = 0; y < 4; y++) {
> +    for (int x = 0; x < 4; x++)
> +      diff[x + y * 4] = pix1[x] - pix2[x];
> +    pix1 += 16;
> +    pix2 += 32;
> +  }
> +}
> +
> +/* We can vectorize this without peeling for gaps and thus without epilogue,
> +   but the only thing we can reliably scan is the zero-padding trick for the
> +   partial loads.  */
> +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index a01099d3456..b26cc74f417 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>        dr_alignment_support alss;
>        int misalign = dr_misalignment (first_dr_info, vectype);
>        tree half_vtype;
> +      poly_uint64 remain;
> +      unsigned HOST_WIDE_INT tem, num;
>        if (overrun_p
>            && !masked_p
>            && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
>                                                        vectype, misalign)))
>                 == dr_aligned
>                || alss == dr_unaligned_supported)
> -          && known_eq (nunits, (group_size - gap) * 2)
> -          && known_eq (nunits, group_size)
> -          && (vector_vector_composition_type (vectype, 2, &half_vtype)
> -              != NULL_TREE))
> +          && can_div_trunc_p (group_size
> +                              * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> +                              nunits, &tem, &remain)
> +          && (known_eq (remain, 0u)
> +              || (constant_multiple_p (nunits, remain, &num)
> +                  && (vector_vector_composition_type (vectype, num,
> +                                                      &half_vtype)
> +                      != NULL_TREE))))
>          overrun_p = false;

Might be worth renaming half_vtype now that it isn't necessarily
a strict half.
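To make the two-half-vector question above concrete, here's a
hypothetical testcase for the interior-gap pattern (the function
name and types are my invention): group size 8, with elements 2
and 5-7 of each group unused.

  void
  interior_gap (short *__restrict out, short *a, int n)
  {
    for (int i = 0; i < n; i++)
      {
	/* Uses lanes 0, 1, 3 and 4 of each 8-element group; lane 2
	   is an interior gap and lanes 5-7 a trailing gap.  */
	out[i * 4 + 0] = a[i * 8 + 0];
	out[i * 4 + 1] = a[i * 8 + 1];
	out[i * 4 + 2] = a[i * 8 + 3];
	out[i * 4 + 3] = a[i * 8 + 4];
      }
  }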
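And to spell out the arithmetic in the new condition, here's a
standalone sketch with plain integers standing in for poly_uint64.
The concrete values, the VF in particular, are illustrative
assumptions loosely based on slp-gap-1.c, not taken from a real
vectorizer dump.

  #include <stdio.h>

  int
  main (void)
  {
    unsigned group_size = 16;	/* pix1 in slp-gap-1.c: stride of 16 ...  */
    unsigned gap = 12;		/* ... with the trailing 12 elements unused */
    unsigned vf = 4;		/* assumed vectorization factor */
    unsigned nunits = 16;	/* assumed V16QI */

    /* can_div_trunc_p: how many whole vectors cover the accessed
       elements, and how many elements remain.  */
    unsigned accessed = group_size * vf - gap;	/* 52 */
    unsigned tem = accessed / nunits;		/* 3 full vectors */
    unsigned remain = accessed % nunits;	/* 4 elements left over */

    if (remain == 0)
      printf ("%u full vectors, gap never touched\n", tem);
    else if (nunits % remain == 0)
      /* constant_multiple_p + vector_vector_composition_type: the tail
	 can be done as one load of 1/(nunits/remain) of a vector,
	 zero-padded to full width, so no peeling for gaps is needed.  */
      printf ("tail of %u elements: load 1/%u of a vector\n",
	      remain, nunits / remain);
    else
      printf ("would need peeling for gaps\n");
    return 0;
  }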
>
>        if (overrun_p && !can_overrun_p)
> @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
>  		    unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
>  		    unsigned int vect_align
>  		      = vect_known_alignment_in_bytes (first_dr_info, vectype);
> -		    unsigned int scalar_dr_size
> -		      = vect_get_scalar_dr_size (first_dr_info);
> -		    /* If there's no peeling for gaps but we have a gap
> -		       with slp loads then load the lower half of the
> -		       vector only.  See get_group_load_store_type for
> -		       when we apply this optimization.  */
> -		    if (slp
> -			&& loop_vinfo
> -			&& !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
> -			&& known_eq (nunits, (group_size - gap) * 2)
> -			&& known_eq (nunits, group_size)
> -			&& gap >= (vect_align / scalar_dr_size))
> -		      {
> -			tree half_vtype;
> -			new_vtype
> -			  = vector_vector_composition_type (vectype, 2,
> -							    &half_vtype);
> -			if (new_vtype != NULL_TREE)
> -			  ltype = half_vtype;
> -		      }
>  		    /* Try to use a single smaller load when we are about
>  		       to load excess elements compared to the unrolled
> -		       scalar loop.
> -		       ???  This should cover the above case as well.  */
> -		    else if (known_gt ((vec_num * j + i + 1) * nunits,
> +		       scalar loop.  */
> +		    if (known_gt ((vec_num * j + i + 1) * nunits,
>  				       (group_size * vf - gap)))

Missing reindentation.

Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
We can only skip the smaller load if we know for sure that the full
load won't cross into the gap.  (Not sure whether the difference can
trigger in practice.)

Thanks,
Richard

>  		      {
> +			poly_uint64 remain = ((group_size * vf - gap)
> +					      - (vec_num * j + i) * nunits);
>  			if (known_ge ((vec_num * j + i + 1) * nunits
>  				      - (group_size * vf - gap), nunits))
>  			  /* DR will be unused.  */
> @@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
>  			     at least one element is accessed in the
>  			     scalar loop.  */
>  			  ;
> +			else if (known_gt (vect_align,
> +					   ((nunits - remain)
> +					    * vect_get_scalar_dr_size
> +						(first_dr_info))))
> +			  /* Aligned access to the gap area when there's
> +			     at least one element in it is OK.  */
> +			  ;
>  			else
>  			  {
> -			    auto remain
> -			      = ((group_size * vf - gap)
> -				 - (vec_num * j + i) * nunits);
>  			    /* remain should now be > 0 and < nunits.  */
>  			    unsigned num;
>  			    if (constant_multiple_p (nunits, remain, &num))
> @@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
>  				ltype = ptype;
>  			      }
>  			    /* Else use multiple loads or a masked load?  */
> +			    /* For loop vectorization we now should have
> +			       an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
> +			       set.  */
> +			    if (loop_vinfo)
> +			      gcc_assert (new_vtype
> +					  || LOOP_VINFO_PEELING_FOR_GAPS
> +						(loop_vinfo));
>  			  }
>  		      }
>  		    tree offset
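P.S. For my own understanding of the new gap-aligned case: a
standalone sketch of the test, again with plain integers standing in
for poly_uint64 and with illustrative values of my choosing.

  #include <stdio.h>

  int
  main (void)
  {
    unsigned vect_align = 16;	/* known alignment of the access in bytes */
    unsigned scalar_size = 1;	/* element size in bytes (uint8_t) */
    unsigned nunits = 16;	/* assumed V16QI */
    unsigned remain = 8;	/* elements before the gap in the last load */

    /* The full-width load reads (nunits - remain) elements from the
       gap.  If those bytes still lie within the vect_align-byte
       aligned block that the accessed elements end in, the load cannot
       cross into an unmapped page, so touching the gap is safe.  */
    if (vect_align > (nunits - remain) * scalar_size)
      printf ("aligned gap access is safe\n");
    else
      printf ("would need peeling for gaps (or a smaller load)\n");
    return 0;
  }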