Hi Pengfei,

This looks really good; I've just left a couple of small comments below.

On 06/06/2025 14:35, Pengfei Li wrote:
> Current GCC uses either peeling or versioning, but not in combination,
> to handle unaligned data references (DRs) during vectorization. This
> limitation causes some loops with early break to fall back to scalar
> code at runtime.
> 
> Consider the following loop with DRs in its early break condition:
> 
>       for (int i = start; i < end; i++) {
>         if (a[i] == b[i])
>           break;
>         count++;
>       }
> 
> In the loop, references to a[] and b[] need to be strictly aligned for
> vectorization because speculative reads that may cross page boundaries
> are not allowed. Current GCC does versioning for this loop by creating a
> runtime check like:
> 
>       ((&a[start] | &b[start]) & mask) == 0
> 
> to see if the two initial addresses both have zero lower bits. If the
> above runtime check fails, the loop will fall back to scalar code.
> However, it is often the case that the DRs are all unaligned at the
> beginning but all become aligned after a few loop iterations. We call
> this situation the DRs being "mutually aligned".
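
To make that concrete (my numbers, not from the patch): with 16-byte
vectors and, say, 4-byte elements, if &a[start] and &b[start] both end in
...4, the two DRs start out misaligned by 4 bytes; after peeling 3 scalar
iterations both pointers have advanced by 12 bytes and are 16-byte
aligned, so the remaining iterations can use aligned vector accesses.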
> 
> This patch enables combined peeling and versioning to avoid loops with
> mutually aligned DRs falling back to scalar code. Specifically, the
> function vect_peeling_supportable is updated in this patch to return a
> three-state enum indicating whether peeling can make all unsupportable
> DRs aligned. In addition to the previous true/false return values, a new
> state peeling_maybe_supported indicates that peeling may be able to make
> these DRs aligned, but we cannot be sure of that at compile time. In
> this case, peeling should be combined with versioning so that a runtime
> check will be generated to guard the peeled vectorized loop.
> 
> A new type of runtime check is also introduced for combined peeling and
> versioning. It's enabled when LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT is true.
> The new check tests if all DRs recorded in LOOP_VINFO_MAY_MISALIGN_STMTS
> have the same lower address bits. For the loop above, the new test will
> generate an XOR between two addresses, like:
> 
>       ((&a[start] ^ &b[start]) & mask) == 0
> 
> Therefore, if a and b have the same alignment step (element size) and
> the same offset from an alignment boundary, a peeled vectorized loop
> will run. This new runtime check also works for >2 DRs, with the LHS
> expression being:
> 
>       ((a1 ^ a2) | (a2 ^ a3) | (a3 ^ a4) | ... | (an-1 ^ an)) & mask
> 
> where ai is the address of the i'th DR.
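
FWIW, the check being described is easy to model in scalar C.  This is
just an illustrative sketch (the function name and signature are mine;
the patch emits the equivalent test as gimple):

  #include <stdint.h>
  #include <stddef.h>

  /* Return nonzero if all N addresses share the same low bits under MASK
     (mask = vector alignment - 1, e.g. 15 for 16-byte vectors), i.e. they
     can all be brought to an alignment boundary by peeling the same
     number of iterations when the DRs share one step.  */
  static int
  mutually_aligned_p (void *addrs[], size_t n, uintptr_t mask)
  {
    uintptr_t acc = 0;
    for (size_t i = 1; i < n; i++)
      acc |= (uintptr_t) addrs[i - 1] ^ (uintptr_t) addrs[i];
    return (acc & mask) == 0;
  }

For n == 2 this reduces to exactly the
((&a[start] ^ &b[start]) & mask) == 0 test above.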
> 
> This patch is bootstrapped and regression tested on x86_64-linux-gnu,
> arm-linux-gnueabihf and aarch64-linux-gnu.
> 
> gcc/ChangeLog:
> 
>       * tree-vect-data-refs.cc (vect_peeling_supportable): Return new
>       enum values to indicate if combined peeling and versioning can
>       potentially support vectorization.
>       (vect_enhance_data_refs_alignment): Support combined peeling and
>       versioning in vectorization analysis.
>       * tree-vect-loop-manip.cc (vect_create_cond_for_align_checks):
>       Add a new type of runtime check for mutually aligned DRs.
>       * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Set
>       default value of allow_mutual_alignment in the initializer list.
>       * tree-vectorizer.h (enum peeling_support): Define type of
>       peeling support for function vect_peeling_supportable.
>       (LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT): New access macro.
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.dg/vect/vect-early-break_133_pfa6.c: Adjust test.
> ---
>  .../gcc.dg/vect/vect-early-break_133_pfa6.c   |   2 +-
>  gcc/tree-vect-data-refs.cc                    | 168 ++++++++++++++----
>  gcc/tree-vect-loop-manip.cc                   |  96 +++++++---
>  gcc/tree-vect-loop.cc                         |   1 +
>  gcc/tree-vectorizer.h                         |  16 ++
>  5 files changed, 222 insertions(+), 61 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_133_pfa6.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_133_pfa6.c
> index ee123df6ed2..7787d037d9d 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_133_pfa6.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_133_pfa6.c
> @@ -20,4 +20,4 @@ unsigned test4(char x, char *vect_a, char *vect_b, int n)
>   return ret;
>  }
>  
> -/* { dg-final { scan-tree-dump "Versioning for alignment will be applied" "vect" } } */
> +/* { dg-final { scan-tree-dump "Both peeling and versioning will be applied" "vect" } } */
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index 1792ee4ea05..befdbff29f3 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -2111,9 +2111,10 @@ vect_peeling_hash_choose_best_peeling (hash_table<peel_info_hasher> *peeling_hta
>     return res;
>  }
>  
> -/* Return true if the new peeling NPEEL is supported.  */
> +/* Return if vectorization is definitely, possibly, or unlikely to be
> +   supportable after loop peeling.  */
>  
> -static bool
> +static enum peeling_support
>  vect_peeling_supportable (loop_vec_info loop_vinfo, dr_vec_info *dr0_info,
>                         unsigned npeel)
>  {
> @@ -2123,8 +2124,11 @@ vect_peeling_supportable (loop_vec_info loop_vinfo, dr_vec_info *dr0_info,
>    bool dr0_alignment_known_p
>      = known_alignment_for_access_p (dr0_info,
>                                   STMT_VINFO_VECTYPE (dr0_info->stmt));
> +  bool has_unsupported_dr_p = false;
> +  unsigned int dr0_step = DR_STEP_ALIGNMENT (dr0_info->dr);
> +  int known_unsupported_misalignment = DR_MISALIGNMENT_UNKNOWN;
>  
> -  /* Ensure that all data refs can be vectorized after the peel.  */
> +  /* Check if each data ref can be vectorized after peeling.  */
>    for (data_reference *dr : datarefs)
>      {
>        if (dr == dr0_info->dr)
> @@ -2152,10 +2156,44 @@ vect_peeling_supportable (loop_vec_info loop_vinfo, dr_vec_info *dr0_info,
>       = vect_supportable_dr_alignment (loop_vinfo, dr_info, vectype,
>                                        misalignment);
>        if (supportable_dr_alignment == dr_unaligned_unsupported)
> -     return false;
> +     {
> +       has_unsupported_dr_p = true;
> +
> +       /* If unaligned unsupported DRs exist, we do following checks to see
> +          if they can be mutually aligned to support vectorization.  If yes,
> +          we can try peeling and create a runtime (mutual alignment) check
> +          to guard the peeled loop.  If no, return PEELING_UNSUPPORTED.  */
> +
> +       /* 1) If unaligned unsupported DRs have different alignment steps, the
> +             probability of DRs being mutually aligned is very low, and it's
> +             quite complex to check mutual alignment at runtime.  We return
> +             PEELING_UNSUPPORTED in this case.  */

It might be nice to at least experiment with supporting DRs with
different steps as follow-on work.  I agree that we should leave it out
for the first version to keep things simple.

FWIW, in case it's of interest, I wrote a script to calculate the
possible combinations of alignments that we could mutually peel for DRs
with different steps.  The script also calculates the probability of
having a viable combination of alignments at runtime, assuming the
alignments are chosen independently and uniformly at random.  These are
the results for N = 2 DRs (assuming target vector alignment is 16):

DR sizes: [2, 1]
-> DR alignments: [0, 8], peel_niters = 8
-> DR alignments: [2, 9], peel_niters = 7
-> DR alignments: [4, 10], peel_niters = 6
-> DR alignments: [6, 11], peel_niters = 5
-> DR alignments: [8, 12], peel_niters = 4
-> DR alignments: [10, 13], peel_niters = 3
-> DR alignments: [12, 14], peel_niters = 2
-> DR alignments: [14, 15], peel_niters = 1
Pr(success) = 8/127 = 6.30%

DR sizes: [4, 1]
-> DR alignments: [0, 12], peel_niters = 4
-> DR alignments: [4, 13], peel_niters = 3
-> DR alignments: [8, 14], peel_niters = 2
-> DR alignments: [12, 15], peel_niters = 1
Pr(success) = 4/63 = 6.35%

DR sizes: [8, 1]
-> DR alignments: [0, 14], peel_niters = 2
-> DR alignments: [8, 15], peel_niters = 1
Pr(success) = 2/31 = 6.45%

DR sizes: [4, 2]
-> DR alignments: [0, 8], peel_niters = 4
-> DR alignments: [4, 10], peel_niters = 3
-> DR alignments: [8, 12], peel_niters = 2
-> DR alignments: [12, 14], peel_niters = 1
Pr(success) = 4/31 = 12.90%

DR sizes: [8, 2]
-> DR alignments: [0, 12], peel_niters = 2
-> DR alignments: [8, 14], peel_niters = 1
Pr(success) = 2/15 = 13.33%

DR sizes: [8, 4]
-> DR alignments: [0, 8], peel_niters = 2
-> DR alignments: [8, 12], peel_niters = 1
Pr(success) = 2/7 = 28.57%

In general, as the number of DRs increases, the probability of success
of this strategy tends to decrease.  The lowest-probability combination
with N = 3 is:

DR sizes: [2, 1, 1]
-> DR alignments: [0, 8, 8], peel_niters = 8
-> DR alignments: [2, 9, 9], peel_niters = 7
-> DR alignments: [4, 10, 10], peel_niters = 6
-> DR alignments: [6, 11, 11], peel_niters = 5
-> DR alignments: [8, 12, 12], peel_niters = 4
-> DR alignments: [10, 13, 13], peel_niters = 3
-> DR alignments: [12, 14, 14], peel_niters = 2
-> DR alignments: [14, 15, 15], peel_niters = 1
Pr(success) = 8/2047 = 0.39%

but there are still some N = 3 combinations with reasonable chances of
success:

DR sizes: [8, 8, 2]
-> DR alignments: [0, 0, 12], peel_niters = 2
-> DR alignments: [8, 8, 14], peel_niters = 1
Pr(success) = 2/31 = 6.45%

DR sizes: [8, 4, 4]
-> DR alignments: [0, 8, 8], peel_niters = 2
-> DR alignments: [8, 12, 12], peel_niters = 1
Pr(success) = 2/31 = 6.45%

DR sizes: [8, 8, 4]
-> DR alignments: [0, 0, 8], peel_niters = 2
-> DR alignments: [8, 8, 12], peel_niters = 1
Pr(success) = 2/15 = 13.33%

So we could in theory attempt to calculate at compile time how likely
the mutual peeling strategy is to succeed, or otherwise use some
heuristic to decide whether to attempt it.  It might be worth doing at
least for N = 2, especially if it turns out to help in practice.
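
In case it's useful, here is a rough C sketch of the kind of enumeration
the script performs, specialised to the two-DR case (my actual script
handles general N, but the idea is the same):

  #include <stdio.h>

  #define V 16   /* Target vector alignment.  */

  int
  main (void)
  {
    const int step[2] = { 2, 1 };   /* DR sizes, here [2, 1].  */
    int viable = 0, total = 0;

    /* Enumerate every combination of initial misalignments, each one a
       multiple of its DR's step (i.e. naturally aligned elements).  */
    for (int a0 = 0; a0 < V; a0 += step[0])
      for (int a1 = 0; a1 < V; a1 += step[1])
        {
          if (a0 == 0 && a1 == 0)
            continue;   /* Already aligned, nothing to peel.  */
          total++;
          /* Number of scalar iterations needed to align each DR.  */
          int k0 = (V - a0) / step[0];
          int k1 = (V - a1) / step[1];
          if (k0 == k1)
            {
              printf ("-> DR alignments: [%d, %d], peel_niters = %d\n",
                      a0, a1, k0);
              viable++;
            }
        }
    printf ("Pr(success) = %d/%d\n", viable, total);
    return 0;
  }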

I think the runtime alignment check boils down to calculating how many
niters would be needed to peel the first DR, and then checking that you
get the same result for the other DRs.  I.e. checking:

distance(a1)/step(a1) = ... = distance(aN)/step(aN)

where distance(dr) gives the distance required to peel, i.e.
distance(a) = V - (a % V), where V is the target vector alignment.  Given
that the steps and V are powers of two known at compile time, it should be
possible to optimize this check to be relatively efficient (although
likely not quite as fast as the xor check for the same-step case).
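
In scalar C terms that check might look something like the sketch below
(the names are made up, and it assumes each address is a multiple of its
own step, i.e. naturally aligned elements):

  #include <stdint.h>
  #include <stddef.h>

  /* V is the target vector alignment; STEP[i] is the (power-of-two)
     element size of the i'th DR.  Return nonzero if the same number of
     peeled iterations would align every DR.  */
  static int
  mutually_peelable_p (void *addrs[], const uintptr_t step[], size_t n,
                       uintptr_t V)
  {
    uintptr_t niters0 = (V - ((uintptr_t) addrs[0] & (V - 1))) / step[0];
    for (size_t i = 1; i < n; i++)
      if ((V - ((uintptr_t) addrs[i] & (V - 1))) / step[i] != niters0)
        return 0;
    return 1;
  }

Since V and the steps are power-of-two compile-time constants, the
divisions and modulo collapse into shifts and masks in the generated
code.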

> +       if (DR_STEP_ALIGNMENT (dr) != dr0_step)
> +         return peeling_unsupported;
> +
> +       /* 2) Based on above same alignment step condition, if one known
> +             misaligned DR has zero misalignment, or different misalignment
> +             amount from another known misaligned DR, peeling is unable to
> +             help make all these DRs aligned together.  We won't try peeling
> +             with versioning anymore.  */
> +       int curr_dr_misalignment = dr_misalignment (dr_info, vectype);
> +       if (curr_dr_misalignment == 0)
> +         return peeling_unsupported;
> +       if (known_unsupported_misalignment != DR_MISALIGNMENT_UNKNOWN)
> +         {
> +           if (curr_dr_misalignment != DR_MISALIGNMENT_UNKNOWN
> +               && curr_dr_misalignment != known_unsupported_misalignment)
> +             return peeling_unsupported;
> +         }
> +       else
> +         known_unsupported_misalignment = curr_dr_misalignment;
> +     }
>      }
>  
> -  return true;
> +  /* Vectorization is known to be supportable with peeling alone when there is
> +     no unsupported DR.  */
> +  return has_unsupported_dr_p ? peeling_maybe_supported
> +                           : peeling_known_supported;
>  }
>  
>  /* Compare two data-references DRA and DRB to group them into chunks
> @@ -2264,20 +2302,20 @@ dr_align_group_sort_cmp (const void *dra_, const void *drb_)
>       }
>  
>       -- Possibility 3: combination of loop peeling and versioning:
> -     for (i = 0; i < 3; i++){        # (scalar loop, not to be vectorized).
> -     x = q[i];
> -     p[i] = y;
> -     }
> -     if (p is aligned) {
> -     for (i = 3; i<N; i++){  # loop 3A
> +     if (p & q are mutually aligned) {
> +     for (i=0; i<3; i++){    # (peeled loop iterations).
> +       x = q[i];
> +       p[i] = y;
> +     }
> +     for (i=3; i<N; i++){    # loop 3A
>         x = q[i];                     # DR_MISALIGNMENT(q) = 0
>         p[i] = y;                     # DR_MISALIGNMENT(p) = 0
>       }
>       }
>       else {
> -     for (i = 3; i<N; i++){  # loop 3B
> -       x = q[i];                     # DR_MISALIGNMENT(q) = 0
> -       p[i] = y;                     # DR_MISALIGNMENT(p) = unaligned
> +     for (i=0; i<N; i++){    # (scalar loop, not to be vectorized).
> +       x = q[i];                     # DR_MISALIGNMENT(q) = 3
> +       p[i] = y;                     # DR_MISALIGNMENT(p) = unknown
>       }
>       }
>  
> @@ -2296,6 +2334,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>    unsigned int i;
>    bool do_peeling = false;
>    bool do_versioning = false;
> +  bool try_peeling_with_versioning = false;
>    unsigned int npeel = 0;
>    bool one_misalignment_known = false;
>    bool one_misalignment_unknown = false;
> @@ -2361,30 +2400,38 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>    /* While cost model enhancements are expected in the future, the high level
>       view of the code at this time is as follows:
>  
> -     A) If there is a misaligned access then see if peeling to align
> -        this access can make all data references satisfy
> -        vect_supportable_dr_alignment.  If so, update data structures
> -        as needed and return true.
> +     A) If there is a misaligned access then see if doing peeling alone can
> +     make all data references satisfy vect_supportable_dr_alignment.  If so,
> +     update data structures and return.
> +
> +     B) If peeling alone wasn't possible and there is a data reference with an
> +     unknown misalignment that does not satisfy vect_supportable_dr_alignment
> +     then we may use either of the following two approaches.
>  
> -     B) If peeling wasn't possible and there is a data reference with an
> -        unknown misalignment that does not satisfy vect_supportable_dr_alignment
> -        then see if loop versioning checks can be used to make all data
> -        references satisfy vect_supportable_dr_alignment.  If so, update
> -        data structures as needed and return true.
> +     B1) Try peeling with versioning: Add a runtime loop versioning check to
> +         see if all unsupportable data references are mutually aligned, which
> +         means they will be uniformly aligned after a certain amount of loop
> +         peeling.  If peeling and versioning can be used together, set
> +         LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT_P to TRUE and return.
>  
> -     C) If neither peeling nor versioning were successful then return false if
> -        any data reference does not satisfy vect_supportable_dr_alignment.
> +     B2) Try versioning alone: Add a runtime loop versioning check to see if
> +         all unsupportable data references are already uniformly aligned
> +         without loop peeling.  If versioning can be applied alone, set
> +         LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT_P to FALSE and return.
>  
> -     D) Return true (all data references satisfy vect_supportable_dr_alignment).
> +     Above B1 is more powerful and more likely to be adopted than B2.  But B2
> +     is still available and useful in some cases, for example, the cost model
> +     does not allow much peeling.
>  
> -     Note, Possibility 3 above (which is peeling and versioning together) is not
> -     being done at this time.  */
> +     C) If none of above was successful then the alignment was not enhanced,
> +     just return.  */
>  
>    /* (1) Peeling to force alignment.  */
>  
> -  /* (1.1) Decide whether to perform peeling, and how many iterations to peel:
> +  /* (1.1) Decide whether to perform peeling, how many iterations to peel, and
> +     if vectorization may be supported by peeling with versioning.
>       Considerations:
> -     + How many accesses will become aligned due to the peeling
> +     - How many accesses will become aligned due to the peeling
>       - How many accesses will become unaligned due to the peeling,
>         and the cost of misaligned accesses.
>       - The cost of peeling (the extra runtime checks, the increase
> @@ -2732,9 +2779,27 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>                               "Try peeling by %d\n", npeel);
>          }
>  
> -      /* Ensure that all datarefs can be vectorized after the peel.  */
> -      if (!vect_peeling_supportable (loop_vinfo, dr0_info, npeel))
> -     do_peeling = false;
> +      /* Check how peeling for alignment can support vectorization.  Function
> +      vect_peeling_supportable returns one of the three possible values:
> +      - PEELING_KNOWN_SUPPORTED: indicates that we know all unsupported
> +        datarefs can be aligned after peeling.  We can use peeling alone.
> +      - PEELING_MAYBE_SUPPORTED: indicates that peeling may be able to make
> +        these datarefs aligned but we are not sure about it at compile time.
> +        We will try peeling with versioning to add a runtime check to guard
> +        the peeled loop.
> +      - PEELING_UNSUPPORTED: indicates that peeling is almost impossible to
> +        support vectorization.  We will stop trying peeling.  */
> +      switch (vect_peeling_supportable (loop_vinfo, dr0_info, npeel))
> +     {
> +     case peeling_known_supported:
> +       break;
> +     case peeling_maybe_supported:
> +       try_peeling_with_versioning = true;
> +       break;
> +     case peeling_unsupported:
> +       do_peeling = false;
> +       break;
> +     }
>  
>        /* Check if all datarefs are supportable and log.  */
>        if (do_peeling
> @@ -2811,7 +2876,11 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>  
>               vect_update_misalignment_for_peel (dr_info, dr0_info, npeel);
>             }
> +     }
>  
> +      if (do_peeling && !try_peeling_with_versioning)
> +     {
> +       /* Update data structures if peeling will be applied alone.  */
>            LOOP_VINFO_UNALIGNED_DR (loop_vinfo) = dr0_info;
>            if (npeel)
>              LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) = npeel;
> @@ -2939,6 +3008,11 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>          LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo).truncate (0);
>      }
>  
> +  /* If we are trying peeling with versioning but versioning is disabled for
> +     some reason, peeling should be turned off together.  */
> +  if (try_peeling_with_versioning && !do_versioning)
> +    do_peeling = false;
> +
>    if (do_versioning)
>      {
>        const vec<stmt_vec_info> &may_misalign_stmts
> @@ -2958,12 +3032,28 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>                               "Alignment of access forced using versioning.\n");
>          }
>  
> -      if (dump_enabled_p ())
> -        dump_printf_loc (MSG_NOTE, vect_location,
> -                         "Versioning for alignment will be applied.\n");
> -
> -      /* Peeling and versioning can't be done together at this time.  */
> -      gcc_assert (! (do_peeling && do_versioning));
> +      if (do_peeling)
> +     {
> +       /* This point is reached if peeling and versioning are used together
> +          to ensure alignment.  Update data structures to make sure the loop
> +          is correctly peeled and a right runtime check is added for loop
> +          versioning.  */
> +       gcc_assert (try_peeling_with_versioning);
> +       LOOP_VINFO_UNALIGNED_DR (loop_vinfo) = dr0_info;
> +       LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) = -1;
> +       LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT (loop_vinfo) = true;
> +       if (dump_enabled_p ())
> +         dump_printf_loc (MSG_NOTE, vect_location,
> +                          "Both peeling and versioning will be applied.\n");
> +     }
> +      else
> +     {
> +       /* This point is reached if versioning is used alone.  */
> +       LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT (loop_vinfo) = false;
> +       if (dump_enabled_p ())
> +         dump_printf_loc (MSG_NOTE, vect_location,
> +                          "Versioning for alignment will be applied.\n");
> +     }
>  
>        return opt_result::success ();
>      }
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 56a4e9a8b63..18f7c1bcea6 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -3787,10 +3787,11 @@ chain_cond_expr (tree *cond_expr, tree part_cond_expr)
>  
>     Input:
>     COND_EXPR  - input conditional expression.  New conditions will be chained
> -                with logical AND operation.
> -   LOOP_VINFO - two fields of the loop information are used.
> -                LOOP_VINFO_PTR_MASK is the mask used to check the alignment.
> -                LOOP_VINFO_MAY_MISALIGN_STMTS contains the refs to be checked.
> +             with logical AND operation.
> +   LOOP_VINFO - three fields of the loop information are used.
> +             LOOP_VINFO_PTR_MASK is the mask used to check the alignment.
> +             LOOP_VINFO_MAY_MISALIGN_STMTS contains the refs to be checked.
> +             LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT indicates which check applies.
>  
>     Output:
>     COND_EXPR_STMT_LIST - statements needed to construct the conditional
> @@ -3798,7 +3799,20 @@ chain_cond_expr (tree *cond_expr, tree part_cond_expr)
>     The returned value is the conditional expression to be used in the if
>     statement that controls which version of the loop gets executed at runtime.
>  
> -   The algorithm makes two assumptions:
> +   Based on the boolean value of LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT, we decide
> +   which type of check should be applied and create two different expressions
> +   accordingly.
> +     1) When LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT is false, we see if all data refs
> +     to be checked are already aligned to an alignment boundary.  We create
> +     an expression of "(a_1 | a_2 | a_3 | ... | a_n) & mask", where "a_i" is
> +     the address of i'th data reference.
> +     2) When LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT is true, we see if all data refs
> +     can be aligned to a boundary after a certain amount of peeling, in other
> +     words, their addresses have the same bottom bits according to the mask.
> +     We create "((a_1 ^ a_2) | (a_2 ^ a_3) | ... | (a_n-1 ^ a_n)) & mask",
> +     where "a_i" is the address of i'th data reference.
> +
> +   Both algorithms make two assumptions:
>       1) The number of bytes "n" in a vector is a power of 2.
>       2) An address "a" is aligned if a%n is zero and that this
>          test can be done as a&(n-1) == 0.  For example, for 16
> @@ -3818,6 +3832,7 @@ vect_create_cond_for_align_checks (loop_vec_info loop_vinfo,
>    tree int_ptrsize_type;
>    char tmp_name[20];
>    tree or_tmp_name = NULL_TREE;
> +  tree prev_addr_tmp_name = NULL_TREE;
>    tree and_tmp_name;
>    gimple *and_stmt;
>    tree ptrsize_zero;
> @@ -3829,16 +3844,19 @@ vect_create_cond_for_align_checks (loop_vec_info loop_vinfo,
>  
>    int_ptrsize_type = signed_type_for (ptr_type_node);
>  
> -  /* Create expression (mask & (dr_1 || ... || dr_n)) where dr_i is the address
> -     of the first vector of the i'th data reference. */
> +  /* If LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT is true, we should have at least two
> +     datarefs to check the mutual alignment.  */
> +  gcc_assert (may_misalign_stmts.length () > 1
> +           || !LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT (loop_vinfo));
>  
>    FOR_EACH_VEC_ELT (may_misalign_stmts, i, stmt_info)
>      {
>        gimple_seq new_stmt_list = NULL;
>        tree addr_base;
>        tree addr_tmp_name;
> +      tree xor_tmp_name;
>        tree new_or_tmp_name;
> -      gimple *addr_stmt, *or_stmt;
> +      gimple *addr_stmt, *or_stmt, *xor_stmt;
>        tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>        bool negative = tree_int_cst_compare
>       (DR_STEP (STMT_VINFO_DATA_REF (stmt_info)), size_zero_node) < 0;
> @@ -3860,20 +3878,56 @@ vect_create_cond_for_align_checks (loop_vec_info loop_vinfo,
>        addr_stmt = gimple_build_assign (addr_tmp_name, NOP_EXPR, addr_base);
>        gimple_seq_add_stmt (cond_expr_stmt_list, addr_stmt);
>  
> -      /* The addresses are OR together.  */
> -
> -      if (or_tmp_name != NULL_TREE)
> -        {
> -          /* create: or_tmp = or_tmp | addr_tmp */
> -          sprintf (tmp_name, "orptrs%d", i);
> -       new_or_tmp_name = make_temp_ssa_name (int_ptrsize_type, NULL, tmp_name);
> -       or_stmt = gimple_build_assign (new_or_tmp_name, BIT_IOR_EXPR,
> -                                      or_tmp_name, addr_tmp_name);
> -       gimple_seq_add_stmt (cond_expr_stmt_list, or_stmt);
> -          or_tmp_name = new_or_tmp_name;
> -        }
> +      if (LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT (loop_vinfo))
> +     {
> +       /* Create "((a_1 ^ a_2) | (a_2 ^ a_3) | ... | (a_n-1 ^ a_n)) & mask"
> +          to check mutual alignment.  */
> +       if (prev_addr_tmp_name != NULL_TREE)
> +         {
> +           sprintf (tmp_name, "xorptrs%d_%d", i - 1, i);

Very minor, but I think this could technically overflow the tmp_name
buffer.  The worst-case length of this string (including the NUL
terminator) is 9 + 2k, where k is the number of digits in i.  When k = 6,
i.e. if we have > 100,000 DRs, this will overflow the buffer.  Presumably
something else will
fall over / fail vectorization well before we get to having quite so
many DRs, but perhaps we should be more defensive here anyway, e.g. use
snprintf, and/or a bigger buffer?
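
E.g. something like this (untested, just to illustrate):

  snprintf (tmp_name, sizeof (tmp_name), "xorptrs%d_%d", i - 1, i);

snprintf truncates rather than overrunning, and I'd expect a truncated
SSA name prefix to be harmless here.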

Otherwise looks really good to me.  Thanks for working on this.

Alex

> +           xor_tmp_name = make_temp_ssa_name (int_ptrsize_type, NULL,
> +                                              tmp_name);
> +           xor_stmt = gimple_build_assign (xor_tmp_name, BIT_XOR_EXPR,
> +                                           prev_addr_tmp_name,
> +                                           addr_tmp_name);
> +           gimple_seq_add_stmt (cond_expr_stmt_list, xor_stmt);
> +           if (or_tmp_name == NULL_TREE)
> +             {
> +               /* Create the 1st XOR when the 2nd data ref is seen.  */
> +               or_tmp_name = xor_tmp_name;
> +             }
> +           else
> +             {
> +               /* Create: or_tmp = or_tmp | new_xor_tmp.  */
> +               sprintf (tmp_name, "orxors%d", i - 1);
> +               new_or_tmp_name = make_temp_ssa_name (int_ptrsize_type, NULL,
> +                                                     tmp_name);
> +               or_stmt = gimple_build_assign (new_or_tmp_name, BIT_IOR_EXPR,
> +                                              or_tmp_name, xor_tmp_name);
> +               gimple_seq_add_stmt (cond_expr_stmt_list, or_stmt);
> +               or_tmp_name = new_or_tmp_name;
> +             }
> +         }
> +       prev_addr_tmp_name = addr_tmp_name;
> +     }
>        else
> -        or_tmp_name = addr_tmp_name;
> +     {
> +       /* Create: "(a_1 | a_2 | a_3 | ... | a_n) & mask" to check if all
> +          addresses are already aligned.  */
> +       if (or_tmp_name != NULL_TREE)
> +         {
> +           /* Create: or_tmp = or_tmp | addr_tmp.  */
> +           sprintf (tmp_name, "orptrs%d", i);
> +           new_or_tmp_name = make_temp_ssa_name (int_ptrsize_type, NULL,
> +                                                 tmp_name);
> +           or_stmt = gimple_build_assign (new_or_tmp_name, BIT_IOR_EXPR,
> +                                          or_tmp_name, addr_tmp_name);
> +           gimple_seq_add_stmt (cond_expr_stmt_list, or_stmt);
> +           or_tmp_name = new_or_tmp_name;
> +         }
> +       else
> +         or_tmp_name = addr_tmp_name;
> +     }
>  
>      } /* end for i */
>  
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 8c5761d3c55..76c17e75f2e 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -1069,6 +1069,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>      using_decrementing_iv_p (false),
>      using_select_vl_p (false),
>      epil_using_partial_vectors_p (false),
> +    allow_mutual_alignment (false),
>      partial_load_store_bias (0),
>      peeling_for_gaps (false),
>      peeling_for_niter (false),
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 7aa2b02b63c..6ccafaf18e6 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -56,6 +56,14 @@ enum dr_alignment_support {
>    dr_aligned
>  };
>  
> +/* Define type of peeling support to indicate how peeling for alignment can 
> help
> +   make vectorization supported.  */
> +enum peeling_support {
> +  peeling_known_supported,
> +  peeling_maybe_supported,
> +  peeling_unsupported
> +};
> +
>  /* Define type of def-use cross-iteration cycle.  */
>  enum vect_def_type {
>    vect_uninitialized_def = 0,
> @@ -946,6 +954,13 @@ public:
>       epilogue of loop.  */
>    bool epil_using_partial_vectors_p;
>  
> +  /* True if we've decided to use peeling with versioning together, which allows
> +     unaligned unsupported data refs to be uniformly aligned after a certain
> +     amount of peeling (mutual alignment).  Otherwise, we use versioning 
> alone
> +     so these data refs must be already aligned to a power-of-two boundary
> +     without peeling.  */
> +  bool allow_mutual_alignment;
> +
>    /* The bias for len_load and len_store.  For now, only 0 and -1 are
>       supported.  -1 must be used when a backend does not support
>       len_load/len_store with a length of zero.  */
> @@ -1070,6 +1085,7 @@ public:
>  #define LOOP_VINFO_USING_SELECT_VL_P(L) (L)->using_select_vl_p
>  #define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                           \
>    (L)->epil_using_partial_vectors_p
> +#define LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT(L) (L)->allow_mutual_alignment
>  #define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias
>  #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
>  #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
> -- 
> 2.43.0
> 
