https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115895

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org,
                   |                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
This isn't really about peeling for gaps; the problem is that the loop mask
we compute is for the lanes as laid out _after_ applying the load permutation
of

       load permutation { 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }

so it does not correctly mask the unpermuted load, which under the current
mask accesses elements of the 7 following vector iterations.
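
To make the access shape concrete, a loop like the following (a minimal
sketch, not necessarily this PR's testcase) yields such an all-zero load
permutation: only the even elements of the array are consumed, so each
two-element group contributes one loaded lane and one gap.

  /* Hypothetical sketch of the access shape under discussion: a stride-2
     read in which only the even elements are used.  The vectorizer would
     typically see a single-element interleaving group with group_size == 2
     and gap == 1; a contiguous vector load reads both elements of every
     group and the load permutation then picks element 0 out of each
     group.  */
  void
  f (int *dst, int *a, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = a[2 * i];
  }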

So the same would happen for an access without a gap (but spread to more
elements).  For the testcase at hand we have group_size == 2 and gap == 1,
a case that gets to

          /* When we have a contiguous access across loop iterations
             but the access in the loop doesn't cover the full vector
             we can end up with no gap recorded but still excess
             elements accessed, see PR103116.  Make sure we peel for
             gaps if necessary and sufficient and give up if not.

             If there is a combination of the access not covering the full
             vector and a gap recorded then we may need to peel twice.  */
          if (loop_vinfo
              && (*memory_access_type == VMAT_CONTIGUOUS
                  || *memory_access_type == VMAT_CONTIGUOUS_REVERSE)
              && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
              && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                              nunits))
            overrun_p = true;

but then later we have

              /* But peeling a single scalar iteration is enough if
                 we can use the next power-of-two sized partial
                 access and that is sufficiently small to be covered
                 by the single scalar iteration.  */
              unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
...
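
Modelled outside of the vectorizer, the arithmetic behind that shortening
check looks roughly like the standalone sketch below.  The concrete values
for group_size, gap, VF and nunits are made up for illustration; the real
code works on poly_ints and additionally has to find a suitable smaller
vector type (the &half_vtype lookup visible in the quoted condition).

  /* Standalone model (not vectorizer code) of the overrun guard and the
     "shorten the last access to the next power of two" reasoning.  All
     concrete values are assumptions for illustration only.  */
  #include <cstdio>

  static unsigned
  ceil_log2_u (unsigned x)
  {
    unsigned l = 0;
    while ((1u << l) < x)
      l++;
    return l;
  }

  int
  main ()
  {
    unsigned group_size = 2; /* elements per group                    */
    unsigned gap = 1;        /* unused trailing elements per group    */
    unsigned cvf = 4;        /* assumed constant vectorization factor */
    unsigned cnunits = 16;   /* assumed lanes in the load vector      */

    /* The earlier guard: the contiguous load overruns when
       group_size * VF is not a multiple of the vector lane count.  */
    bool overrun_p = (group_size * cvf) % cnunits != 0;

    /* Elements of the last vector that are actually needed ...  */
    unsigned cremain = (group_size * cvf - gap) % cnunits;
    /* ... rounded up to a power-of-two sized partial access.  */
    unsigned cpart_size = 1u << ceil_log2_u (cremain);

    printf ("overrun_p=%d cremain=%u cpart_size=%u (cnunits=%u)\n",
            overrun_p, cremain, cpart_size, cnunits);

    /* Peeling a single scalar iteration is enough only if a load of
       cpart_size elements is available (cnunits / cpart_size pieces of
       the vector type) and its excess over cremain is covered by that
       one peeled iteration; otherwise we fall back to niter masking or
       give up.  */
    return 0;
  }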

This shortening we actually do not do for masked accesses or for accesses
where loop masking is applied.  But we have

                  /* If all fails we can still resort to niter masking, so
                     enforce the use of partial vectors.  */
                  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
                    {
                      if (dump_enabled_p ())
                        dump_printf_loc (MSG_MISSED_OPTIMIZATION,
                                         vect_location,
                                         "peeling for gaps insufficient for "
                                         "access unless using partial "
                                         "vectors\n");
                        LOOP_VINFO_MUST_USE_PARTIAL_VECTORS_P (loop_vinfo) = true;

but this, as said above, isn't valid when the load vector used is bigger than
the group.  This was added for the VLA case, where the guarding arithmetic
doesn't work.

So in the end we probably want to extend vector load shortening to
mask loads (and loop masked loads), but for now it seems something like the
following is needed.

I think the very same issue should exist with VLA vectors; having test
coverage for that would be nice.
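
Purely as a sketch of what such coverage might look like (the exact loop
shape, target selectors and flags would need verifying), the same access
pattern compiled for a scalable-vector target:

  /* Hypothetical VLA coverage sketch: the same stride-2 access, intended
     for e.g. aarch64 with -O3 -march=armv8-a+sve so that scalable vectors
     and loop masking via SVE predication are in play.  Whether this exact
     loop reproduces the problem on a VLA target is not verified here.  */
  void
  g (int *dst, int *a, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = a[2 * i];
  }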


diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 0c0f999d3e3..b5dd1a2e40f 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2216,13 +2216,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,

             If there is a combination of the access not covering the full
             vector and a gap recorded then we may need to peel twice.  */
+         bool large_vector_overrun_p = false;
          if (loop_vinfo
              && (*memory_access_type == VMAT_CONTIGUOUS
                  || *memory_access_type == VMAT_CONTIGUOUS_REVERSE)
              && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
              && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                              nunits))
-           overrun_p = true;
+           large_vector_overrun_p = overrun_p = true;

          /* If the gap splits the vector in half and the target
             can do half-vector operations avoid the epilogue peeling
@@ -2273,7 +2274,8 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                 access and that is sufficiently small to be covered
                 by the single scalar iteration.  */
              unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
-             if (!nunits.is_constant (&cnunits)
+             if (masked_p
+                 || !nunits.is_constant (&cnunits)
                  || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
                  || (((cremain = (group_size * cvf - gap) % cnunits), true)
                      && ((cpart_size = (1 << ceil_log2 (cremain))), true)
@@ -2282,9 +2284,11 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                               (vectype, cnunits / cpart_size,
                                &half_vtype) == NULL_TREE)))
                {
-                 /* If all fails we can still resort to niter masking, so
-                    enforce the use of partial vectors.  */
-                 if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+                 /* If all fails we can still resort to niter masking unless
+                    the vectors used are too big, so enforce the use of
+                    partial vectors.  */
+                 if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+                     && !large_vector_overrun_p)
                    {
                      if (dump_enabled_p ())
                        dump_printf_loc (MSG_MISSED_OPTIMIZATION,
                                         vect_location,
@@ -2302,6 +2306,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                      return false;
                    }
                }
+             else if (large_vector_overrun_p)
+               {
+                 if (dump_enabled_p ())
+                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                                    "can't operate on partial vectors because "
+                                    "only unmasked loads handle access "
+                                    "shortening required because of gaps at "
+                                    "the end of the access\n");
+                 LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+               }
            }
        }
     }
