https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115895
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org,
                   |                            |tnfchris at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
This isn't really about peeling for gaps; rather, the loop mask we compute is
for the lanes as laid out _after_ applying the load permutation of

  load permutation { 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }

so it does not correctly mask the unpermuted load, which under the current
mask accesses elements of the 7 following vector iterations.  The same would
happen for an access without a gap (but spread across more elements).

For the testcase at hand we have group_size == 2 and gap == 1, a case that
gets to

      /* When we have a contiguous access across loop iterations
         but the access in the loop doesn't cover the full vector
         we can end up with no gap recorded but still excess
         elements accessed, see PR103116.  Make sure we peel for
         gaps if necessary and sufficient and give up if not.

         If there is a combination of the access not covering the full
         vector and a gap recorded then we may need to peel twice.  */
      if (loop_vinfo
          && (*memory_access_type == VMAT_CONTIGUOUS
              || *memory_access_type == VMAT_CONTIGUOUS_REVERSE)
          && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
          && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                          nunits))
        overrun_p = true;

but then later we have

              /* But peeling a single scalar iteration is enough if
                 we can use the next power-of-two sized partial
                 access and that is sufficiently small to be covered
                 by the single scalar iteration.  */
              unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
              ...

which we do not actually do for masked accesses or for accesses where loop
masking is applied.
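For concreteness, a minimal standalone sketch of the shortening arithmetic
quoted above; all parameter values and the local ceil_log2 helper are
assumptions for illustration only, and the further conditions elided by the
"..." are not modeled:

/* Minimal standalone sketch (not GCC code) of the power-of-two shortening
   arithmetic quoted above.  Values are assumed for illustration and
   ceil_log2 is a local stand-in, not GCC's helper.  */
#include <cstdio>

static unsigned
ceil_log2 (unsigned x)
{
  unsigned l = 0;
  while ((1u << l) < x)
    ++l;
  return l;
}

int
main ()
{
  /* Assumed example parameters, not taken from the PR's testcase.  */
  unsigned group_size = 3, gap = 1;
  unsigned cnunits = 8;     /* constant number of vector lanes */
  unsigned cvf = 4;         /* constant vectorization factor */

  /* Elements of the trailing vector access that are actually needed ...  */
  unsigned cremain = (group_size * cvf - gap) % cnunits;
  /* ... rounded up to the next power-of-two sized partial access.  */
  unsigned cpart_size = 1u << ceil_log2 (cremain);

  printf ("cremain = %u, cpart_size = %u (nunits = %u)\n",
          cremain, cpart_size, cnunits);
  /* Here cremain == 3 and cpart_size == 4, i.e. a half-sized access would
     cover the remainder; the real code additionally asks the target for a
     suitable half vector type.  */
  return 0;
}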
But we have

                  /* If all fails we can still resort to niter masking, so
                     enforce the use of partial vectors.  */
                  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
                    {
                      if (dump_enabled_p ())
                        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                         "peeling for gaps insufficient for "
                                         "access unless using partial "
                                         "vectors\n");
                      LOOP_VINFO_MUST_USE_PARTIAL_VECTORS_P (loop_vinfo) = true;

but, as said above, this isn't valid when the load vector used is bigger than
the group.  This was added for the VLA case, where the guarding arithmetic
doesn't work.

So in the end we probably want to extend vector load shortening to mask loads
(and loop masked loads), but for now it seems something like the following is
needed.  I think the very same issue should exist with VLA vectors; having
coverage would be nice.

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 0c0f999d3e3..b5dd1a2e40f 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2216,13 +2216,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 
          If there is a combination of the access not covering the full
          vector and a gap recorded then we may need to peel twice.  */
+      bool large_vector_overrun_p = false;
       if (loop_vinfo
           && (*memory_access_type == VMAT_CONTIGUOUS
               || *memory_access_type == VMAT_CONTIGUOUS_REVERSE)
           && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
           && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                           nunits))
-        overrun_p = true;
+        large_vector_overrun_p = overrun_p = true;
 
       /* If the gap splits the vector in half and the target
          can do half-vector operations avoid the epilogue peeling
@@ -2273,7 +2274,8 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                  access and that is sufficiently small to be covered
                  by the single scalar iteration.  */
               unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
-              if (!nunits.is_constant (&cnunits)
+              if (masked_p
+                  || !nunits.is_constant (&cnunits)
                   || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
                   || (((cremain = (group_size * cvf - gap) % cnunits), true)
                       && ((cpart_size = (1 << ceil_log2 (cremain))), true)
@@ -2282,9 +2284,11 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                           (vectype, cnunits / cpart_size, &half_vtype)
                               == NULL_TREE)))
                 {
-                  /* If all fails we can still resort to niter masking, so
-                     enforce the use of partial vectors.  */
-                  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+                  /* If all fails we can still resort to niter masking unless
+                     the vectors used are too big, so enforce the use of
+                     partial vectors.  */
+                  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+                      && !large_vector_overrun_p)
                     {
                       if (dump_enabled_p ())
                         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -2302,6 +2306,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
                       return false;
                     }
                 }
+              else if (large_vector_overrun_p)
+                {
+                  if (dump_enabled_p ())
+                    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                                     "can't operate on partial vectors because "
+                                     "only unmasked loads handle access "
+                                     "shortening required because of gaps at "
+                                     "the end of the access\n");
+                  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+                }
             }
         }
     }
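For reference, a hedged sketch (not the PR's actual testcase) of a loop shape
that yields a load group with group_size == 2 and gap == 1, i.e. only the
first element of each pair is read:

/* Hedged sketch, not the PR's testcase: only a[2 * i] of each
   { a[2 * i], a[2 * i + 1] } pair is loaded, giving a load group of
   size 2 with a gap of 1 at the end of the group.  */
int
sum_even (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[2 * i];
  return s;
}

Whether this exact shape reproduces the issue depends on the target and the
vector modes chosen; it is only meant to illustrate the group/gap terminology
used above.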