On Wed, 14 Jun 2023, Richard Sandiford wrote:

> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Currently vect_determine_partial_vectors_and_peeling will decide
> > to apply full masking to the main loop despite
> > --param vect-partial-vector-usage=1 when the currently analyzed
> > vector mode results in a vectorization factor that's bigger
> > than the number of scalar iterations.  That's undesirable for
> > targets where a vector mode can handle both partial vector and
> > non-partial vector vectorization.  I understand that for AARCH64
> > we have SVE and NEON, but SVE can only do partial vector and
> > NEON only non-partial vector vectorization, plus the target
> > chooses to let cost comparison decide the vector mode to use.
>
> SVE can do both (and does non-partial for things that can't yet be
> predicated, like reversing loads).  But yeah, NEON can only do
> non-partial.
>
> > For x86 and the upcoming AVX512 partial vector support the
> > story is different: the target chooses the first (and largest)
> > vector mode that can successfully be used for vectorization.  But
> > that means with --param vect-partial-vector-usage=1 we will
> > always choose AVX512 with partial vectors for the main loop
> > even if, for example, V4SI would be a perfect fit with full
> > vectors and no required epilog!
>
> Sounds like a good candidate for VECT_COMPARE_COSTS.  Did you
> try using that?
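To make the scenario concrete, here is a minimal example (assuming
32-bit unsigned, early complete unrolling disabled, and an AVX512
target whose mode list tries V16SI first):

  /* 8 scalar iterations: with V16SI and partial vectors the main loop
     becomes a single iteration with half of the lanes masked off; with
     V8SI it is a single full-vector iteration, no masking and no
     epilog required.  */
  void
  set8 (unsigned *a)
  {
    for (int i = 0; i < 8; ++i)
      a[i] = 42;
  }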
Yeah, I didn't try that because we've never done that and I expect
unrelated "effects" ...

> > The following tries to find the appropriate condition for
> > this - I suppose simply refusing to set
> > LOOP_VINFO_USING_PARTIAL_VECTORS_P on the main loop when
> > --param vect-partial-vector-usage=1 will hurt AARCH64?
>
> Yeah, I'd expect so.
>
> > Incidentally, looking up the docs for vect-partial-vector-usage
> > suggests that it's not supposed to control epilog vectorization
> > but instead
> > "1 allows partial vector loads and stores if vectorization removes
> > the need for the code to iterate".  That's probably OK in the end,
> > but if there's a fixed-size vector mode that allows the same thing
> > without using masking, that would be better.
> >
> > I wonder if we should special-case known niter (bounds) somehow
> > when analyzing the vector modes and override the target's sorting?
> >
> > Maybe we want a new --param in addition to vect-epilogues-nomask
> > and vect-partial-vector-usage to say we want masked epilogues?
> >
> > 	* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
> > 	For non-VLA vectorization interpret param_vect_partial_vector_usage
> > 	== 1 as only applying to epilogues.
> > ---
> >  gcc/tree-vect-loop.cc | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 9be66b8fbc5..9323aa572d4 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2478,7 +2478,15 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
> >  	       && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >  	       && !vect_known_niters_smaller_than_vf (loop_vinfo))
> >  	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> > -      else
> > +      /* Avoid using a large fixed size vectorization mode with masking
> > +	 for the main loop when we were asked to only use masking for
> > +	 the epilog.
> > +	 ???  Ideally we'd start analysis with a better sized mode,
> > +	 the param_vect_partial_vector_usage == 2 case suffers from
> > +	 this as well.  But there's a catch-22.  */
> > +      else if (!(!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > +		 && param_vect_partial_vector_usage == 1
> > +		 && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
>
> I don't think is_constant is a good thing to test here.  The way things
> work for SVE is essentially the same for VL-agnostic and VL-specific.
>
> Also, I think this hard-codes the assumption that the smallest mode
> isn't maskable.  Wouldn't it spuriously fail vectorisation if there
> was no smaller mode available?
>
> Similarly, it looks like it could fail for AVX512 if processing 511
> characters, whereas I'd have expected AVX512 to still be the right
> choice there.

Possibly, yes.

> If VECT_COMPARE_COSTS seems too expensive, we could try to look
> for cases where a vector mode later in the list gives a VF that
> is exactly equal to the number of scalar iterations.  (Exactly
> *divides* the number of scalar iterations would be less clear-cut IMO.)
>
> But converting a vector mode into a VF isn't trivial with our
> current vectoriser structures, so I'm not sure how much of a
> saving it would be over VECT_COMPARE_COSTS.  And it would be much
> more special-purpose than VECT_COMPARE_COSTS.

It occurred to me that if NITER is constant, or we have a constant
bound on it, we could set max_vf to the next higher power of two (and
min_vf to the next lower power of two?) when doing the "autodetect"
run.
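The clamping could look roughly like this (a standalone sketch; the
names are illustrative, not existing GCC interfaces):

  /* Clamp the candidate vectorization factor range for a loop with a
     known iteration count to the powers of two surrounding that count,
     e.g. niter == 4 gives [4, 4] and niter == 100 gives [64, 128].  */

  static unsigned
  next_pow2 (unsigned x)
  {
    unsigned p = 1;
    while (p < x)
      p <<= 1;
    return p;
  }

  static void
  clamp_vf_to_niter (unsigned known_niter, unsigned *min_vf,
		     unsigned *max_vf)
  {
    unsigned hi = next_pow2 (known_niter);	    /* next higher power of two */
    unsigned lo = hi > known_niter ? hi / 2 : hi;   /* next lower power of two */
    if (*max_vf > hi)
      *max_vf = hi;
    if (*min_vf < lo)
      *min_vf = lo;
  }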
Unfortunately we still select the "bad" mode then; we simply fail that
analysis and try the next mode, which is still too big but no longer
"autodetect", and end up with AVX2 instead of the desired SSE2 for a
simple (with cunrolli disabled)

void
foo (unsigned *a)
{
  for (int i = 0; i < 4; ++i)
    a[i] = 42;
}

but as you say it's difficult for get_vectype_for_scalar_type with a
yet-unknown VF to decide on a good mode.  It _could_ use min_vf so we
at least get V4SImode, but with SLP the required vector width could be
larger (supposedly we've already analyzed the maximum size of dataref
groups).  Going through _all_ vector modes with the VF constraint
would be costly as well.

It's difficult to feed enough info into the first
get_vectype_for_scalar_type call since that's done from
vect_analyze_data_refs.  Between that and vect_analyze_data_ref_accesses
there's pattern recog; trying to re-order and delay until after that
might work.

As you say, possibly VECT_COMPARE_COSTS is the way to go here ...

Richard.
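P.S. For reference, the mode selection above can be observed with
something like the following (the -march choice and the exact set of
flags are just one assumed setup):

  gcc -S -O2 -march=znver4 --param vect-partial-vector-usage=1 \
      -fdisable-tree-cunrolli -fdump-tree-vect-details t.c

where the vect dump shows which vector mode each analysis attempt used.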