On Mon, Mar 18, 2019 at 10:59 AM Richard Sandiford <richard.sandif...@arm.com> wrote: > > This patch fixes a case in which we vectorised something with a > fully-predicated loop even after the cost model had rejected it. > E.g. the loop in the testcase has the costs: > > Vector inside of loop cost: 27 > Vector prologue cost: 0 > Vector epilogue cost: 0 > Scalar iteration cost: 7 > Scalar outside cost: 6 > Vector outside cost: 0 > prologue iterations: 0 > epilogue iterations: 0 > > and we can see that the loop executes at most three times, but we > decided to vectorise it anyway. > > (The costs here are equal for three iterations, but the same thing > happens even when the vector code is strictly more expensive.) > > The problem is the handling of "/VF" in: > > /* Calculate number of iterations required to make the vector version > profitable, relative to the loop bodies only. The following condition > must hold true: > SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > where > SIC = scalar iteration cost, VIC = vector iteration cost, > VOC = vector outside cost, VF = vectorization factor, > PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations > SOC = scalar outside cost for run time cost model check. */ > > We treat the "/VF" as truncating, but for fully-predicated loops, it's > closer to a ceil division, since fractional iterations are handled by a > full iteration with some predicate bits set to false. > > The easiest fix seemed to be to calculate the minimum number of vector > iterations first, then use that to calculate the minimum number of scalar > iterations. > > Calculating the minimum number of vector iterations might make sense for > unpredicated loops too, since calculating the scalar niters directly > doesn't take into account the fact that the VIC multiple has to be an > integer. But the handling of PL_ITERS and EP_ITERS for unpredicated > loops is a bit hand-wavy anyway, so maybe vagueness here cancels out > vagueness there?
Well, their estimate if we do not know them statically is. If we statically know pl/ep iters the formulat should match, no? Hopefully for GCC 10 we can even fix the case in PR89754. > Either way, changing this for unpredicated loops would be much too > invasive for stage 4, so the patch keeps it specific to fully-predicated > loops (i.e. SVE) for now. There's no functional change for other targets. > > Tested on aarch64-linux-gnu with and without SVE, and on x86_64-linux-gnu. > This is a regression introduced by the original cost model patches for > fully-predicated loops, so OK for GCC 9? OK. Thanks, Richard. > Thanks, > Richard > > > 2019-03-18 Richard Sandiford <richard.sandif...@arm.com> > > gcc/ > * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the > calculation of the minimum number of scalar iterations for > fully-predicated loops. > > gcc/testsuite/ > * gcc.target/aarch64/sve/cost_model_1.c: New test. > > Index: gcc/tree-vect-loop.c > =================================================================== > --- gcc/tree-vect-loop.c 2019-03-08 18:15:33.668751875 +0000 > +++ gcc/tree-vect-loop.c 2019-03-18 09:55:03.257194326 +0000 > @@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop > /* Calculate number of iterations required to make the vector version > profitable, relative to the loop bodies only. The following condition > must hold true: > - SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC > + SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC > where > SIC = scalar iteration cost, VIC = vector iteration cost, > VOC = vector outside cost, VF = vectorization factor, > - PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations > + NPEEL = prologue iterations + epilogue iterations, > SOC = scalar outside cost for run time cost model check. */ > > - if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost) > + int saving_per_viter = (scalar_single_iter_cost * assumed_vf > + - vec_inside_cost); > + if (saving_per_viter <= 0) > + { > + if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) > + warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, > + "vectorization did not happen for a simd loop"); > + > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "cost model: the vector iteration cost = %d " > + "divided by the scalar iteration cost = %d " > + "is greater or equal to the vectorization factor = > %d" > + ".\n", > + vec_inside_cost, scalar_single_iter_cost, > assumed_vf); > + *ret_min_profitable_niters = -1; > + *ret_min_profitable_estimate = -1; > + return; > + } > + > + /* ??? The "if" arm is written to handle all cases; see below for what > + we would do for !LOOP_VINFO_FULLY_MASKED_P. */ > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + /* Rewriting the condition above in terms of the number of > + vector iterations (vniters) rather than the number of > + scalar iterations (niters) gives: > + > + SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC > + > + <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC > + > + For integer N, X and Y when X > 0: > + > + N * X > Y <==> N >= (Y /[floor] X) + 1. */ > + int outside_overhead = (vec_outside_cost > + - scalar_single_iter_cost * peel_iters_prologue > + - scalar_single_iter_cost * peel_iters_epilogue > + - scalar_outside_cost); > + /* We're only interested in cases that require at least one > + vector iteration. */ > + int min_vec_niters = 1; > + if (outside_overhead > 0) > + min_vec_niters = outside_overhead / saving_per_viter + 1; > + > + if (dump_enabled_p ()) > + dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n", > + min_vec_niters); > + > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + /* Now that we know the minimum number of vector iterations, > + find the minimum niters for which the scalar cost is larger: > + > + SIC * niters > VIC * vniters + VOC - SOC > + > + We know that the minimum niters is no more than > + vniters * VF + NPEEL, but it might be (and often is) less > + than that if a partial vector iteration is cheaper than the > + equivalent scalar code. */ > + int threshold = (vec_inside_cost * min_vec_niters > + + vec_outside_cost > + - scalar_outside_cost); > + if (threshold <= 0) > + min_profitable_iters = 1; > + else > + min_profitable_iters = threshold / scalar_single_iter_cost + 1; > + } > + else > + /* Convert the number of vector iterations into a number of > + scalar iterations. */ > + min_profitable_iters = (min_vec_niters * assumed_vf > + + peel_iters_prologue > + + peel_iters_epilogue); > + } > + else > { > min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) > * assumed_vf > @@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop > min_profitable_iters = 0; > else > { > - min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf) > - - vec_inside_cost); > + min_profitable_iters /= saving_per_viter; > > if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters) > <= (((int) vec_inside_cost * min_profitable_iters) > @@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop > min_profitable_iters++; > } > } > - /* vector version will never be profitable. */ > - else > - { > - if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize) > - warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd, > - "vectorization did not happen for a simd loop"); > - > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > - "cost model: the vector iteration cost = %d " > - "divided by the scalar iteration cost = %d " > - "is greater or equal to the vectorization factor = > %d" > - ".\n", > - vec_inside_cost, scalar_single_iter_cost, > assumed_vf); > - *ret_min_profitable_niters = -1; > - *ret_min_profitable_estimate = -1; > - return; > - } > > if (dump_enabled_p ()) > dump_printf (MSG_NOTE, > @@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop > > Non-vectorized variant is SIC * niters and it must win over vector > variant on the expected loop trip count. The following condition must > hold true: > - SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC */ > + SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC */ > > if (vec_outside_cost <= 0) > min_profitable_estimate = 0; > + else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + /* This is a repeat of the code above, but with + SOC rather > + than - SOC. */ > + int outside_overhead = (vec_outside_cost > + - scalar_single_iter_cost * peel_iters_prologue > + - scalar_single_iter_cost * peel_iters_epilogue > + + scalar_outside_cost); > + int min_vec_niters = 1; > + if (outside_overhead > 0) > + min_vec_niters = outside_overhead / saving_per_viter + 1; > + > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + { > + int threshold = (vec_inside_cost * min_vec_niters > + + vec_outside_cost > + + scalar_outside_cost); > + min_profitable_estimate = threshold / scalar_single_iter_cost + 1; > + } > + else > + min_profitable_estimate = (min_vec_niters * assumed_vf > + + peel_iters_prologue > + + peel_iters_epilogue); > + } > else > { > min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost) > Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c > =================================================================== > --- /dev/null 2019-03-08 11:40:14.606883727 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c 2019-03-18 > 09:55:03.257194326 +0000 > @@ -0,0 +1,12 @@ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ > + > +void > +f (unsigned int *restrict x, unsigned int *restrict y, > + unsigned char *restrict z, unsigned int n) > +{ > + for (unsigned int i = 0; i < n % 4; ++i) > + x[i] = x[i] + y[i] + z[i]; > +} > + > +/* { dg-final { scan-tree-dump "not vectorized: estimated iteration count > too small" vect } } */ > +/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */