On 18 March 2019 10:58:53 CET, Richard Sandiford <[email protected]>
wrote:
>This patch fixes a case in which we vectorised something with a
>fully-predicated loop even after the cost model had rejected it.
>E.g. the loop in the testcase has the costs:
>
> Vector inside of loop cost: 27
> Vector prologue cost: 0
> Vector epilogue cost: 0
> Scalar iteration cost: 7
> Scalar outside cost: 6
> Vector outside cost: 0
> prologue iterations: 0
> epilogue iterations: 0
>
>and we can see that the loop executes at most three times, but we
>decided to vectorise it anyway.
>
>(The costs here are equal for three iterations, but the same thing
>happens even when the vector code is strictly more expensive.)
>
>The problem is the handling of "/VF" in:
>
> /* Calculate number of iterations required to make the vector version
> profitable, relative to the loop bodies only. The following condition
> must hold true:
> SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
> where
> SIC = scalar iteration cost, VIC = vector iteration cost,
> VOC = vector outside cost, VF = vectorization factor,
> PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
> SOC = scalar outside cost for run time cost model check. */
>
>We treat the "/VF" as truncating, but for fully-predicated loops, it's
>closer to a ceil division, since fractional iterations are handled by a
>full iteration with some predicate bits set to false.
>
>The easiest fix seemed to be to calculate the minimum number of vector
>iterations first, then use that to calculate the minimum number of
>scalar
>iterations.
>
>Calculating the minimum number of vector iterations might make sense
>for
>unpredicated loops too, since calculating the scalar niters directly
>doesn't take into account the fact that the VIC multiple has to be an
>integer. But the handling of PL_ITERS and EP_ITERS for unpredicated
>loops is a bit hand-wavy anyway, so maybe vagueness here cancels out
>vagueness there?
>
>Either way, changing this for unpredicated loops would be much too
>invasive for stage 4, so the patch keeps it specific to
>fully-predicated
>loops (i.e. SVE) for now. There's no functional change for other
>targets.
>
>Tested on aarch64-linux-gnu with and without SVE, and on
>x86_64-linux-gnu.
>This is a regression introduced by the original cost model patches for
>fully-predicated loops, so OK for GCC 9?
>
>Thanks,
>Richard
>
>
>2019-03-18 Richard Sandiford <[email protected]>
>
>gcc/
> * tree-vect-loop.c (vect_estimate_min_profitable_iters): Fix the
> calculation of the minimum number of scalar iterations for
> fully-predicated loops.
>
>gcc/testsuite/
> * gcc.target/aarch64/sve/cost_model_1.c: New test.
>
>Index: gcc/tree-vect-loop.c
>===================================================================
>--- gcc/tree-vect-loop.c 2019-03-08 18:15:33.668751875 +0000
>+++ gcc/tree-vect-loop.c 2019-03-18 09:55:03.257194326 +0000
>@@ -3600,14 +3600,89 @@ vect_estimate_min_profitable_iters (loop
> /* Calculate number of iterations required to make the vector version
> profitable, relative to the loop bodies only. The following condition
> must hold true:
>- SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
>+ SIC * niters + SOC > VIC * ((niters - NPEEL) / VF) + VOC
> where
> SIC = scalar iteration cost, VIC = vector iteration cost,
> VOC = vector outside cost, VF = vectorization factor,
>- PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
>+ NPEEL = prologue iterations + epilogue iterations,
> SOC = scalar outside cost for run time cost model check. */
>
>- if ((scalar_single_iter_cost * assumed_vf) > (int) vec_inside_cost)
>+ int saving_per_viter = (scalar_single_iter_cost * assumed_vf
>+ - vec_inside_cost);
>+ if (saving_per_viter <= 0)
>+ {
>+ if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
>+ warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
>+ "vectorization did not happen for a simd loop");
>+
>+ if (dump_enabled_p ())
>+ dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>+ "cost model: the vector iteration cost = %d "
>+ "divided by the scalar iteration cost = %d "
>+ "is greater or equal to the vectorization factor = %d"
>+ ".\n",
>+ vec_inside_cost, scalar_single_iter_cost, assumed_vf);
>+ *ret_min_profitable_niters = -1;
>+ *ret_min_profitable_estimate = -1;
>+ return;
>+ }
>+
>+ /* ??? The "if" arm is written to handle all cases; see below for
>what
>+ we would do for !LOOP_VINFO_FULLY_MASKED_P. */
>+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+ {
The condition above seems to contain...
>+ /* Rewriting the condition above in terms of the number of
>+ vector iterations (vniters) rather than the number of
>+ scalar iterations (niters) gives:
>+
>+ SIC * (vniters * VF + NPEEL) + SOC > VIC * vniters + VOC
>+
>+ <==> vniters * (SIC * VF - VIC) > VOC - SIC * NPEEL - SOC
>+
>+ For integer N, X and Y when X > 0:
>+
>+ N * X > Y <==> N >= (Y /[floor] X) + 1. */
>+ int outside_overhead = (vec_outside_cost
>+ - scalar_single_iter_cost * peel_iters_prologue
>+ - scalar_single_iter_cost * peel_iters_epilogue
>+ - scalar_outside_cost);
>+ /* We're only interested in cases that require at least one
>+ vector iteration. */
>+ int min_vec_niters = 1;
>+ if (outside_overhead > 0)
>+ min_vec_niters = outside_overhead / saving_per_viter + 1;
>+
>+ if (dump_enabled_p ())
>+ dump_printf (MSG_NOTE, " Minimum number of vector iterations: %d\n",
>+ min_vec_niters);
>+
>+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+ {
... this identical condition, AFAICS?
So this second conditions else arm should be dead, shouldn't it?
thanks,
>+ /* Now that we know the minimum number of vector iterations,
>+ find the minimum niters for which the scalar cost is larger:
>+
>+ SIC * niters > VIC * vniters + VOC - SOC
>+
>+ We know that the minimum niters is no more than
>+ vniters * VF + NPEEL, but it might be (and often is) less
>+ than that if a partial vector iteration is cheaper than the
>+ equivalent scalar code. */
>+ int threshold = (vec_inside_cost * min_vec_niters
>+ + vec_outside_cost
>+ - scalar_outside_cost);
>+ if (threshold <= 0)
>+ min_profitable_iters = 1;
>+ else
>+ min_profitable_iters = threshold / scalar_single_iter_cost + 1;
>+ }
>+ else
>+ /* Convert the number of vector iterations into a number of
>+ scalar iterations. */
>+ min_profitable_iters = (min_vec_niters * assumed_vf
>+ + peel_iters_prologue
>+ + peel_iters_epilogue);
>+ }
>+ else
> {
> min_profitable_iters = ((vec_outside_cost - scalar_outside_cost)
> * assumed_vf
>@@ -3617,8 +3692,7 @@ vect_estimate_min_profitable_iters (loop
> min_profitable_iters = 0;
> else
> {
>- min_profitable_iters /= ((scalar_single_iter_cost * assumed_vf)
>- - vec_inside_cost);
>+ min_profitable_iters /= saving_per_viter;
>
> if ((scalar_single_iter_cost * assumed_vf * min_profitable_iters)
> <= (((int) vec_inside_cost * min_profitable_iters)
>@@ -3627,24 +3701,6 @@ vect_estimate_min_profitable_iters (loop
> min_profitable_iters++;
> }
> }
>- /* vector version will never be profitable. */
>- else
>- {
>- if (LOOP_VINFO_LOOP (loop_vinfo)->force_vectorize)
>- warning_at (vect_location.get_location_t (), OPT_Wopenmp_simd,
>- "vectorization did not happen for a simd loop");
>-
>- if (dump_enabled_p ())
>- dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>- "cost model: the vector iteration cost = %d "
>- "divided by the scalar iteration cost = %d "
>- "is greater or equal to the vectorization factor = %d"
>- ".\n",
>- vec_inside_cost, scalar_single_iter_cost, assumed_vf);
>- *ret_min_profitable_niters = -1;
>- *ret_min_profitable_estimate = -1;
>- return;
>- }
>
> if (dump_enabled_p ())
> dump_printf (MSG_NOTE,
>@@ -3668,10 +3724,34 @@ vect_estimate_min_profitable_iters (loop
>
> Non-vectorized variant is SIC * niters and it must win over vector
>variant on the expected loop trip count. The following condition must
>hold true:
>- SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC + SOC
>*/
>+ SIC * niters > VIC * ((niters - NPEEL) / VF) + VOC + SOC */
>
> if (vec_outside_cost <= 0)
> min_profitable_estimate = 0;
>+ else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+ {
>+ /* This is a repeat of the code above, but with + SOC rather
>+ than - SOC. */
>+ int outside_overhead = (vec_outside_cost
>+ - scalar_single_iter_cost * peel_iters_prologue
>+ - scalar_single_iter_cost * peel_iters_epilogue
>+ + scalar_outside_cost);
>+ int min_vec_niters = 1;
>+ if (outside_overhead > 0)
>+ min_vec_niters = outside_overhead / saving_per_viter + 1;
>+
>+ if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>+ {
>+ int threshold = (vec_inside_cost * min_vec_niters
>+ + vec_outside_cost
>+ + scalar_outside_cost);
>+ min_profitable_estimate = threshold / scalar_single_iter_cost + 1;
>+ }
>+ else
>+ min_profitable_estimate = (min_vec_niters * assumed_vf
>+ + peel_iters_prologue
>+ + peel_iters_epilogue);
>+ }
> else
> {
> min_profitable_estimate = ((vec_outside_cost + scalar_outside_cost)
>Index: gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c
>===================================================================
>--- /dev/null 2019-03-08 11:40:14.606883727 +0000
>+++ gcc/testsuite/gcc.target/aarch64/sve/cost_model_1.c 2019-03-18
>09:55:03.257194326 +0000
>@@ -0,0 +1,12 @@
>+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */
>+
>+void
>+f (unsigned int *restrict x, unsigned int *restrict y,
>+ unsigned char *restrict z, unsigned int n)
>+{
>+ for (unsigned int i = 0; i < n % 4; ++i)
>+ x[i] = x[i] + y[i] + z[i];
>+}
>+
>+/* { dg-final { scan-tree-dump "not vectorized: estimated iteration
>count too small" vect } } */
>+/* { dg-final { scan-tree-dump "vectorized 0 loops" vect } } */