Richard Biener <rguent...@suse.de> writes: > On Mon, 18 Sep 2017, Richard Sandiford wrote: >> Richard Biener <rguent...@suse.de> writes: >> > The following is said to fix a 482.sphinx3 regression. >> > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. >> > >> > Richard. >> > >> > 2017-09-18 Richard Biener <rguent...@suse.de> >> > >> > PR tree-optimization/82220 >> > * tree-vect-loop.c (vect_estimate_min_profitable_iters): Exclude >> > epilogue niters from the min_profitable_iters compute. >> > >> > Index: gcc/tree-vect-loop.c >> > =================================================================== >> > --- gcc/tree-vect-loop.c (revision 252907) >> > +++ gcc/tree-vect-loop.c (working copy) >> > @@ -3663,8 +3663,8 @@ vect_estimate_min_profitable_iters (loop >> > min_profitable_iters); >> > >> > /* We want the vectorized loop to execute at least once. */ >> > - if (min_profitable_iters < (vf + peel_iters_prologue + >> > peel_iters_epilogue)) >> > - min_profitable_iters = vf + peel_iters_prologue + peel_iters_epilogue; >> > + if (min_profitable_iters < (vf + peel_iters_prologue)) >> > + min_profitable_iters = vf + peel_iters_prologue; >> >> Maybe we should still add 1 when peeling for gaps? > > You mean add vf? Yes, I guess so.
I think 1 is enough, since peeling for gaps forces one final scalar iteration to be done by the epilogue. You only get vf iterations being done by the epilogue if no iterations would have been done otherwise. E.g. for vf=4 and no prologue peeling, a scalar iteration count of 5 would be 1 vector iteration and 1 epilogue iteration even with peeling for gaps. A scalar iteration count of 8 would be 1 vector iteration and 4 epilogue iterations. But a scalar iteration count of 4 would not use the vector loop at all. In that case 5 seems like the right minimum. >> Even adding the prologue count seems a bit weird if we've guessed it to >> be vf/2. Wouldn't it be more profitable to vectorise with an iteration >> count of 1 vector iteration + 1 peeled iteration than 1 vector iteration >> + vf-1 peeled iterations, at least in percentage terms? Was just wondering >> if we should only add peel_iters_prologue when npeels > 0. > > Well, I specifically added the code to compensate for the case where > we only guess the count to vf/2. I tried to make us reliably avoid > entering the vector path for bwaves -- which of course only works > with versioning -- where we have a runtime iteration count of 5 but > still end up peeling for alignment. > > If we exactly know the iterations of the prologue we avoid peeling > for alignment already. Ah, OK. For bwaves, did we peel more than one iteration for alignment and so skip the vector loop entirely? Or did we still use the vector loop, but end up with something that was still slower because of the extra overhead? > So yes, I made the guessed case err on the scalar side (which regressed > sphinx). Before my patch it erred on the vector side by estimating > the prologue end epilogue iters as 0. I suppose we should split the > handling to correctly handle the non-guessed case and have some > explicit code doing sth for the guessed one. > > OTOH we might want to make the cost model check combined with the > prologue peeling properly handle the prologue/epilogue iterations. > I didn't check whether we do that (but IIRC the execution flow is > slightly sub-optimal here). Something like adding the number of peels to the threshold at runtime? Yeah, that sounds like it might be better. One problem I remember hitting was that, even if we do calculate a reasonable profitability threshold, we'd do the check in the same block as the versioning check, and so pay the full penalty of the versioning check regardless. I.e. it's "unprofitable_p | alias_p" rather than "unprofitable_p || alias_p". Not sure whether that's a different problem from bwaves though. Thanks, Richard