Jan Hubicka <hubi...@ucw.cz> writes: >> Tamar Christina <tamar.christ...@arm.com> writes: >> > Hi All, >> > >> > The resulting predicate register of a whilelo is not >> > restricted to the lower half of the predicate register file. >> > >> > As such these tests started failing after recent changes >> > because the whilelo outside the loop is getting assigned p15. >> >> It's the whilelo in the loop for me. We go from: >> >> .L3: >> ld1b z31.b, p7/z, [x4, x3] >> movprfx z30, z31 >> mul z30.b, p5/m, z30.b, z29.b >> st1b z30.b, p7, [x4, x3] >> mov p6.b, p7.b >> add x3, x3, x0 >> whilelo p7.b, w3, w1 >> b.any .L3 >> >> to: >> >> .L3: >> ld1b z31.b, p7/z, [x3, x2] >> movprfx z29, z31 >> mul z29.b, p6/m, z29.b, z30.b >> st1b z29.b, p7, [x3, x2] >> add x2, x2, x0 >> whilelo p15.b, w2, w1 >> b.any .L4 >> [...] >> .p2align 2,,3 >> .L4: >> mov p7.b, p15.b >> b .L3 >> >> This adds an extra (admittedly unconditional) branch to every non-final >> vector iteration, which seems unfortunate. I don't think we'd see >> p8-p15 otherwise, since the result of the whilelo is used as a >> governing predicate by the next iteration of the loop. >> >> This happens because the scalar loop is given an 89% chance of iterating. >> Previously we gave the vector loop an 83.33% chance of iterating, whereas >> after 061f74c06735e1fa35b910ae we give it a 12% chance. 0.89^16 == 15.50%, >> so the new probabilities definitely preserve the original probabilities >> more closely. But for purely heuristic probabilities like these, I'm >> not sure we should lean so heavily into the idea that the vector >> latch is unlikely. >> >> Honza, Richi, any thoughts? Just wanted to double-check that this >> was operating as expected before making the tests accept the (arguably) >> less efficient code. It looks like the commit was more aimed at fixing >> the profile counts for the epilogues, rather than the main loop. > > You are right that we shold not scale down static profiles in case they > are artifically flat. It is nice to have actual testcase. > Old code used to test: > > /* Without profile feedback, loops for which we do not know a better > estimate > are assumed to roll 10 times. When we unroll such loop, it appears to > roll too little, and it may even seem to be cold. To avoid this, we > ensure that the created loop appears to roll at least 5 times (but at > most as many times as before unrolling). Don't do adjustment if profile > feedback is present. */ > if (new_est_niter < 5 && !profile_p) > { > if (est_niter < 5) > new_est_niter = est_niter; > else > new_est_niter = 5; > } > > This is not right when profile feedback is around and also when we > managed to determine precise #of itrations at branch prediction time and > did not cap. > > So I replaced it iwht the test that adjusted header count is not smaller > than the preheader edge count. However this will happily get loop > iteration count close to 0. > > It is bit hard to figure out if profile is realistic: > > Sometimes we do > profile_status_for_fn (cfun) != PROFILE_READ > I am trying to get rid of this test. With LTO or when comdat profile is > lost we inline together functions with and without profile. > > We can test for quality of loop header count to be precise or adjusted. > However at the time vectorizer is modifying loop profile we already > adjusted it for the initial conditional for profitability threshold and > drop it to GUESSED.Even with profile feedback we do not know outcome > probability of that one (Ondrej Kubanek's histograms will help here).
Ah, yeah, hadn't thought about that. > So I think we want to check if we have loop iteration estimate recorded > (that should be true for both profile feedback and loops with known trip > count) and if so compare it what profile says and it is more or less in > match consider profile realistic. This needs to be done before > vectorizer starts tampering with the loop. > > I will try to make patch for that. Thanks! Richard