Jan Hubicka <hubi...@ucw.cz> writes:
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > Hi All,
>> >
>> > The resulting predicate register of a whilelo is not
>> > restricted to the lower half of the predicate register file.
>> >
>> > As such these tests started failing after recent changes
>> > because the whilelo outside the loop is getting assigned p15.
>> 
>> It's the whilelo in the loop for me.  We go from:
>> 
>> .L3:
>>         ld1b    z31.b, p7/z, [x4, x3]
>>         movprfx z30, z31
>>         mul     z30.b, p5/m, z30.b, z29.b
>>         st1b    z30.b, p7, [x4, x3]
>>         mov     p6.b, p7.b
>>         add     x3, x3, x0
>>         whilelo p7.b, w3, w1
>>         b.any   .L3
>> 
>> to:
>> 
>> .L3:
>>         ld1b    z31.b, p7/z, [x3, x2]
>>         movprfx z29, z31
>>         mul     z29.b, p6/m, z29.b, z30.b
>>         st1b    z29.b, p7, [x3, x2]
>>         add     x2, x2, x0
>>         whilelo p15.b, w2, w1
>>         b.any   .L4
>>         [...]
>>         .p2align 2,,3
>> .L4:
>>         mov     p7.b, p15.b
>>         b       .L3
>> 
>> This adds an extra (admittedly unconditional) branch to every non-final
>> vector iteration, which seems unfortunate.  I don't think we'd see
>> p8-p15 otherwise, since the result of the whilelo is used as a
>> governing predicate by the next iteration of the loop.
>> 
>> This happens because the scalar loop is given an 89% chance of iterating.
>> Previously we gave the vector loop an 83.33% chance of iterating, whereas
>> after 061f74c06735e1fa35b910ae we give it a 12% chance.  0.89^16 == 15.50%,
>> so the new probabilities definitely preserve the original probabilities
>> more closely.  But for purely heuristic probabilities like these, I'm
>> not sure we should lean so heavily into the idea that the vector
>> latch is unlikely.
>> 
>> Honza, Richi, any thoughts?  Just wanted to double-check that this
>> was operating as expected before making the tests accept the (arguably)
>> less efficient code.  It looks like the commit was more aimed at fixing
>> the profile counts for the epilogues, rather than the main loop.
>
> You are right that we shold not scale down static profiles in case they
> are artifically flat. It is nice to have actual testcase.
> Old code used to test:
>
>   /* Without profile feedback, loops for which we do not know a better 
> estimate
>      are assumed to roll 10 times.  When we unroll such loop, it appears to
>      roll too little, and it may even seem to be cold.  To avoid this, we
>      ensure that the created loop appears to roll at least 5 times (but at
>      most as many times as before unrolling).  Don't do adjustment if profile
>      feedback is present.  */
>   if (new_est_niter < 5 && !profile_p)
>     {
>       if (est_niter < 5)
>         new_est_niter = est_niter;
>       else 
>         new_est_niter = 5;
>     } 
>
> This is not right when profile feedback is around and also when we
> managed to determine precise #of itrations at branch prediction time and
> did not cap.
>
> So I replaced it iwht the test that adjusted header count is not smaller
> than the preheader edge count.  However this will happily get loop
> iteration count close to 0.
>
> It is bit hard to figure out if profile is realistic:
>
> Sometimes we do
>    profile_status_for_fn (cfun) != PROFILE_READ
> I am trying to get rid of this test.  With LTO or when comdat profile is
> lost we inline together functions with and without profile.
>
> We can test for quality of loop header count to be precise or adjusted.
> However at the time vectorizer is modifying loop profile we already
> adjusted it for the initial conditional for profitability threshold and
> drop it to GUESSED.Even with profile feedback we do not know outcome
> probability of that one (Ondrej Kubanek's histograms will help here).

Ah, yeah, hadn't thought about that.

> So I think we want to check if we have loop iteration estimate recorded
> (that should be true for both profile feedback and loops with known trip
> count) and if so compare it what profile says and it is more or less in
> match consider profile realistic.  This needs to be done before
> vectorizer starts tampering with the loop.
>
> I will try to make patch for that.

Thanks!

Richard

Reply via email to