https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197
--- Comment #2 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 27 Mar 2017, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197
>
> Uroš Bizjak <ubizjak at gmail dot com> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |NEW
>    Last reconfirmed|                            |2017-03-27
>                  CC|                            |jakub at gcc dot gnu.org,
>                    |                            |rguenth at gcc dot gnu.org
>      Ever confirmed|0                           |1
>
> --- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
> For some reason, recently fixed if-conversion (PR79389) does not trigger in
> PGO case. There is still a jump with -O2:
>
>         mulsd   %xmm0, %xmm5
>         mulsd   %xmm2, %xmm2
>         addsd   %xmm2, %xmm5
>         ucomisd %xmm5, %xmm4
>         jb      .L17
> .L16:
>         addl    $1, %ebp
> .L17:
>         addl    $1, %edi
>         cmpl    %edi, %ebx
>         je      .L5
>
> Since this asm corresponds to random operands, the jump can't be predicted:
>
>     for (count=0; count<Num_samples; count++)
>     {
>         double x= Random_nextDouble(R);
>         double y= Random_nextDouble(R);
>
>         if ( x*x + y*y <= 1.0)
>              under_curve ++;
>
>     }
>
> Based on the discussion in PR79389, and the fact that -O2 and -O3 both compile
> to a jump, I suspect that loop splitting cost model should be fine tuned to
> also handle PGO case. Note that
>
> Adding some CCs.

Not sure - loop splitting isn't done here and doing it would remove
the if-conversion opportunity.  I think that if FDO says either the
true or false edge is very likely then not if-converting the loop is best?
Or is a well-predicted conditional move as good as a well-predicted if?
10% missed branches would be more than

/* When branch is predicted to be taken with probability lower than this
   threshold (in percent), then it is considered well predictable. */
DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME,
          "predictable-branch-outcome",
          "Maximal estimated outcome of branch considered predictable.",
          2, 0, 50)

so it shouldn't affect if-conversion...

Are we sure we're not hitting some architectural limitation here? Like disabling the loop stream cache because of size or the CFG? (otoh we have calls in the loop(?)).
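
For reference, here is a minimal sketch of the two shapes being compared: the branchy loop from the testcase and a hand if-converted variant where the comparison result is added directly. The names Rng, rng_next_double, count_branchy and count_branchless are hypothetical, and the LCG is only a stand-in for SciMark's Random_nextDouble(), not its actual generator. With x and y uniform in [0, 1) the condition holds with probability pi/4 (roughly 78.5%), so neither edge comes anywhere near the 2% threshold quoted above. Whether the compiler emits a jump, a setcc/add or a conditional move for either function depends on the target and flags, so this only illustrates the source-level transformation.

```c
#include <stdint.h>

/* Hypothetical stand-in for SciMark's Random_nextDouble(): a 64-bit LCG
   whose top 53 bits are scaled into a double in [0, 1). */
typedef struct { uint64_t state; } Rng;

static double rng_next_double(Rng *r)
{
    r->state = r->state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(r->state >> 11) / 9007199254740992.0; /* divide by 2^53 */
}

/* Branchy form, mirroring the loop quoted from the testcase. */
static int count_branchy(Rng *r, int num_samples)
{
    int under_curve = 0;
    for (int count = 0; count < num_samples; count++) {
        double x = rng_next_double(r);
        double y = rng_next_double(r);
        if (x * x + y * y <= 1.0)
            under_curve++;
    }
    return under_curve;
}

/* Hand if-converted form: the comparison yields 0 or 1, which is added
   unconditionally, so the source has no data-dependent control flow. */
static int count_branchless(Rng *r, int num_samples)
{
    int under_curve = 0;
    for (int count = 0; count < num_samples; count++) {
        double x = rng_next_double(r);
        double y = rng_next_double(r);
        under_curve += (x * x + y * y <= 1.0);
    }
    return under_curve;
}
```

Both functions return the same count for the same seed and sample count, so comparing the generated code and timings for the two (with and without profile feedback) would show whether the if-converted form actually wins on a given machine.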