https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90128

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ugh.  Cactus is really ugly code :/  For one, there's a loop-invariant
switch () in the innermost loop, expanded to a binary tree of compares (with a
slightly different split point in GCC 8 vs. trunk); obviously unswitching
cannot handle this.  This is a general missed optimization precluding any
vectorization attempt here.  Then we spill like crazy because of the way the
code is written.  Other than that I don't see anything obvious here.  It might
be that trunk:

    5802:       83 fb 06                cmp    $0x6,%ebx
    5805:       0f 84 25 84 00 00       je     dc30
<_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xdc30>
    580b:       0f 8f cf 1d 00 00       jg     75e0
<_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x75e0>
    5811:       83 fb 02                cmp    $0x2,%ebx
    5814:       0f 85 06 c0 ff ff       jne    1820
<_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x1820>

is worse for the branch predictor than the GCC 8 version

    89ee:       0f 84 bc 64 00 00       je     eeb0
<_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xeeb0>
    89f4:       0f 8e 96 45 00 00       jle    cf90
<_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xcf90>
    89fa:       8b b4 24 a8 08 00 00    mov    0x8a8(%rsp),%esi
    8a01:       83 fe 06                cmp    $0x6,%esi
    8a04:       0f 85 e6 8e ff ff       jne    18f0
<_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x18f0>

(notice the "padding" reload).  That is probably going to depend on the final
code layout again, of course.  I recall reading that a third conditional jump
in a fetch word requires an additional branch predictor slot or so.

So it would be interesting to see if the branch misses accumulate on
that binary tree generated from the loop invariant switch where in
theory those should be all totally predictable.
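
For readers unfamiliar with the pattern, here is a minimal sketch of what a
loop-invariant switch in an innermost loop looks like, next to a hand-unswitched
variant.  This is not the Cactus source: the function names, the `order`
parameter, and the three stencils are hypothetical and only illustrate the
shape of the problem.

```c
#include <stddef.h>

/* Hypothetical kernel: `order` never changes inside the loop, yet the
 * switch is evaluated on every iteration.  A compiler may lower it to a
 * small tree of compare-and-branch instructions inside the loop body. */
void stencil_switch_inside(double *out, const double *in, size_t n, int order)
{
    for (size_t i = 1; i + 1 < n; i++) {
        switch (order) {
        case 2:  out[i] = 0.5 * (in[i + 1] - in[i - 1]); break;
        case 4:  out[i] = in[i + 1] - in[i];             break;
        case 6:  out[i] = in[i] - in[i - 1];             break;
        default: out[i] = 0.0;                           break;
        }
    }
}

/* Hand-unswitched variant: the switch is decided once, and each case
 * gets its own branch-free loop body, which is easier to vectorize. */
void stencil_unswitched(double *out, const double *in, size_t n, int order)
{
    switch (order) {
    case 2:
        for (size_t i = 1; i + 1 < n; i++)
            out[i] = 0.5 * (in[i + 1] - in[i - 1]);
        break;
    case 4:
        for (size_t i = 1; i + 1 < n; i++)
            out[i] = in[i + 1] - in[i];
        break;
    case 6:
        for (size_t i = 1; i + 1 < n; i++)
            out[i] = in[i] - in[i - 1];
        break;
    default:
        for (size_t i = 1; i + 1 < n; i++)
            out[i] = 0.0;
        break;
    }
}
```

Both functions compute the same values; the second just duplicates the loop
per case, which is roughly what unswitching would have to do for a
multi-way switch.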

That said, I'm not yet able to reproduce the slowdown but will try.
