https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90128
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ugh. Cactus is really ugly code :/

For one, there's an invariant switch () in the innermost loop, expanded to a binary tree (slightly different split point in GCC 8 vs. trunk); obviously unswitching cannot handle this. This is a general missed optimization precluding any vectorization attempt here.

Then we spill like hell because of the way the code is written.

Other than that I don't see anything obvious here. It might be that trunk:

    5802:  83 fb 06              cmp    $0x6,%ebx
    5805:  0f 84 25 84 00 00     je     dc30 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xdc30>
    580b:  0f 8f cf 1d 00 00     jg     75e0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x75e0>
    5811:  83 fb 02              cmp    $0x2,%ebx
    5814:  0f 85 06 c0 ff ff     jne    1820 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x1820>

is worse for the branch predictor than the GCC 8 version

    89ee:  0f 84 bc 64 00 00     je     eeb0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xeeb0>
    89f4:  0f 8e 96 45 00 00     jle    cf90 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0xcf90>
    89fa:  8b b4 24 a8 08 00 00  mov    0x8a8(%rsp),%esi
    8a01:  83 fe 06              cmp    $0x6,%esi
    8a04:  0f 85 e6 8e ff ff     jne    18f0 <_ZL19ML_BSSN_Advect_BodyPK4_cGHiiPKdS3_S3_PKiS5_iPKPd+0x18f0>

(notice the "padding" reload). That is probably going to depend on final code layout again, of course. I recall reading that a third conditional jump in a fetch word requires an additional branch predictor slot or so.

So it would be interesting to see if the branch misses accumulate on that binary tree generated from the loop-invariant switch, where in theory those branches should all be totally predictable.

That said, I'm not yet able to reproduce the slowdown but will try.
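
For readers who have not looked at the kernel, the loop-invariant switch pattern described above can be sketched roughly as follows. This is a hypothetical reduction, not the Cactus source: the function names, the stencils and the case values are assumptions made up for illustration (2 and 6 merely echo the constants compared against in the disassembly and carry no meaning here). The second function shows by hand the shape that unswitching would have to produce, with the switch decided once and each loop body left branch-free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel, not taken from Cactus: 'dir' never changes inside
   the loop, yet the switch is evaluated on every iteration. */
void advect_switch_inside(double *out, const double *in, size_t n, int dir)
{
    assert(out != in);  /* this sketch assumes distinct buffers */
    for (size_t i = 1; i + 1 < n; ++i) {
        switch (dir) {
        case 2:  out[i] = in[i + 1] - in[i];             break; /* forward  */
        case 6:  out[i] = in[i] - in[i - 1];             break; /* backward */
        default: out[i] = 0.5 * (in[i + 1] - in[i - 1]); break; /* centered */
        }
    }
}

/* Hand-unswitched variant: the switch is hoisted out of the loop, so each
   loop body is straight-line code with no branch on 'dir'. */
void advect_unswitched(double *out, const double *in, size_t n, int dir)
{
    assert(out != in);  /* this sketch assumes distinct buffers */
    switch (dir) {
    case 2:
        for (size_t i = 1; i + 1 < n; ++i)
            out[i] = in[i + 1] - in[i];
        break;
    case 6:
        for (size_t i = 1; i + 1 < n; ++i)
            out[i] = in[i] - in[i - 1];
        break;
    default:
        for (size_t i = 1; i + 1 < n; ++i)
            out[i] = 0.5 * (in[i + 1] - in[i - 1]);
        break;
    }
}
```

Both functions compute the same values; only the second avoids the per-iteration compare-and-branch tree. Whether a given compiler version vectorizes either form would need to be checked on the real code.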