https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123051
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, I can reproduce with -Ofast -march=native on Zen4 (not with generic).
With perf the slowdown is much smaller, but still visible:

  50.83%  400971  lbm_peak.amd64-  lbm_peak.amd64-m64-gcc42-nn  [.] LBM_performStreamCollide
  48.09%  389344  lbm_base.amd64-  lbm_base.amd64-m64-gcc42-nn  [.] LBM_performStreamCollide

I'll note there's no vectorization in this function at all, so this is
weird.  There's no loop vectorization at all, thus no REDUC_DEF, and the
patch should be a no-op.  Indeed the slp2 dumps before/after are equal.
Weird.  There are SSA name differences later, but nothing real in
.optimized.  So this isn't "really" caused by this revision.

The difference can probably be explained by the extra

+lbm.c:85:1: note: widen_mult pattern recognized: patt_226 = (long unsigned int) patt_227;
+lbm.c:85:1: note: extra pattern stmt: patt_227 = i_107 w* 8;

as pattern recog now optimistically assumes "used in pattern" (see the
first sketch below).  That's due to an existing x86-specific testcase.

I'll note that an appropriate way to use even/odd would be to shuffle the
lanes back after the operation with a vec_perm and elide the permute when
it feeds a reduction (but permute optimization happens only before code
generation); see the second sketch below.
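For illustration, a minimal C sketch (hypothetical, not the actual lbm.c
source) of the kind of statement the widen_mult pattern recognizer
(vect_recog_widen_mult_pattern) matches: a 32-bit multiply whose result
is immediately widened for address arithmetic, rewritten into a single
widening multiply like the "patt_227 = i_107 w* 8" statement above:

#include <stddef.h>

/* Hypothetical reduced example, not the lbm.c loop: an int value is
   multiplied by a constant and the product used as a 64-bit byte
   offset.  The widen_mult pattern folds the multiply-plus-extend into
   one widening multiply (w*) so vector code can use a single
   widening-multiply instruction.  */
void
gather (double *dst, const double *src, const int *idx, int n)
{
  for (int i = 0; i < n; i++)
    {
      size_t off = (size_t) idx[i] * sizeof (double); /* widen_mult candidate */
      dst[i] = *(const double *) ((const char *) src + off);
    }
}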

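To illustrate the even/odd remark, a sketch using GCC's generic vector
extensions (a hypothetical example, not the patch itself): after an
even/odd operation the result lanes arrive interleaved; a vec_perm could
restore source order, but when the value only feeds a sum reduction the
lane order is irrelevant, so the permute is dead and can be elided:

/* Hypothetical sketch with GCC vector extensions, not compiler code:
   the lanes of V arrive in even/odd-interleaved order.  */
typedef int v4si __attribute__ ((vector_size (16)));

int
sum_interleaved (v4si v)
{
  /* A vec_perm (__builtin_shuffle) would restore source lane order...  */
  v4si restore_order = { 0, 2, 1, 3 };
  v4si restored = __builtin_shuffle (v, restore_order);
  /* ...but for a sum reduction the permute is redundant: summing the
     permuted lanes gives the same result as summing the originals, so
     the shuffle can be elided when it only feeds the reduction.  */
  return restored[0] + restored[1] + restored[2] + restored[3];
}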