https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123051

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, I can reproduce with -Ofast -march=native on Zen4 (not with generic).  With
perf the slowdown is much smaller, but still visible:

  50.83%        400971  lbm_peak.amd64-  lbm_peak.amd64-m64-gcc42-nn  [.] LBM_performStreamCollide
  48.09%        389344  lbm_base.amd64-  lbm_base.amd64-m64-gcc42-nn  [.] LBM_performStreamCollide

I'll note there's no vectorization in this function at all, which is weird:
with no loop vectorization there is no REDUC_DEF, so the patch should be a
no-op.  Indeed the slp2 dumps before/after are identical.

Weird.

There are SSA name differences later, but nothing real in .optimized.

So this isn't "really" caused by this revision.

The difference can probably be explained by the extra

+lbm.c:85:1: note:   widen_mult pattern recognized: patt_226 = (long unsigned
int) patt_227;
+lbm.c:85:1: note:   extra pattern stmt: patt_227 = i_107 w* 8;

as pattern recognition now optimistically assumes "used in pattern".  That's
due to an existing x86-specific testcase.
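For illustration, the source shape this pattern matches looks roughly like the following (a hypothetical sketch with made-up names, not the actual lbm.c code): a 32-bit index is scaled and then widened to 64 bits for addressing, and the recognizer replaces the multiply-plus-convert pair with a single widening multiply (the "i_107 w* 8" pattern stmt in the dump above).

```c
#include <stdint.h>

/* Hypothetical example: 32-bit multiply followed by a widening
   conversion, the pair the widen_mult pattern folds into one
   widening multiply.  */
uint64_t scaled_offset(int32_t i)
{
    int32_t prod = i * 8;              /* 32-bit multiply ...       */
    return (uint64_t)(uint32_t)prod;   /* ... then widening convert */
}
```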

I'll note that an appropriate way to use even/odd would be to shuffle the
lanes back after the operation with a vec_perm and elide the permute when it
feeds a reduction (but permute optimization happens only before code
generation).
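As a rough scalar sketch of that idea (not GCC's internal representation; all names are made up): an even/odd operation leaves results in de-interleaved lane order, a vec_perm-style unshuffle restores source order, and a sum reduction yields the same value either way, which is why the permute can be dropped when the reduction is its only consumer.

```c
#include <stdint.h>

enum { N = 8 };

/* Even/odd widening multiply: results land de-interleaved, even
   lanes in the first half, odd lanes in the second.  */
static void widen_mult_even_odd(const int16_t *a, const int16_t *b,
                                int32_t out[N])
{
    for (int i = 0; i < N / 2; i++) {
        out[i]         = (int32_t)a[2 * i]     * b[2 * i];
        out[N / 2 + i] = (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* The vec_perm step restoring source lane order -- needed when order
   matters, elidable when the only consumer is a sum reduction.  */
static void unshuffle(const int32_t in[N], int32_t out[N])
{
    for (int i = 0; i < N / 2; i++) {
        out[2 * i]     = in[i];
        out[2 * i + 1] = in[N / 2 + i];
    }
}

static int64_t reduce_sum(const int32_t v[N])
{
    int64_t s = 0;
    for (int i = 0; i < N; i++)
        s += v[i];
    return s;
}
```

Since addition is lane-order-insensitive, `reduce_sum` over the de-interleaved vector equals `reduce_sum` over the unshuffled one, so the `unshuffle` is dead in that position.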
