https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
#include <stdlib.h>

struct X { double x[3]; };
typedef double v2df __attribute__((vector_size(16)));

v2df __attribute__((noipa))
foo (struct X x)
{
  return (v2df) { x.x[1], x.x[2] };
}

struct X y;
int main(int argc, char **argv)
{
  struct X x = y;
  int cnt = atoi (argv[1]);
  for (int i = 0; i < cnt; ++i)
    foo (x);
  return 0;
}

also reproduces it.  On both trunk and the branch we see 'foo' using
movups (combine does this as well even when not vectorizing).  Using
-mtune-ctrl=^sse_unaligned_load_optimal improves performance of the
micro benchmark more than 4-fold on Zen2.  Note that this tuning also
causes us to not pass the argument using vector registers even though
the stack slot is aligned (but we use movupd there; we could use an
aligned move - that's a missed optimization).
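For reference, the comparison can be reproduced along these lines (a sketch, assuming an x86-64 GCC; the file name repro.c is hypothetical, and the 4-fold figure was measured on Zen2, so timings will vary by machine):

```shell
# Build the micro benchmark (repro.c holds the test case above) with
# default -O2 tuning, where foo uses a movups unaligned vector load.
gcc -O2 -o repro-default repro.c

# Build again with unaligned SSE loads disabled in the tuning; this
# also changes how the struct argument is set up in the caller.
gcc -O2 -mtune-ctrl=^sse_unaligned_load_optimal -o repro-tuned repro.c

# Time both; the iteration count is passed as argv[1].
time ./repro-default 100000000
time ./repro-tuned 100000000
```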

Note that doing non-vector argument setup but a misaligned vector load
does _not_ improve the situation, so the cray issue is caused solely by
-O2 enabling vectorization, and possibly by the fact that using vector
stores for the argument setup might make them more likely to not be
retired compared to doing more scalar stores.

The same behavior can be observed on Haswell.
