https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---

#include <stdlib.h>

struct X { double x[3]; };
typedef double v2df __attribute__((vector_size(16)));

v2df __attribute__((noipa))
foo (struct X x)
{
  return (v2df) { x.x[1], x.x[2] };
}

struct X y;

int main(int argc, char **argv)
{
  struct X x = y;
  int cnt = atoi (argv[1]);
  for (int i = 0; i < cnt; ++i)
    foo (x);
  return 0;
}

also reproduces it.  On both trunk and the branch we see 'foo' using movups
(combine does this as well even when not vectorizing).  Using
-mtune-ctrl=^sse_unaligned_load_optimal improves performance of the micro
benchmark more than 4-fold on Zen2.  Note that this tuning also causes us to
not pass the argument using vector registers even though the stack slot is
aligned (we use movupd there, where we could use an aligned move - but that's
a missed optimization).

Note that doing non-vector argument setup but a misaligned vector load does
_not_ improve the situation, so the cray issue is solely caused by -O2
enabling vectorization, and eventually by the fact that using vector stores
for the argument setup might make them more likely to not be retired compared
to doing more scalar stores.  The same behavior can be observed on Haswell.