https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81496
Bug ID: 81496
Summary: AVX load from adjacent memory location followed by concatenation
Product: gcc
Version: 7.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jakub at gcc dot gnu.org
Target Milestone: ---

With -O2 -mavx{,2,512f}, on the following testcase:

typedef __int128 V __attribute__((vector_size (32)));
typedef long long W __attribute__((vector_size (32)));
typedef int X __attribute__((vector_size (16)));
typedef __int128 Y __attribute__((vector_size (64)));
typedef long long Z __attribute__((vector_size (64)));

W f1 (__int128 x, __int128 y) { return (W) ((V) { x, y }); }
W f2 (__int128 x, __int128 y) { return (W) ((V) { y, x }); }

we get

        movq    %rdi, -16(%rsp)
        movq    %rsi, -8(%rsp)
        movq    %rdx, -32(%rsp)
        movq    %rcx, -24(%rsp)
        vmovdqa -32(%rsp), %xmm0
        vmovdqa -16(%rsp), %xmm1
        vinserti128 $0x1, %xmm0, %ymm1, %ymm0

for f1, which I'm afraid is hard to do anything about, because the RA didn't see the benefit of spilling in a different order, but for f2:

        movq    %rdx, -32(%rsp)
        movq    %rcx, -24(%rsp)
        vmovdqa -32(%rsp), %xmm0
        movq    %rdi, -16(%rsp)
        movq    %rsi, -8(%rsp)
        vinserti128 $0x1, -16(%rsp), %ymm0, %ymm0

Before scheduling, the vmovdqa is right next to the vinserti128 that loads from the adjacent memory; in that case it might be a win to use a single vmovdqa -32(%rsp), %ymm0 instead. Though, the MEM has just A128 in the RTL dump, so maybe we need to use vmovdqu instead, unless we can prove it is 256-bit aligned (it is in this case, but not in general).