https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92265
Bug ID: 92265
Summary: [x86] Dubious target costs for vec_construct
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, uros at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-linux-gnu

The x86 costs for vec_construct look a little low, especially for -m32.
E.g. gcc.target/i386/pr84101.c has:

---------------------------------------------------
typedef struct uint64_pair uint64_pair_t ;

struct uint64_pair
{
  unsigned long w0 ;
  unsigned long w1 ;
} ;

uint64_pair_t pair(int num)
{
  uint64_pair_t p ;

  p.w0 = num << 1 ;
  p.w1 = num >> 1 ;

  return p ;
}
---------------------------------------------------

where, despite the name, uint64_pair holds two 32-bit elements for -m32
(unsigned long is 32 bits there).

If we consider applying SLP vectorisation to the store, we have the
difference between:

- 2 scalar_stores
- 1 vec_construct + 1 vector_store

The vec_construct cost for 64-bit and 128-bit vectors is:

      int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;

i.e. one SSE op per element.  With -mtune=intel this gives:

- 2 scalar_stores = 3 + 3 cost units
- 1 vec_construct + 1 vector_store = 2 + 3 cost units

But for integer elements, the vec_construct actually needs two
integer-to-vector moves followed by an SSE pack:

        movd    %eax, %xmm1
        movd    %ecx, %xmm0
        punpckldq       %xmm1, %xmm0
        movq    %xmm0, (%edx)

compared to:

        movl    %eax, 4(%edx)
        movl    %ecx, (%edx)

I don't know enough about the Intel uarchs to know whether there's a
significant difference between these two in practice.

But as Alexander points out, things are much worse if the elements are
DImode rather than SImode, i.e. if we change the above "unsigned long"s
to "__UINT64_TYPE__"s.  We then end up spilling the four registers to
the stack, loading them into a vector register, and then storing that
vector register out separately:

        movl    %edx, 8(%esp)
        ...
        movl    %edx, 12(%esp)
        movq    8(%esp), %xmm0
        movl    %eax, 8(%esp)
        ...
        movl    %edx, 12(%esp)
        movhps  8(%esp), %xmm0
        movups  %xmm0, (%ecx)

vs. 4 scalar stores directly to (%ecx).

Here we're operating on DIs and V2DIs, but the costs are the same as
for SI vs. V2SI:

- 2 scalar_stores = 3 + 3 cost units
- 1 vec_construct + 1 vector_store = 2 + 3 cost units

So as far as the vectoriser is concerned, the vector form seems cheaper.
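
For reference, the DImode variant described above is just the same
testcase with the element type changed (not part of pr84101.c itself;
shown here only for convenience):

---------------------------------------------------
typedef struct uint64_pair uint64_pair_t ;

struct uint64_pair
{
  __UINT64_TYPE__ w0 ;
  __UINT64_TYPE__ w1 ;
} ;

uint64_pair_t pair(int num)
{
  uint64_pair_t p ;

  p.w0 = num << 1 ;
  p.w1 = num >> 1 ;

  return p ;
}
---------------------------------------------------

Compiling this with -m32 should reproduce the spill-and-reload sequence
quoted above.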
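
One possible direction (only a sketch, not a tested patch, and the cost
parameter names below are placeholders rather than existing ix86_cost
fields): make vec_construct charge separately for the integer-to-vector
moves and for the pack operations that merge the elements, instead of
one generic SSE op per element:

---------------------------------------------------
/* Sketch only: a vec_construct cost that accounts for getting integer
   elements into vector registers.  GPR_TO_XMM_COST is a placeholder for
   whatever the target would charge for a movd/movq from an integer
   register; SSE_OP_COST corresponds to ix86_cost->sse_op.  */

static int
vec_construct_cost_sketch (int nelements, int gpr_to_xmm_cost,
                           int sse_op_cost)
{
  /* One integer-to-vector move per element...  */
  int cost = nelements * gpr_to_xmm_cost;

  /* ...plus nelements - 1 pack/unpck operations to merge the elements
     into a single vector (the punpckldq in the V2SI example above).  */
  cost += (nelements - 1) * sse_op_cost;

  return cost;
}
---------------------------------------------------

E.g. if the integer-to-vector move were costed at 2, the V2SI
construction would come out as 2*2 + 1 = 5, so the vectorised form would
be 5 + 3 = 8 against 3 + 3 = 6 for the two scalar stores, and SLP would
keep the scalar code.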