https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819
Bug ID: 80819
Summary: [5/6/7/8 regression] Useless store to the stack in
_mm_set_epi64x with SSE4 -mno-avx
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include <immintrin.h>
__m128i combine64(long long a, long long b) {
  return _mm_set_epi64x(b,a);
}
gcc5/6/7/8-snapshot with -O3 -msse4 -mtune=haswell emits:
movq %rdi, %xmm0
movq %rsi, -16(%rsp) # dead store into the red-zone
pinsrq $1, %rsi, %xmm0
The same thing happens with -mtune=generic -msse4: it stores both halves to
memory, but only reloads the first half. The upper half is transferred with
pinsrq:
movq %rdi, -16(%rsp)
movq %rsi, -24(%rsp) # dead store
movq -16(%rsp), %xmm0
pinsrq $1, %rsi, %xmm0
-mavx avoids the useless store, for both -mtune=generic and -mtune=haswell.
This is a leftover from the store/reload strategy gcc uses without SSE4 (which
is worse than movq/movq/punpcklqdq, but that's a separate bug):
movq %rsi, -16(%rsp)
movq %rdi, %xmm0
movhps -16(%rsp), %xmm0
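For reference, the movq/movq/punpcklqdq alternative mentioned above can be
spelled out with SSE2 intrinsics. This is only a sketch: the function names
combine64_unpack and self_check_combine64 are made up for illustration, it
assumes an x86-64 target (_mm_cvtsi64_si128 is not available in 32-bit mode),
and the compiler remains free to select different instructions than the ones
named in the comments.

```c
#include <assert.h>     /* assert, used by the self-check below */
#include <string.h>     /* memcmp */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build the vector without going through memory. */
__m128i combine64_unpack(long long a, long long b) {
    __m128i lo = _mm_cvtsi64_si128(a);   /* intended: movq %rdi, %xmm0 */
    __m128i hi = _mm_cvtsi64_si128(b);   /* intended: movq %rsi, %xmm1 */
    return _mm_unpacklo_epi64(lo, hi);   /* intended: punpcklqdq %xmm1, %xmm0 */
}

/* Checks that the result matches _mm_set_epi64x(b, a) bit for bit. */
void self_check_combine64(void) {
    __m128i got  = combine64_unpack(1, 2);
    __m128i want = _mm_set_epi64x(2, 1);
    assert(memcmp(&got, &want, sizeof want) == 0);
}
```

Comparing the asm for this function against combine64() at the same options
is one way to see whether the store/reload comes from the _mm_set_epi64x
expansion rather than from the argument passing.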
It's a regression from gcc4.x, where we get the expected good sequence for
-msse4 -mtune=haswell:
movq %rdi, %xmm0
pinsrq $1, %rsi, %xmm0
---------------
It doesn't happen for _mm_set_epi32, e.g.
__m128i combine32(int a, int b, int c, int d) {
  return _mm_set_epi32(d,c,b,a);
}
compiles (with -mtune=haswell -msse4) to code that looks good to me:
movd %edx, %xmm1
movd %edi, %xmm0
pinsrd $1, %ecx, %xmm1
pinsrd $1, %esi, %xmm0
punpcklqdq %xmm1, %xmm0
clang uses 1 movd and 3x pinsrd, which is 2 bytes shorter and also 7 uops for
port5 on Haswell, but has slightly less ILP. (On CPUs where pinsrd is 2 uops,
the first one is probably an int->vector uop that can run before the
destination vector is ready.)
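The two strategies can be written out by hand with intrinsics to compare them
side by side. This is a sketch, not what either compiler does internally: the
function names are hypothetical, the target("sse4.1") attribute is used only
so the example builds without -msse4, and the optimizer may still rewrite
either function into the other form.

```c
#include <assert.h>     /* assert, used by the self-check below */
#include <string.h>     /* memcmp */
#include <immintrin.h>  /* SSE2 and SSE4.1 intrinsics */

/* gcc-like shape: two independent movd + pinsrd chains, merged at the end. */
__attribute__((target("sse4.1")))
__m128i combine32_two_chains(int a, int b, int c, int d) {
    __m128i lo = _mm_cvtsi32_si128(a);   /* movd   */
    __m128i hi = _mm_cvtsi32_si128(c);   /* movd   */
    lo = _mm_insert_epi32(lo, b, 1);     /* pinsrd $1 */
    hi = _mm_insert_epi32(hi, d, 1);     /* pinsrd $1 */
    return _mm_unpacklo_epi64(lo, hi);   /* punpcklqdq */
}

/* clang-like shape: one movd, then three dependent pinsrd instructions. */
__attribute__((target("sse4.1")))
__m128i combine32_one_chain(int a, int b, int c, int d) {
    __m128i v = _mm_cvtsi32_si128(a);    /* movd   */
    v = _mm_insert_epi32(v, b, 1);       /* pinsrd $1 */
    v = _mm_insert_epi32(v, c, 2);       /* pinsrd $2 */
    v = _mm_insert_epi32(v, d, 3);       /* pinsrd $3 */
    return v;
}

/* Both shapes must produce the same bits as _mm_set_epi32(d, c, b, a). */
void self_check_combine32(void) {
    __m128i want = _mm_set_epi32(4, 3, 2, 1);
    __m128i x = combine32_two_chains(1, 2, 3, 4);
    __m128i y = combine32_one_chain(1, 2, 3, 4);
    assert(memcmp(&x, &want, sizeof want) == 0);
    assert(memcmp(&y, &want, sizeof want) == 0);
}
```

In the two-chain version the lo and hi chains do not depend on each other,
which is where the extra ILP comes from; the one-chain version serializes all
three inserts on the same destination register.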
-mtune=generic still stores/reloads instead of using movd for %edi and %edx,
which is worse for most CPUs. (Which is a bug, IMO: I'll file a separate bug
for that.) But it does then use pinsrd with a register source for %ecx and
%esi, instead of a store/reload there.