https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
Bug ID: 104151
Summary: x86: excessive code generated for 128-bit byteswap
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: nekotekina at gmail dot com
Target Milestone: ---
Hello, I noticed that GCC generates a redundant sequence of instructions for code
that does a 128-bit byteswap implemented with two 64-bit byteswap intrinsics. I
narrowed it down to something like this:
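#include <cstdint>
#include <cstring>

__uint128_t bswap(__uint128_t a)
{
    std::uint64_t x[2];
    std::memcpy(x, &a, 16);            // split the value into two 64-bit halves
    std::uint64_t y[2];
    y[0] = __builtin_bswap64(x[1]);    // exchange the halves, byteswapping each
    y[1] = __builtin_bswap64(x[0]);
    std::memcpy(&a, y, 16);            // reassemble the 128-bit result
    return a;
}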
Produces:
https://godbolt.org/z/hEsPqvhv3
mov QWORD PTR [rsp-24], rdi
mov QWORD PTR [rsp-16], rsi
movdqa xmm0, XMMWORD PTR [rsp-24]
palignr xmm0, xmm0, 8
movdqa xmm1, xmm0
pshufb xmm1, XMMWORD PTR .LC0[rip]
movaps XMMWORD PTR [rsp-24], xmm1
mov rax, QWORD PTR [rsp-24]
mov rdx, QWORD PTR [rsp-16]
ret
Expected (alternatively, for SIMD types, a single pshufb; clang can do it):
mov rdx, rdi
mov rax, rsi
bswap rdx
bswap rax
ret
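For comparison, here is a sketch of the same operation written with shifts instead of memcpy, which corresponds one-to-one to the expected sequence above (low half in rdi, high half in rsi, result in rdx:rax). The name bswap_shift is made up for illustration, and I have not verified what code GCC actually emits for this variant; it is only meant to show the intended semantics.

```cpp
#include <cstdint>

// Hypothetical shift-based variant of the reproducer (name is illustrative).
// Byteswapping a 128-bit value means byteswapping each 64-bit half and
// exchanging the two halves.
__uint128_t bswap_shift(__uint128_t a)
{
    std::uint64_t lo = static_cast<std::uint64_t>(a);        // low 64 bits
    std::uint64_t hi = static_cast<std::uint64_t>(a >> 64);  // high 64 bits

    // The swapped low half becomes the new high half and vice versa.
    return (static_cast<__uint128_t>(__builtin_bswap64(lo)) << 64)
         | __builtin_bswap64(hi);
}
```

On a little-endian target this should return the same value as the memcpy-based bswap above, since x[0] is the low half and x[1] the high half there.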