https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
Bug ID: 104151
Summary: x86: excessive code generated for 128-bit byteswap
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: nekotekina at gmail dot com
Target Milestone: ---

Hello, I noticed that gcc generates a redundant sequence of instructions for code that performs a 128-bit byteswap implemented with two 64-bit byteswap intrinsics. I narrowed it down to something like this:

    __uint128_t bswap(__uint128_t a)
    {
        std::uint64_t x[2];
        memcpy(x, &a, 16);
        std::uint64_t y[2];
        y[0] = __builtin_bswap64(x[1]);
        y[1] = __builtin_bswap64(x[0]);
        memcpy(&a, y, 16);
        return a;
    }

Produces: https://godbolt.org/z/hEsPqvhv3

    mov     QWORD PTR [rsp-24], rdi
    mov     QWORD PTR [rsp-16], rsi
    movdqa  xmm0, XMMWORD PTR [rsp-24]
    palignr xmm0, xmm0, 8
    movdqa  xmm1, xmm0
    pshufb  xmm1, XMMWORD PTR .LC0[rip]
    movaps  XMMWORD PTR [rsp-24], xmm1
    mov     rax, QWORD PTR [rsp-24]
    mov     rdx, QWORD PTR [rsp-16]
    ret

Expected (alternatively, for SIMD types, a single pshufb; clang can do it):

    mov     rdx, rdi
    mov     rax, rsi
    bswap   rdx
    bswap   rax
    ret
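
For anyone who wants to reproduce this locally, here is a minimal self-contained sketch of the test case, with the headers it needs and a byte-by-byte reference to confirm the transformation really is a full 128-bit byte reversal. The names bswap128, bswap128_reference, make_u128 and self_check are hypothetical and only used for this illustration; the report itself only contains the bswap function above. It assumes a GCC-compatible compiler that provides __uint128_t and __builtin_bswap64.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Same approach as the reported test case: swap the two 64-bit halves
// and byteswap each of them. (Hypothetical name for this sketch.)
__uint128_t bswap128(__uint128_t a)
{
    std::uint64_t x[2];
    std::memcpy(x, &a, sizeof a);
    std::uint64_t y[2];
    y[0] = __builtin_bswap64(x[1]);
    y[1] = __builtin_bswap64(x[0]);
    std::memcpy(&a, y, sizeof a);
    return a;
}

// Reference: reverse the 16 bytes of the object representation one by one.
__uint128_t bswap128_reference(__uint128_t a)
{
    unsigned char in[16], out[16];
    std::memcpy(in, &a, sizeof a);
    for (int i = 0; i < 16; ++i)
        out[i] = in[15 - i];
    std::memcpy(&a, out, sizeof a);
    return a;
}

// Build a 128-bit value from two 64-bit halves (there are no 128-bit literals).
__uint128_t make_u128(std::uint64_t hi, std::uint64_t lo)
{
    return (static_cast<__uint128_t>(hi) << 64) | lo;
}

// Sanity checks: the two-intrinsic version matches the reference,
// produces the expected value, and is its own inverse.
void self_check()
{
    const __uint128_t v = make_u128(0x0011223344556677ULL, 0x8899aabbccddeeffULL);
    assert(bswap128(v) == bswap128_reference(v));
    assert(bswap128(v) == make_u128(0xffeeddccbbaa9988ULL, 0x7766554433221100ULL));
    assert(bswap128(bswap128(v)) == v);
}
```

Calling self_check() from main and compiling bswap128 on its own with optimization enabled (for example -O2, which is an assumption here; the exact flags are in the godbolt link) should allow comparing the generated code against the expected two-bswap sequence.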