https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

            Bug ID: 104151
           Summary: x86: excessive code generated for 128-bit byteswap
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I noticed that GCC generates a redundant sequence of instructions for
code that performs a 128-bit byteswap implemented with two 64-bit byteswap
intrinsics. I narrowed it down to something like this:

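#include <cstdint>
#include <cstring>

__uint128_t bswap(__uint128_t a)
{
    std::uint64_t x[2];
    std::memcpy(x, &a, 16);         // split the value into two 64-bit halves
    std::uint64_t y[2];
    y[0] = __builtin_bswap64(x[1]); // exchange the halves, byteswapping each
    y[1] = __builtin_bswap64(x[0]);
    std::memcpy(&a, y, 16);
    return a;
}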

Produces:
https://godbolt.org/z/hEsPqvhv3

        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        movdqa  xmm0, XMMWORD PTR [rsp-24]
        palignr xmm0, xmm0, 8
        movdqa  xmm1, xmm0
        pshufb  xmm1, XMMWORD PTR .LC0[rip]
        movaps  XMMWORD PTR [rsp-24], xmm1
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]
        ret

Expected (alternatively, for SIMD types, a single pshufb; clang can do that):

        mov     rdx, rdi
        mov     rax, rsi
        bswap   rdx
        bswap   rax
        ret
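
For comparison, the same operation can be written with shifts instead of
memcpy, so the source never refers to the 128-bit value through memory. This
is only a sketch: the name bswap128_shift is made up for illustration, and I
have not checked which instruction sequence any particular GCC version emits
for it. It computes the same result as the function above on little-endian
x86-64.

```cpp
#include <cstdint>

// Hypothetical shift-based variant of the reported function.
// Each 64-bit half is byteswapped and the halves exchange places.
__uint128_t bswap128_shift(__uint128_t a)
{
    std::uint64_t lo = static_cast<std::uint64_t>(a);       // low 64 bits
    std::uint64_t hi = static_cast<std::uint64_t>(a >> 64); // high 64 bits
    // The swapped low half becomes the new high half and vice versa.
    return (static_cast<__uint128_t>(__builtin_bswap64(lo)) << 64)
         | __builtin_bswap64(hi);
}
```

If this form compiles to the short sequence while the memcpy form does not,
that would suggest the extra stores and loads come from how the memcpy through
the stack temporaries is handled rather than from the byteswap itself.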
