https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91940
Bug ID: 91940
Summary: __builtin_bswap16 loop optimization
Product: gcc
Version: 9.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: matwey.kornilov at gmail dot com
Target Milestone: ---

Created attachment 46984
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46984&action=edit
code snippet

Hello,

I am using "gcc (SUSE Linux) 9.2.1 20190903 [gcc-9-branch revision 275330]" and I see the following performance issue with __builtin_bswap16() on the x86_64 platform. Attached is a code sample implementing byte swapping for arrays of 2-byte words.

The following code (when compiled with -O3)

inline void swab_bi(const void* from, void* to, std::size_t size) {
        const auto begin = reinterpret_cast<const std::uint16_t*>(from);
        const auto end = reinterpret_cast<const std::uint16_t*>(reinterpret_cast<const std::uint8_t*>(from) + size);
        auto out = reinterpret_cast<std::uint16_t*>(to);

        for(auto it = begin; it != end; ++it) {
                *(out++) = __builtin_bswap16(*it);
        }
}

takes 0.023 sec. on average to execute on my hardware (Intel Core i5), while the following code

inline void swab(const void* from, void* to, std::size_t size) {
        const auto begin = reinterpret_cast<const std::uint16_t*>(from);
        const auto end = reinterpret_cast<const std::uint16_t*>(reinterpret_cast<const std::uint8_t*>(from) + size);
        auto out = reinterpret_cast<std::uint16_t*>(to);

        for(auto it = begin; it != end; ++it) {
                *(out++) = ((*it & 0xFF) << 8) | ((*it & 0xFF00) >> 8);
        }
}

is *more* efficient: it takes only 0.011 sec.

When I dump the assembler output for both functions, I see that packed instructions are used for the latter case:

        movdqu  0(%rbp,%rax), %xmm0
        movdqa  %xmm0, %xmm1
        psllw   $8, %xmm0
        psrlw   $8, %xmm1
        por     %xmm1, %xmm0
        movups  %xmm0, (%r12,%rax)
        addq    $16, %rax

while rolw is used for the former case:

        movzwl  0(%rbp,%rax), %edx
        rolw    $8, %dx
        movw    %dx, (%r12,%rax)
        addq    $2, %rax