https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91940
Bug ID: 91940
Summary: __builtin_bswap16 loop optimization
Product: gcc
Version: 9.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: matwey.kornilov at gmail dot com
Target Milestone: ---

Created attachment 46984
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46984&action=edit
code snippet

Hello,

I am using "gcc (SUSE Linux) 9.2.1 20190903 [gcc-9-branch revision 275330]" and I see the following performance issue with __builtin_bswap16() on the x86_64 platform. Attached is a code sample implementing byte swapping for arrays of 2-byte words.

The following code (when compiled with -O3)

inline void swab_bi(const void* from, void* to, std::size_t size) {
        const auto begin = reinterpret_cast<const std::uint16_t*>(from);
        const auto end = reinterpret_cast<const std::uint16_t*>(reinterpret_cast<const std::uint8_t*>(from) + size);
        auto out = reinterpret_cast<std::uint16_t*>(to);

        for(auto it = begin; it != end; ++it) {
                *(out++) = __builtin_bswap16(*it);
        }
}

takes 0.023 sec. on average to execute on my hardware (Intel Core i5), while the following code

inline void swab(const void* from, void* to, std::size_t size) {
        const auto begin = reinterpret_cast<const std::uint16_t*>(from);
        const auto end = reinterpret_cast<const std::uint16_t*>(reinterpret_cast<const std::uint8_t*>(from) + size);
        auto out = reinterpret_cast<std::uint16_t*>(to);

        for(auto it = begin; it != end; ++it) {
                *(out++) = ((*it & 0xFF) << 8) | ((*it & 0xFF00) >> 8);
        }
}

is *more* efficient: it takes only 0.011 sec.

When I dump the assembler output for both functions, I see that packed instructions are used for the latter case:

        movdqu  0(%rbp,%rax), %xmm0
        movdqa  %xmm0, %xmm1
        psllw   $8, %xmm0
        psrlw   $8, %xmm1
        por     %xmm1, %xmm0
        movups  %xmm0, (%r12,%rax)
        addq    $16, %rax

while rolw is used for the former case:

        movzwl  0(%rbp,%rax), %edx
        rolw    $8, %dx
        movw    %dx, (%r12,%rax)
        addq    $2, %rax