https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
With -fno-tree-vectorize, GCC also produces optimal code:

        mov     rax, rsi
        mov     rdx, rdi
        bswap   rax
        bswap   rdx
        ret

I guess it's related to the vectorizer cost model.
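
For readers without the original testcase at hand, here is a minimal sketch of a function whose straightforward scalar lowering matches the five instructions above. It is an assumption inferred from the assembly, not the reproducer from the report: the names `u128_pair` and `bswap128_pair` are hypothetical, and the register mapping in the comments assumes the x86-64 System V calling convention.

```c
#include <stdint.h>

/* Hypothetical stand-in for the function under discussion (assumed,
   not copied from the bug report). Two 64-bit halves are passed in
   rdi/rsi and the 16-byte struct is returned in rax/rdx. */
typedef struct {
    uint64_t lo;  /* returned in rax */
    uint64_t hi;  /* returned in rdx */
} u128_pair;

/* Byte-swap a 128-bit value given as (lo, hi): each half is
   byte-reversed and the two halves trade places. */
u128_pair bswap128_pair(uint64_t lo, uint64_t hi)
{
    u128_pair r;
    r.lo = __builtin_bswap64(hi);  /* mov rax, rsi ; bswap rax */
    r.hi = __builtin_bswap64(lo);  /* mov rdx, rdi ; bswap rdx */
    return r;
}
```

To compare the two code paths, something like `gcc -O2 -S -masm=intel` versus `gcc -O2 -fno-tree-vectorize -S -masm=intel` on the same file should show whether the vectorizer is what changes the output. Whether this particular sketch triggers the same difference as the reported testcase is not verified here.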