https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
--- Comment #5 from gpnuma at centaurean dot com ---
Which gcc and which clang? On my platform, with the above code, if you isolate
3 bytes at a time and 5 bytes at a time (by doing manual unrolling), gcc is way
slower than clang.
Or maybe it's the interaction with the bit masking that causes the problem?

(In reply to H.J. Lu from comment #4)
> I compared __builtin_memcpy one size at a time. Here are results in
> cycles:
>
> clang 1 bytes: 17193410146
> gcc 1 bytes: 15440244966
> clang 2 bytes: 8997535880
> gcc 2 bytes: 8147449530
> clang 3 bytes: 6002276628
> gcc 3 bytes: 5430387704
> clang 4 bytes: 4497121282
> gcc 4 bytes: 4069604454
> clang 5 bytes: 3644879742
> gcc 5 bytes: 3258094970
> clang 6 bytes: 3045612708
> gcc 6 bytes: 2728410608
> clang 7 bytes: 2574110178
> gcc 7 bytes: 2330365680
> clang 8 bytes: 969894432
> gcc 8 bytes: 6436950208
>
> GCC is faster except for 8 byte size.
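
For reference, here is a minimal sketch of what "one size at a time" could look like for the 3, 5 and 8 byte cases. It is not the test case from this report and not the harness used in comment #4. The macro DEFINE_FIXED_COPY, the functions copy_3/copy_5/copy_8 and the helper time_copy are hypothetical names made up for illustration, and clock() is used here instead of a cycle counter. A real measurement would also have to make sure the compiler does not optimize the repeated copies away.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: generates copy_N(), which copies `count` chunks of
   exactly N bytes each. N is a compile-time constant, so gcc and clang are
   free to expand __builtin_memcpy inline instead of calling memcpy. */
#define DEFINE_FIXED_COPY(N)                                            \
    static void copy_##N(unsigned char *dst, const unsigned char *src,  \
                         size_t count)                                  \
    {                                                                   \
        for (size_t i = 0; i < count; i++)                              \
            __builtin_memcpy(dst + i * (N), src + i * (N), (N));        \
    }

DEFINE_FIXED_COPY(3)
DEFINE_FIXED_COPY(5)
DEFINE_FIXED_COPY(8)

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Runs `reps` passes of one copy routine and returns the elapsed CPU time
   in clock() ticks (assumption: clock() resolution is good enough). */
static clock_t time_copy(copy_fn fn, unsigned char *dst,
                         const unsigned char *src, size_t count, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src, count);
    return clock() - start;
}
```

Calling time_copy(copy_3, ...), time_copy(copy_5, ...) and time_copy(copy_8, ...) on the same buffers, once per compiler, gives one number per size to compare.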