https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
--- Comment #17 from npl at chello dot at ---

I got interrupted by a colleague at work, part 2 of the ramblings...

Whatever you could argue against memcpy being replaced by simpler instructions, it doesn't change that the same issue persists with the __builtin_memcpy function, which explicitly says you want the optimizations.

A pointer to a uint32 can be assumed to be properly aligned; CREATING such a pointer that's not aligned is already undefined behaviour by the standard (the compiler could zero out bits, for example). I don't think that what happens afterwards with something that shouldn't exist in the first place is an argument against optimizing proper code.

Further, I lack a consistent way of dealing with potentially aliasing pointers. Using memcpy seems the sanest way, simply because it's standards compliant, supported everywhere, and your code won't mysteriously break once you use LTO or higher optimization settings. Compilers have been able to reliably detect and replace memcpy for years (ignoring this issue, which I would consider a bug), so there is no drawback. It's a feature common pretty much everywhere, and a valid recommendation in many discussions related to the topic.

Consider the example below for illustration. FIXEDMEMCPY is how the plain memcpy should work, and already does work for archs with unaligned access. (I had planned to post the code for 32-bit x86, but the assembly is rather ugly; amd64 would work with "unsigned long" and "unsigned long long".) I have already run into such issues when different software components define their own fixed-width types. It's a practical issue where pointing to paragraphs of the standard doesn't help, unless you provide a proper solution with it. The FIXEDMEMCPY hack is fine for gcc but compiler-specific.

In short:
* Optimizing memcpy to simple instructions is a reality and expected; the behaviour (slow code) on arm (and other archs with required alignment) is an unwelcome oddity
* memcpy is one of the few ways to deal with aliasing, and the most standards compliant (there are unions too, but that's not standards compliant)
* I don't see a problem in replacing standard functions (and __builtin_memcpy has the same issue)
* I don't see a problem in expecting a correctly aligned pointer, and doing undefined behaviour if the pointer could cause undefined behaviour.

typedef unsigned uint32_t;
typedef unsigned long uint32_alt;

_Static_assert(sizeof(uint32_t) == sizeof(uint32_alt),
               "you picked a bad architecture or typedefs for this example");

#define FIXEDMEMCPY(a, b, s) \
    __builtin_memcpy(__builtin_assume_aligned((a), __alignof__(*(a))), \
                     __builtin_assume_aligned((b), __alignof__(*(b))), (s))

unsigned breakme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
{
    /* normally in different compilation units, but LTO doesn't care */
    *ptr = 0;
    *ptr2 = a;
    return *ptr;
}

unsigned fixme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
{
    /* fixes aliasing, but should be as fast as simple accesses */
    uint32_t val = 0;
    FIXEDMEMCPY(ptr, &val, 4);
    FIXEDMEMCPY(ptr2, &a, 4);
    uint32_t val2;
    FIXEDMEMCPY(&val2, ptr, 4);
    return val2;
}

00000000 <breakme>:
   0: e3a03000  mov r3, #0
   4: e5803000  str r3, [r0]
   8: e1a00003  mov r0, r3        // Oops: retval = 0
   c: e5812000  str r2, [r1]
  10: e12fff1e  bx lr

00000014 <fixme>:
  14: e3a03000  mov r3, #0
  18: e5803000  str r3, [r0]
  1c: e5812000  str r2, [r1]
  20: e5900000  ldr r0, [r0]      // The load that's missing above
  24: e24dd010  sub sp, sp, #16   // Time for another
  28: e28dd010  add sp, sp, #16   // Bugreport ?
  2c: e12fff1e  bx lr
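
For comparison, here is a rough sketch of what the compiler-independent variant of fixme() could look like with plain memcpy and no gcc builtins. The helper names load_u32, store_u32 and fixme_portable are made up for this illustration and do not come from the report. Because the helpers take void pointers, the compiler gets no alignment information at all, so on strict-alignment targets this form is presumably the one that ends up slow; that is the behaviour this report is about.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 32 bits from p via memcpy, so no
   uint32_t lvalue is ever formed on the pointed-to object. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write 32 bits to p via memcpy. */
static inline void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Same access pattern as fixme(): store 0 through ptr, store a through
   ptr2, then reload through ptr. If both pointers refer to the same
   object, the reload must observe a. */
uint32_t fixme_portable(void *ptr, void *ptr2, uint32_t a)
{
    store_u32(ptr, 0);
    store_u32(ptr2, a);
    return load_u32(ptr);
}
```

Calling fixme_portable(&x, &x, 42) on a single uint32_t x should return 42, which is the result breakme() fails to give when its two pointers alias.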