https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to npl from comment #17)
> I got interrupted by a colleague at work, part 2 of the ramblings...
>
> Everything you could argue against memcpy being replaced by simpler
> instructions doesn't change that the same issue persists with the
> __builtin_memcpy function, which is explicitly saying you want the
> optimizations.
>
> A pointer to a uint32 can be assumed to be properly aligned; CREATING such a
> pointer that's not aligned is already undefined behaviour by the standard
> (the compiler could zero out bits, for example). I don't think that what
> happens afterwards with something that shouldn't exist in the first place is
> an argument against optimizing proper code.
>
> Further, I lack a consistent way of dealing with potentially aliasing
> pointers. Using memcpy seems the sanest way, simply because it's standards
> compliant, supported everywhere, and your code won't mysteriously break once
> you use LTO or higher optimization settings.
> Compilers have been able to reliably detect and replace memcpy for years
> (ignoring this issue, which I would consider a bug), so there is no drawback.
> It's a feature common pretty much everywhere, and a valid recommendation in
> many discussions related to the topic.
>
> Consider the example below for illustration; FIXEDMEMCPY is how the plain
> memcpy should work and already does work for archs with unaligned access.
> (I had planned to post the code for 32-bit x86, but the assembly is rather
> ugly; amd64 would work with "unsigned long" and "unsigned long long").
>
> I already ran into such issues, when different software components define
> their own fixed-width types. It's a practical issue where pointing to
> paragraphs of the standard doesn't help, unless you provide a proper solution
> with it. The FIXEDMEMCPY hack is fine for gcc but compiler-specific.
>
> In short:
> * Optimizing memcpy to simple instructions is a reality and expected; the
>   behaviour (slow code) on ARM (and other archs with required alignment) is
>   an unwelcome oddity
> * memcpy is one of the few ways to deal with aliasing, and the most
>   standards compliant. (there's unions too, but that's not standards
>   compliant)
> * I don't see a problem in replacing standard functions (and
>   __builtin_memcpy has the same issue)
> * I don't see a problem in expecting a correctly aligned pointer, and doing
>   undefined behaviour if the pointer could cause undefined behaviour.
>
>
>
> typedef unsigned uint32_t;
> typedef unsigned long uint32_alt;
> _Static_assert(sizeof(uint32_t) == sizeof(uint32_alt),
>                "you picked a bad architecture or typedefs for this example");
>
> #define FIXEDMEMCPY(a, b, s) \
>     __builtin_memcpy(__builtin_assume_aligned(a, __alignof__(*a)), \
>                      __builtin_assume_aligned(b, __alignof__(*b)), s)
>
> unsigned breakme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> {
>     /* normally in different compilation units, but LTO doesn't care */
>     *ptr = 0;
>     *ptr2 = a;
>     return *ptr;
> }
>
> unsigned fixme(uint32_t *ptr, uint32_alt *ptr2, uint32_t a)
> {
>     /* fixes aliasing, but should be as fast as simple accesses */
>     uint32_t val = 0;
>     FIXEDMEMCPY(ptr, &val, 4);
>     FIXEDMEMCPY(ptr2, &a, 4);
>     uint32_t val2;
>     FIXEDMEMCPY(&val2, ptr, 4);
>     return val2;
> }
>
> 00000000 <breakme>:
>    0: e3a03000 mov r3, #0
>    4: e5803000 str r3, [r0]
>    8: e1a00003 mov r0, r3 // Oops: retval = 0
>    c: e5812000 str r2, [r1]
>   10: e12fff1e bx lr
>
> 00000014 <fixme>:
>   14: e3a03000 mov r3, #0
>   18: e5803000 str r3, [r0]
>   1c: e5812000 str r2, [r1]
>   20: e5900000 ldr r0, [r0] // The load that's missing above
>   24: e24dd010 sub sp, sp, #16 // Time for another
>   28: e28dd010 add sp, sp, #16 // bug report?
>   2c: e12fff1e bx lr

It's not done on STRICT_ALIGNMENT platforms because not all of those
expand misaligned moves correctly (IIRC).
Looking at RTL expansion, at least the misaligned destination will work
correctly. The remaining question is what happens for -Os and, for example,
when both source and destination are misaligned. Or on x86, where a simple
rep movsb is possible (plus the register setup, of course).
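
For illustration only, the two situations discussed here can be isolated in a small C sketch: a 4-byte copy where neither pointer carries an alignment promise (so both source and destination may be misaligned), and the same copy with the alignment promised to the compiler. The helper names `copy_u32_any` and `copy_u32_aligned` are hypothetical and do not appear in the report; `__builtin_assume_aligned` and `__alignof__` are GNU extensions, so this assumes GCC or a compatible compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the report): 4-byte copy with no
   alignment promise, so both src and dst may be misaligned. */
static inline void copy_u32_any(void *dst, const void *src)
{
    memcpy(dst, src, sizeof(uint32_t));
}

/* Hypothetical helper: the same copy, but the pointer types and
   __builtin_assume_aligned (GNU extension) promise natural alignment. */
static inline void copy_u32_aligned(uint32_t *dst, const uint32_t *src)
{
    /* Debug-build check of the promise made to the compiler below. */
    assert((uintptr_t)dst % __alignof__(uint32_t) == 0);
    assert((uintptr_t)src % __alignof__(uint32_t) == 0);

    memcpy(__builtin_assume_aligned(dst, __alignof__(uint32_t)),
           __builtin_assume_aligned(src, __alignof__(uint32_t)),
           sizeof(uint32_t));
}
```

Comparing the code generated for the two helpers at `-O2` and `-Os` on a strict-alignment target is one way to check which of these cases the compiler expands inline and which still end up as a library call.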