On Fri, Feb 19, 2016 at 10:44 AM, Phil Ruffwind <r...@rufflewind.com> wrote:
> I tried to look for a workaround for this.  It seemed that using a
> union instead of memcpy was enough to convince GCC to optimize into a
> single "mov".
>
>     struct alpha unpack(uint64_t x)
>     {
>         union {
>             struct alpha r;
>             uint64_t i;
>         } u;
>         u.i = x;
>         return u.r;
>     }
>
> But that trick turned out to be short-lived.  If I wrap the wrapper
> with another function:
>
>     struct alpha wrapperwrapper(uint64_t y)
>     {
>         return wrapper(y);
>     }
>
> I get the same 37-line assembly generated for this function.  What's
> even more strange is that if I just define two identical wrappers in
> the same translation unit:
>
>     struct alpha wrapper(uint64_t y)
>     {
>         return unpack(y);
>     }
>
>     struct alpha wrapper2(uint64_t y)
>     {
>         return unpack(y);
>     }
>
> One of them gets optimized perfectly, while the other fails, even
> though the bodies of the two functions are completely identical!

Yes, as I said, GCC tries to optimize the copy that results from copying
the returned aggregate into the caller's return-value slot.  GCC hopes
for follow-up optimization opportunities here, but obviously there are
none in this case.

Can you please open a bug report?  We may eventually be able to tweak
the SRA heuristics in some way here.

Note that you only get good code because the aggregate is passed and
returned in a register (and thus "alignment" doesn't matter here) -
something which is exposed to GCC too late for SRA to make use of that
fact (well, easily at least).

Richard.
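
For anyone who wants to reproduce the comparison, a minimal
self-contained sketch follows.  The definition of struct alpha is not
shown in this thread, so the layout below is an assumption (an 8-byte
aggregate with 4-byte alignment); the names unpack_union, unpack_memcpy
and pack are likewise made up for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the thread does not show struct alpha.
   Any padding-free 8-byte aggregate should serve the same purpose. */
struct alpha {
    uint32_t a;
    uint16_t b;
    uint16_t c;
};

static_assert(sizeof(struct alpha) == sizeof(uint64_t),
              "struct alpha must be exactly 8 bytes");

/* Union variant, as in the quoted message. */
struct alpha unpack_union(uint64_t x)
{
    union {
        struct alpha r;
        uint64_t i;
    } u;
    u.i = x;
    return u.r;
}

/* memcpy variant, for comparing the generated code. */
struct alpha unpack_memcpy(uint64_t x)
{
    struct alpha r;
    memcpy(&r, &x, sizeof r);
    return r;
}

/* Inverse of the unpackers; only used to check round trips. */
uint64_t pack(struct alpha r)
{
    uint64_t x;
    memcpy(&x, &r, sizeof x);
    return x;
}

/* Two wrappers with identical bodies, as in the quoted message. */
struct alpha wrapper(uint64_t y)
{
    return unpack_union(y);
}

struct alpha wrapper2(uint64_t y)
{
    return unpack_union(y);
}
```

Compiling this with "gcc -O2 -S" and comparing the assembly emitted for
wrapper and wrapper2 is one way to check whether a given GCC version
treats the two identically.  Whether the difference shows up will depend
on the target and the compiler version.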