https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109326
--- Comment #5 from Steve Thompson <susurrus.of.qualia at gmail dot com> ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Steve Thompson from comment #3)
> > However I don't understand why olock_reset_op() is so large. It's
> > a trivial initializer for a descriptor with an array of olock_op_element
> > structures appended. There's no way it should look like what I quoted. I'd
> > be happy if I am experiencing a fever-dream over nothing due to ignorance,
> > but I am not convinced that that is the case. If I am wrong I will be very
> > disappointed.
>
> GCC unrolled the loop via vectorizing it.

OMG did it ever. It seems that I'm an idiot and must apologise for wasting
everyone's time. I fixed up some remaining support code, dug into it with
gdb, and determined that it does, in fact, work. There appear to be distinct
paths for particular array ranges, plus logic to take care of odd numbers,
sort of like memcpy handling large blocks.

But I have to say that I really don't like it, and obviously I can work
around it by making the while() block similar to what is done in
olock_init_op(). That gives me two functions with a combined text of 64 bytes
if there is no padding. Compare this to the 1.2KB of the original disassembly,
for a generous factor of 20 code expansion. That seems like a great way to
bloat code.

I realize that -Os is available, but it eliminates a bunch of supposedly
inline functions, leading to linker errors for the missing symbols. I'm not
about to try finding out why for the time being, as I don't really need it.

For fun I built a short test program and measured the latency across
olock_reset_op for various array lengths:

             1     8     16    32
64B code:
1.2K code: