https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109326

--- Comment #5 from Steve Thompson <susurrus.of.qualia at gmail dot com> ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Steve Thompson from comment #3)
> > However I don't understand why olock_reset_op() is so large.  It's
> > a trivial initializer for a descriptor with an array of olock_op_element
> > structures appended.  There's no way it should look like what I quoted.  I'd
> > be happy if I am experiencing a fever-dream over nothing due to ignorance,
> > but I am not convinced that that is the case.  If I am wrong I will be very
> > disappointed.
> 
> GCC unrolled the loop via vectorizing it.

OMG did it ever.  It seems that I'm an idiot and must apologise for wasting
everyone's time.

I fixed up some remaining support code, dug into it with gdb, and determined
that it does, in fact, work.  There appear to be distinct paths for particular
array ranges and logic to take care of odd numbers, sort of like memcpy handling
large blocks.
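
To make that shape concrete, here is a rough hand-written imitation in C of a
wide main path plus a scalar tail.  It is only a sketch: struct elem, its
single field, and the chunk width of 4 are assumptions made for illustration,
not the real olock_op_element layout or the code GCC actually generated.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for olock_op_element; the real layout is not shown
 * in this report. */
struct elem { uint32_t state; };

/* Scalar reference: what the source-level loop expresses. */
static void reset_scalar(struct elem *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        e[i].state = 0;
}

/* Imitation of the vectorized shape: one path for full chunks, one for
 * the leftovers.  The chunk width of 4 is an assumption. */
static void reset_chunked(struct elem *e, size_t n)
{
    size_t i = 0;

    /* Main path: four elements per iteration, standing in for one wide store. */
    for (; i + 4 <= n; i += 4) {
        e[i].state     = 0;
        e[i + 1].state = 0;
        e[i + 2].state = 0;
        e[i + 3].state = 0;
    }

    /* Epilogue: the zero to three elements that do not fill a chunk. */
    for (; i < n; i++)
        e[i].state = 0;
}
```

Both functions leave the array in the same state; only the amount of generated
code differs.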

But I have to say that I really don't like it, and obviously I can work around
it by making the while() block similar to what is done in olock_init_op().
That gives me two functions with a combined text of 64 bytes if there is no
padding.  Compare this to the 1.2KB of the original disassembly for a generous
factor of 20 code expansion.  That seems like a great way to bloat code.
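
A different workaround from the loop rewrite mentioned above, sketched here
only as an option, is to turn the vectorizer off for the one function.  GCC
has the -fno-tree-vectorize flag for a whole translation unit and an optimize
function attribute for a single function; the GCC manual describes that
attribute as intended for debugging rather than production use.  struct
op_slot, reset_slots() and the NO_VECTORIZE macro are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for olock_op_element. */
struct op_slot { uint32_t state; };

/* Only GCC proper understands this attribute string, so guard it. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
#define NO_VECTORIZE
#endif

/* Plain element-by-element reset; the attribute asks GCC not to
 * vectorize this one function. */
NO_VECTORIZE
static void reset_slots(struct op_slot *s, size_t n)
{
    while (n--)
        s[n].state = 0;
}
```

Whether this brings the text size back to the hand-tuned figure has not been
measured here; comparing the disassembly is the only way to know.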

I realize that -Os is available, but it eliminates a bunch of supposedly inline
functions, leading to linker errors for the missing symbols.  I'm not about to
try finding out why for the time being as I don't really need it.
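
The cause was not investigated here, so the following is only a guess at one
common culprit.  In C99 and later, a function declared plain "inline" (no
static, no extern) provides no external definition.  When the optimizer
inlines every call the program links; when it declines, as it may at -Os, the
call refers to a symbol that no object file provides.  Declaring such helpers
static inline avoids that.  op_flags_clear() is a hypothetical helper used
purely for illustration.

```c
#include <stdint.h>

/*
 * Problematic form (C99 semantics): no out-of-line copy is emitted, so a
 * call that is not inlined becomes an undefined reference at link time.
 *
 *     inline uint32_t op_flags_clear(uint32_t flags, uint32_t mask) { ... }
 *
 * Safer form: static gives each translation unit its own private copy,
 * which the compiler may inline or emit locally as it sees fit.
 */
static inline uint32_t op_flags_clear(uint32_t flags, uint32_t mask)
{
    return flags & ~mask;
}
```

The alternative is to keep plain inline in the header and add an
"extern inline" declaration of the same function in exactly one .c file,
which makes that file emit the external definition.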

For fun I built a short test program and measured the latency across
olock_reset_op for various array lengths:

          1    8   16   32
64B code:

1.2K code:
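
The test program itself is not included in this report.  A minimal harness
along the same lines might look like the sketch below; struct t_elem,
reset_elems() and ns_per_call() are hypothetical names, the C11 timespec_get()
call reads the wall clock (a monotonic clock would be preferable where
available), and the empty asm statement is a GCC-style compiler barrier.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for olock_op_element. */
struct t_elem { uint32_t state; };

/* Function under test: a plain reset loop. */
static void reset_elems(struct t_elem *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        e[i].state = 0;
}

/* Average wall-clock nanoseconds per call over `reps` calls. */
static double ns_per_call(struct t_elem *e, size_t n, long reps)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    for (long r = 0; r < reps; r++) {
        reset_elems(e, n);
        /* Compiler barrier: keeps the stores from being merged or dropped
         * across iterations. */
        __asm__ volatile("" : : "r"(e) : "memory");
    }
    timespec_get(&t1, TIME_UTC);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)reps;
}
```

Calling ns_per_call() with lengths 1, 8, 16 and 32 against each build of the
function would fill in one row of the table per build.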
