Hi,

On Fri, 16 Feb 2018, Wilco Dijkstra wrote:
> > How so?
> 
> I thought it is well-known for many years that the rtl unroller doesn't
> work properly. In practically all cases where LLVM beats GCC, it is due
> to unrolling small loops.

I thought it was because of vectorizing at -O2, not due to unrolling
itself.

> > To generate more ILP for modern out-of-order processors you need to be
> > able to do followup transforms that remove dependences. So rather than
> > inventing magic params we should look at those transforms and key
> > unrolling on them. Like we do in predictive commoning or other passes
> > that end up performing unrolling as part of their transform.
> 
> This is why unrolling needs to be done at the tree level. Alias info is
> correct, addressing modes end up more optimal and the scheduler can now
> interleave the iterations (often not possible after the rtl-unroller due
> to bad alias info).

The better alias info at the gimple level is offset by worse information
about addressing modes, register pressure and the precise machine
instructions. The tradeoff between the two approaches is not at all
obvious.

> > Our measurements on x86 concluded that unrolling isn't worth it, in
> > fact it very often hurts. That was of course with saner params than
> > the defaults of the RTL unroller.
> >
> > Often you even have to fight with followup passes doing stuff that
> > ends up increasing register pressure too much so we end up spilling.
> 
> Yes, that's why I mentioned we should only unroll small loops where there
> is always a benefit from reduced loop counter increments and branching.

With modern OoO processors the loop counter increments and branching
cost about nothing on average, due to speculation and bypassing. In
fact a loop cache might even favor small loops.

> > So _please_ first get testcases we know unrolling will be beneficial
> > on and _also_ have a thorough description _why_.
> 
> I'm sure we can find good examples. The why will be obvious just from
> instruction count.
The instruction count is only one part. If unrolling really enables
followup transformations like cross-iteration redundancy removal, then
it's worthwhile. But for those we already have passes (e.g. predictive
commoning). If the only thing unrolling does is get rid of N-1 loop
counter test-and-branches, it's not beneficial on OoO processors (if
the predictor is worth anything).

The reason we tried some unrolling measurements last year was to
establish whether unrolling is worth it or not on a normal OoO x86-64
processor (I basically wanted to have a case for implementing a generic
unroller on gimple, and replacing the RTL one (!)). The results were so
underwhelming that I dropped the plan.

> > With OoO CPUs speculatively executing the next iterations I very much
> > doubt that.
> 
> OoO execution is like really dumb loop unrolling,

Actually I think of it as a rather clever way of unrolling. The
processor has exact knowledge of (non-)dependencies, even those the
compiler can't prove absent. And if a dependency really exists (such
that an OoO CPU can't ignore it), how would you propose the compiler
magically remove it to get rid of the involved instructions? That can
usually only be done by duplicating threads-of-compute, which increases,
not decreases, the instruction count (or leaves it the same). For
instance the RTL unroller replaces IV adjustments a la:

  i = i + 1; use (i);
  i = i + 1; use (i);

with:

  i1 = i + 1; use (i1);
  i2 = i + 2; use (i2);

So the compiler was able to get rid of one dependency chain, but at the
expense of having to use a new constant and register. A cost analysis
needs to happen to be sure that this is worthwhile.

> you still have all the dependencies between iterations, all the
> branches, all the pointer increments etc. Optimizing those reduces
> instruction counts like vectorization. Fewer instructions implies
> faster code.

That's also not true in this generality. We recently had a case where
vectorization hurt very much, and it wasn't obvious why.
Our current theory (and we went through many) is that before the vector
ALU could actually start working, eight (strided) loads had to be done
per vector, for multiple streams, and all those loads eventually clogged
the pipeline. The scalar loop (on the OoO CPU) nicely intermingled ALU
and memory ops. In the end Richi got the vector loop to be only 2-3%
slower than the scalar loop (and that was a fairly major effort), even
though, looking at the instructions, it executed only 70-80% of the
original scalar dynamic instruction count.

In any case, I'm with Richi on this one: we should have some dependable
testcases that speed up with a tree unroller, reasons why this is so,
and some cost model that reflects these reasons at least roughly.

All the above is of course only true for OoO CPUs. For in-order CPUs
unrolling is helpful. But to get that, it would be good to work towards
replacing the RTL unroller with a gimple one, if need be, so as not to
have two of them.


Ciao,
Michael.

P.S.: For a very long time I also thought "hmm, a gimple unroller, that
would be nice to have; we could get rid of the RTL one". Until those
experiments last year. Since then I'm basically 180 degrees around ;-)