Hi,

On Fri, 16 Feb 2018, Wilco Dijkstra wrote:
> > How so?
> 
> I thought it is well-known for many years that the rtl unroller doesn't
> work properly. In practically all cases where LLVM beats GCC, it is due
> to unrolling small loops.

I thought it was because of vectorizing at -O2, not due to unrolling
itself.

> > To generate more ILP for modern out-of-order processors you need to be
> > able to do followup transforms that remove dependences. So rather than
> > inventing magic params we should look at those transforms and key
> > unrolling on them. Like we do in predictive commoning or other passes
> > that end up performing unrolling as part of their transform.
> 
> This is why unrolling needs to be done at the tree level. Alias info is
> correct, addressing modes end up more optimal and the scheduler can now
> interleave the iterations (often not possible after the rtl-unroller due
> to bad alias info).

The better alias info at the gimple level is offset by worse information
about addressing modes, register pressure and the precise machine
instructions. The tradeoff between the two approaches is not at all
obvious.

> > Our measurements on x86 concluded that unrolling isn't worth it, in
> > fact it very often hurts. That was of course with saner params than
> > the defaults of the RTL unroller.
> >
> > Often you even have to fight with followup passes doing stuff that
> > ends up increasing register pressure too much so we end up spilling.
> 
> Yes, that's why I mentioned we should only unroll small loops where there
> is always a benefit from reduced loop counter increments and branching.

With modern OoO processors the loop counter increments and branching
cost about nothing on average, due to speculation and bypassing. In
fact a loop cache might even favor small loops.

> > So _please_ first get testcases we know unrolling will be beneficial
> > on and _also_ have a thorough description _why_.
> 
> I'm sure we can find good examples. The why will be obvious just from
> instruction count.
The instruction count is only one part. If unrolling really enables
followup transformations like cross-iteration redundancy removal, then
it's worthwhile. But for those we already have passes (e.g. predictive
commoning). If the only thing unrolling does is get rid of N-1 loop
counter test-and-branches, it's not beneficial on OoO processors (if
the predictor is worth anything).

The reason we tried some unrolling measurements last year was to
establish whether unrolling is worth it or not on a normal OoO x86-64
processor (I basically wanted to have a case for implementing a generic
unroller on gimple, and replacing the RTL one (!)). The results were so
underwhelming that I dropped the plan.

> > With OoO CPUs speculatively executing the next iterations I very much
> > doubt that.
> 
> OoO execution is like really dumb loop unrolling,

Actually I think of it as a rather clever way of unrolling. The
processor has exact knowledge of (non-)dependencies, even those the
compiler can't prove absent. And if a dependency really exists (such
that an OoO CPU can't ignore it), how would you propose the compiler
magically remove it to get rid of the involved instructions? That can
usually only be done by duplicating threads-of-compute, which increases,
not decreases, the instruction count (or leaves it the same). For
instance the RTL unroller replaces IV adjustments a la:

  i = i + 1; use (i);
  i = i + 1; use (i);

with:

  i1 = i + 1; use (i1);
  i2 = i + 2; use (i2);

So the compiler was able to get rid of one dependency chain, but at the
expense of having to use a new constant and register. A cost analysis
needs to happen to be sure that this is worthwhile.

> you still have all the dependencies between iterations, all the
> branches, all the pointer increments etc. Optimizing those reduces
> instruction counts like vectorization. Fewer instructions implies
> faster code.

That's also not true in this generality. We recently had a case where
vectorization hurt very much, and it wasn't obvious why.
Our current theory (and we went through many) is that before the vector
ALU could actually start working, eight (strided) loads had to be done
per vector, for multiple streams, and all those loads eventually clogged
the pipeline. The scalar loop (on the OoO CPU) nicely intermingled ALU
and memory ops. In the end Richi got the vector loop to be only 2-3%
slower than the scalar loop (and that was a fairly major effort), even
though, looking at the instructions, it executed only 70-80% of the
original scalar dynamic instruction count.

In any case, I'm with Richi on this one: we should have some dependable
testcases that speed up with a tree unroller, reasons why this is so,
and some cost model that reflects these reasons at least roughly.

All the above is of course only true for OoO CPUs. For in-order CPUs
unrolling is helpful. But to get that, it would be good to work towards
replacing the RTL unroller with a gimple one, if need be, so as not to
have two of them.


Ciao,
Michael.

P.S.: For a very long time I also thought "hmm, a gimple unroller, that
would be nice to have; we could get rid of the RTL one". Until those
experiments last year. Since then I'm basically 180 degrees around ;-)