On Wed, Jul 20, 2016 at 01:23:44PM +0200, Bernd Schmidt wrote:
> >>>But you need the profile to make even reasonably good decisions.
> >>
> >>I'm not worried about making cost decisions: as far as I'm concerned
> >>it's perfectly fine for that. I'm worried about correctness - you can't
> >>validly save registers inside a loop.
> >
> >Of course you can.  It needs to be paired with a restore; and we do
> >that just fine.
> > Pretty much *all* implementations in the literature do this, fwiw.

> I, however, fail to see where this happens.

See for example one of the better papers on shrink-wrapping, "Post Register
Allocation Spill Code Optimization", by Lupo and Wilken.

See the problem definition (section 2), figure 1 for a figure clearly
showing multiple save/restore (and executed more than once). section 4.2
for why we don't need to look at loops.

[ In this paper prologue/epilogue pairs are only placed around SESE
regions, which we do not have many in GCC that late in RTL (often the
CFG isn't even reducible); there is no reason to restrict to SESE regions
though ].

> If you have references to 
> somewhere where this algorithm is described, that would be helpful, 

No, of course not, because I just made this up, as should be clear.

The problem definition is simple: we have a CFG, and some of the blocks
in that CFG need some things done by the prologue done before they
execute.  We don't want to run that prologue code more often than
necessary, because it can be expensive (compared to the parts of the
function that are executed at all).  Considering all possible combinations
of blocks (or edges) where we can place a prologue is not computationally
feasible.  But there is a shortcut: if a block X gets a prologue, all
blocks dominated by it will for zero cost have that prologue established
as well (by simply not doing an epilogue until they are reached).  So
this algo does the obvious thing, simply walking the dom tree (which is
O(n)).  Then, from the prologue placement, we compute which blocks will
execute with that concern "active"; and we insert prologue/epilogue code
to make that assignment true (a prologue or epilogue for every edge that
crosses from "does not have" to "does have", or the other way around; and
then there is the head/tail thing because cross-jumping fails to unify
many of those *logues, so we take care of it manually).

> because at this stage I think I really don't understand what you're 
> trying to achieve. The submission lacks examples.

It says what it does right at the start of the head comment:

"""
   Instead of putting all of the prologue and epilogue in one spot, we
   can put parts of it in places that are executed less frequently.  The
   following code does this, for concerns that can have more than one
   prologue and epilogue, and where those pro- and epilogues can be
   executed more than once.
"""

followed by a bunch of detail.

> So I could see things could work if you place an epilogue part in the 
> last block of a loop if the start of the loop contains a corresponding 
> part of the prologue, but taking just the comment in the code:
>    Prologue concerns are placed in such a way that they are executed as
>    infrequently as possible.  Epilogue concerns are put everywhere where
>    there is an edge from a bb dominated by such a prologue concern to a
>    bb not dominated by one.
> 
> this describes no mechanism by which such a thing would happen.

Sure it does.  The edge leaving the loop, for example.

You can have a prologue/epilogue pair within a loop (which is unusual,
but *can* happen, e.g. as part of a conditional that executes almost
never -- this is quite frequent btw, assertions, on-the-run initialisation,
etc.)

The situation you describe has all the blocks in the loop needing the
prologue (but, say, nothing outside the loop).  Then of course the prologue
is placed on the entry (edge) into the loop, and the epilogue on the exit
edge(s).

> And I 
> fail to see how moving parts of the prologue into a loop would be 
> beneficial as an optimization.

for (i = 0; i < 10; i++)
        if (less_than_one_in_ten_times)
                do_something_that_needs_a_prologue;

or

for (i = 0; i < 10; i++)
        if (whatever)
                do_something_that_needs_a_prologue_and_does_not_return;

or whatever other situation.  We do not have natural loops, often.  The
algorithm places prologues so that their dynamic execution frequency is
optimal, which results in their dynamic execution being optimal, whatever
the CFG looks like.


Segher

Reply via email to