"Bingfeng Mei" <b...@broadcom.com> wrote on 05/02/2009 12:45:20:

> Ayal,
> OOPs, your mail skipped my inbox and I missed it for several days.
>

I'm chasing after my mail as well..

> Using testsuite/gcc.dg/sms-6.c as an example and compiling it for
> PowerPC, node 18 (see attachment) is in an SCC and cannot be scheduled
> until splitting twice. The MII = 20 and a schedule can only be found
> at II = 24.

Yes, I see. This example raises a couple of issues:

o The first row split (from II=20 to II=21) is miscalculated; it should be
row 20=0 instead of 19. Splitting row 19 cannot help schedule node 18, and
indeed we immediately split another row. We're now checking a small patch
to fix this, which should save one cycle of II in the above example.

o The swinging scheduling procedure tends to create sets of densely
scheduled instructions with holes between them. For example, the set of
nodes 3, 7, 11, 15 was first scheduled adjacently, without leaving room
for node 18 (in cycles 6 (=26 mod 20), 25, 24, 23 respectively). So it may
be possible to delete some empty row from such a hole (it suffices to check
a single row from each hole) after we insert an empty row to accommodate an
instruction, thereby maintaining the II. This should save at least one
additional cycle of II in the above example.

o Transitive closure can indeed prevent the construction of infeasible
partial schedules. Note that you'd be better off with a dense incidence
representation for the DDG, as it will become a complete graph. The
transitive DDG could be pruned according to the scheduling order, as we're
only interested in adding edge u -> v if there is some w on a path u ~> w
~> v where w follows both u and v in the to-be-scheduled order; but for
that you probably need to have transitive edges u -> w and w -> v ...
Plus, leaving a single cycle for w to be scheduled between u and v may not
be enough, as other instructions or conflicts may prevent w from eventually
being scheduled in this cycle. So the split (&remove) row mechanism may
still be useful.

o Another preventive (rather than corrective) action could be to prioritize
register-dependent nodes over memory-dependent ones. Note that in the
example above, the 4 predecessors (7, 11, 15, 18) of node 3 are equally
critical, and we're missing a tie-breaking criterion. It may be better to
prioritize node 18, due to its register dependence (the general motivation
being that nearby nodes imply shorter register live ranges), over the other
nodes, which carry output memory dependences. It is not easy, however, to
extend the current infrastructure to consider such criteria.

> On our 2-issue VLIW, the MII = 10 and a valid schedule can only be found
> at II = 14. It is not great since we want to maximize performance.
>
> I had experience (in the development of another compiler) with this
> issue: constructing the MinDist matrix and using it to calculate the
> scheduling window for each instruction. However, speed was not a real
> concern then. Do you know of a better algorithm (better than O(N^3))?

No, I don't. As mentioned above, we might be able to save something by not
calculating the complete closure.

> Or do you think it is not so critical
> here? After all, not many loops are candidates for software pipelining.
> Thanks.
>

I would suggest checking the actual compile time and memory usage, and
trying to reduce them later if needed. Modulo scheduling is not a cheap
optimization in terms of compile time, in general.
Ayal.



> Cheers,
> Bingfeng
>
> > -----Original Message-----
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On
> > Behalf Of Ayal Zaks
> > Sent: 01 February 2009 23:18
> > To: Bingfeng Mei
> > Cc: Adrian Ashley; gcc@gcc.gnu.org
> > Subject: Re: Solve transitive closure issue in modulo scheduling
> >
> > "Bingfeng Mei" <b...@broadcom.com> wrote on 30/01/2009 14:44:01:
> >
> > > Hello,
> > > I try to make modulo scheduling work more efficiently for our VLIW
> > > target. I found one serious issue that prevents the current SMS
> > > algorithm from achieving high IPC: the so-called "transitive
> > > closure" problem, where the scheduling window is only calculated
> > > using direct predecessors and successors. Because SMS is not an
> > > iterative algorithm, this may cause failures in finding a valid
> > > schedule.
> >
> > Agreed.
> >
> > > Without splitting rows, some simple loops just cannot be scheduled
> > > no matter how big the II is. With splitting rows, a schedule can be
> > > found, but only at a bigger II.
> >
> > It may happen that even splitting rows will not help, e.g. when we
> > repeatedly end up with a node having a negative scheduling window.
> >
> > > GCC wiki (http://gcc.gnu.org/wiki/SwingModuloScheduling) lists this
> > > as a TODO. Is there any work going on about this issue
> >
> > No, not to my knowledge. We had some testcase where this
> > showed up, hence
> > its appearance in the TODO, but afaicr some change caused it
> > to disappear.
> >
> > > (the last wiki update
> > > was one year ago)? If no one is working on it, I plan to do it. My
> > > idea is to use the MinDist algorithm described in B. Rau's classic
> > > paper "Iterative modulo scheduling"
> > > (http://www.hpl.hp.com/techreports/94/HPL-94-115.html). The same
> > > algorithm can also be used to compute a better RecMII. The biggest
> > > concern is the complexity of computing the MinDist matrix, which is
> > > O(N^3), where N is the number of nodes in the loop. I remember the
> > > GCC coding guide somewhere says "never write quadratic algorithms"
> > > :-) Is this an absolute requirement? If yes, I will keep it as our
> > > target-specific code (we are less concerned about compilation
> > > time). Otherwise, I will try to make it more generic, to see if it
> > > can make it into mainline in 4.5. Any comments?
> > >
> >
> > The problem appears only when the DDG is cyclic, and for
> > every cycle in the
> > DDG, the problem may arise when trying to schedule the last
> > node of the
> > cycle, which has both predecessors and successors already
> > scheduled. So you
> > might try to connect only each such predecessor to every such
> > successor
> > with a transitive arc, to ensure that this last node will
> > have a non-empty
> > scheduling window. But this might not suffice; you may
> > eventually need to
> > wire (nearly) every pair of nodes in the strongly connected
> > component. If
> > this happens, you'd be better off with a dense graph representation
> > (incidence matrix) than the current sparse one (adjacency lists).
> >
> > An example would help see things more clearly. If you have a
> > (small :) DDG
> > demonstrating the need for transitive arcs, I'd be happy to
> > have a look and
> > advise what should be done.
> >
> > Ayal.
> >
> > > Cheers,
> > > Bingfeng Mei
> > >
> > > Broadcom UK
> > >
> > >
> >
> >
> > [attachment "sms-6.c.169r.sms" deleted by Ayal Zaks/Haifa/IBM]
