Improvements of the haifa scheduler

Maxim Kuvyrkov Sun, 04 Mar 2007 01:53:58 -0800

Hi.

I want to share some of my thoughts and doings on improving / cleaning
up current GCC instruction scheduler (Haifa) - most of them are just
small obvious improvements.


I have semi-ready patches for about a half of them and would appreciate
any early suggestion or comments on the following draft plan:

1. Remove compute_forward_dependencies ().  [Done]

Since new representation of dependencies lists was checked in, we don't
longer need to compute forward dependencies separately.  It would be
natural to add forward links at the same time as we generate backward ones.

2. Use alloc_pools instead of obstacks for dep_nodes and deps_lists.
[In progress]

As pointed out by Jan Hubicka scheduler peaks +100M on a testcase for
PR28071 after my patch for new dependencies lists was checked in.
Though alloc_pools should have been my first choice while writing that
patch, I decided to mimic as close as possible the original rtx
instruction lists with their scheme of deallocation at the very end.  So
the next step would be to define proper lifetime of dependency lists
and use alloc_pools to reuse nodes and lists from the previous regions.

Which brings us to ...

3. Define clear interface for manipulating dependencies.  [In progress]

This one popped up when I began to debug <2> and understood that the
scheduler uses and changes dependencies lists the way it shouldn't.
Lists are being copied, edited and deleted directly without interaction
with sched-deps.c .  What the scheduler really needs is the following
set of primitives:
  o FOR_EACH_DEP (insn, which_list, iterator, dep) - walk through
insn's which_list (one of {backward, resolved_backward, forward,
resolved_forward}) and provide the user with the dep.  Ayal Zaks
suggested this type of macro weeks ago but at that time I didn't agree
with him.
  o dep_t find_dep_between (producer, consumer) - find dependency
between two instructions.  Currently we walk through the list looking
for what we need.  A better way would be to first check dependency
caches and then, if we can't determine that there is no dependency, walk
through the shorter list given the choice of two: producer's forward
list and consumer's backward list.
  o void add_dep (dep_t) - Add a dependency.
  o void remove_dep (iterator) - Remove dependency pointed out by iterator.
  o void resolve_dep (iterator) - Resolve dependency pointed out by
iterator.
  o int get_list_size (insn, which_list) - Get the size of the insn's
which_list.
  o bool list_has_only_speculative_deps (insn, which_list) - Return
true if all insn's dependencies can be overcome with some sort of
speculation.
  o void {create, delete}_dependency_lists (insn) - Create / delete
dependency lists for insn.

As you can see, the scheduler doesn't need to know the internal
representation of the deps_list / dep_node.

4. Support speculative loads into subregs.  [Planned]

As noted by Jim Wilson current support for ia64 speculation doesn't
handle subregs though that would be easy to fix.

5. Make sched-deps.c mark only those dependencies as speculative which
can actually be overcame with speculation types currently in use.  [Planned]

At the moment we first generate speculative dependencies and only at the
moment of adding instruction to the ready list we check if we can (or it
worth to) overcome every of its dependencies.

6. Make ds_t a structure.  [Planned]

ds_t is type for representing status of a dependency.  It contains
information about types of the dependency (true, output, anti) and
probabilities of speculation success (begin_data, be_in_data,
begin_control, be_in_control) - that makes three bits and for integers
coded in a single int.  Historical reasons forced this inelegant
approach but now the reasons are inexistent and the problem can be
solved in a natural way.

7. Use cse_lib in sched-rgn.c .  [In progress]

At the moment cse_lib works to improve alias analysis only during
sched-ebb scheduling.  It is trivial that we can also enable it when
scheduling single block regions in sched-rgn.  The patch for this is
a one-liner which was tested on a bootstrap but not on SPECs.

It is also possible to use cse_lib on sequential basic_blocks of the
region thus handling them as an extended basic block.

If it is possible to save cse_lib states, then we'll be able to process
trees and merging capabilities are required for DAGs.  Don't know if
this can be done.

8. Don't generate a memory barrier on simplejump.  [Done]

sched-deps.c handles every jump in the scheduling region as a memory
barrier - e.g. almost no memory operation can be moved through it.  But
unconditional jumps don't really need such restrictions.  A one-liner
patch for this was tested on the bootstrap but not on SPECs.

9. Use sched-ebb on other architectures.  [Done]

After patches for ia64 speculation and follow up fixes to them sched-ebb
no longer corrupts CFG and can safely be used as not final pass on
platforms other than ia64.  I successfully bootstrapped (and, probably,
regtested) the compiler on x86_64 with no cfg rebuild after sched-ebb
and a change in common.opt to run sched-ebb scheduler by default for
sched2 pass.  I suppose powerpc might benefit from ebb scheduling,
though don't have a SPEC box to run tests on.

10. Leveled priority.  [Done]

Haifa scheduler's main heuristic is the length of critical path.  It is
computed just before the scheduling for the whole region and this
computation has (as it looks to me) a severe drawback:
  o It is good for single block regions only.

Let's consider a diamond region if-then-else-endif:

When scheduling 'if' ready instruction from 'if', 'then', 'else' and
'endif' look [almost] equally good to the scheduler, but
instructions from only 'if' and 'endif' should.

The problem can become worse on such architectures as arm: lets say 'if'
is a small block with 3 cycle of critical path length and 'then'
is an unlikely branch with several ready instructions and a jump in the
end.  Guess what, all the ready instructions from 'then' will be moved
to 'if'.  This happens because jump on arm has 16 cycle latency and
therefore all the instructions in 'then' will get priorities starting
from 16 which is surely greater then the maximal priority of 3 for
instructions from 'if'.  There must be PR with the testcase in bugzilla
but now I can't find it.

There also are several less significant quirks in computation and use of
the priorities.

So, why leveled priority?  Because there will be 3 levels of priority:

  o priority2 - the priority computed much like it is now *but* with
respect to the probability of the execution of the consumer.

priority2(producer) = max ((priority2(consumer) + dep_cost (producer,
consumer)) * relative_probability (producer_block, consumer_block));

  o priority1 - priority2 after adjustment by target via
adjust_priority() hook.  This is done at the moment of adding
instruction to the ready list.

priority1(insn) = targetm.sched.adjust_priority(insn, priority2);

  o priority - priority actually used by scheduler.

priority(insn) = priority1(insn) * relative_probability (target_block,
insn_block);

I have a patch that implements this scheme for sched-ebb.  Make it
work for sched-rgn is not a problem - I just plan to make it first
available for ia64 sched-ebb and only after public testing for commonly
used sched-rgn as well.

11. Implement control speculation in sched-ebb.  [In progress]

Few months ago I came across a spot in sched-deps.c where dependencies
between jumps and memory operations are added.  Before I found that spot
I thought that these dependencies are added in sched-ebb.c:
add_deps_for_risky_insns () and considered it as a source for ia64
control speculation during sched2.  Surprisingly
add_deps_for_risky_insns () rarely added dependencies and none of those
become control speculative.  Now I see what went wrong.

I have a semi-ready patch for this and plan to finish it as soon as I'm
done with <2> and <3>.

12. Initialize scheduler data structures for each region separately and
index them by luid.  [Done]

This will make a smaller memory footprint and more clear scheduler
initialization infrastructure.

A variant on this work will soon be checked in on sel-sched-branch.

13. Generate regions from extended basic blocks in sched-rgn.c  [Planned]

sched-rgn.c already has all the necessary infrastructure to handle
extended basic block.  The missing part is in formation of the regions
which creates a ebb for each basic block.  Also we need to teach
sched-rgn of few optimization sched-ebb capable of (e.g. moving a jump
through partially dead instructions).  After this work is done we'll
have a choice of three types of regions:
  o single block regions
  o extended basic block regions
  o DAG of ebbs

14. Some sort of must alias analysis for scheduler.  [Planned]

At the moment ia64 data speculation is used very conservatively - only
when we can't issue anything else on the cycle.  Data speculation needs
an alias analysis that can infer the probability of two memory locations
to alias.  It doesn't need to be 100% correct as the recovery code will
be generated in any case, but on the other hand it should be capable of
finding trivial must alias relations.  Current implementation of data

speculation suffers most when a pointer 'p1' is copied to 'p2' and thenthe program writes to 'p1' and reads from 'p2'. The speculative loadsfrom 'p2' will always fail and introduce a misspeculation stall of about10 - 20 cycles. The starting point for this work will be an aliasingpatch by Sanjiv Kumar Gupta.


---

Any comments, suggestions or 'please don't do that' will be greatly
appreciated.

I should also mention that I do all these works in my spare time which
is not a lot, thus the above is my plan for about a half of the year.


Thanks,

Maxim

Improvements of the haifa scheduler

Reply via email to