On Mon, Oct 10, 2011 at 1:57 PM, Richard Sandiford
<[email protected]> wrote:
> Ayal Zaks <[email protected]> writes:
>>> I agree it's natural to schedule moves for intra-iteration dependencies
>>> in the normal get_sched_window way. But suppose we have a dependency:
>>>
>>> A --(T,N,1)--> B
>>>
>>> that requires two moves M1 and M2. If we think in terms of cycles
>>> (in the SCHED_TIME sense), then this effectively becomes:
>>>
>>> A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>
>>> because it is now M1 that is fed by both the loop and the incoming edge.
>>> But if there is a second dependency:
>>>
>>> A --(T,M,0)--> C
>>>
>>> that also requires two moves, we end up with:
>>>
>>> A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>> -->(T,M3,-1)--> B
>>>
>>> and dependence distances of -1 feel a bit weird. :-) Of course,
>>> what we really have are two parallel dependencies:
>>>
>>> A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>
>>> A --(T,M1,0)--> M1' -->(T,M2,0)--> M2' -->(T,N3,0)--> B
>>>
>>> where M1' and M2' occupy the same position as M1 and M2 in the schedule,
>>> but are one stage further along. But we only schedule them once,
>>> so if we take the cycle/SCHED_TIME route, we have to introduce
>>> dependencies of distance -1.
>>>
>>
>> Interesting; had to digest this distance 1 business, a result of
>> thinking in cycles instead of rows (or conversely), and mixing
>> dependences with scheduling; here's my understanding, based on your
>> explanations:
>>
>> Suppose a Use is truely dependent on a Def, where both have been
>> scheduled at some absolute cycles; think of them as timing the first
>> iteration of the loop.
>> Assume first that Use appears originally after Def in the original
>> instruction sequence of the loop (dependence distance 0). In this
>> case, Use requires register moves if its distance D from Def according
>> to the schedule is more than ii cycles long -- by the time Use is
>> executed, the value it needs is no longer available in the def'd
>> register due to intervening occurrences of Def. So in this case, the
>> first reg-move (among D/ii) should appear after Def, recording its
>> value before the next occurrence of Def overwrites it, and feeding
>> subsequent moves as needed before each is overwritten. Thus the
>> scheduling window of this first reg-move is within (Def, Def+ii).
>>
>> Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
>> it remains scheduled before the Def, no reg-move is needed. If, otoh,
>> Def is scheduled (D>0 cycles) before Use, breaking the anti-dependence
>> between them, (D/ii + 1) reg-moves are needed in order to feed Use
>> with the live-in value before Def. So in this case, the first reg-move
>> should appear before Def (again feeding subsequent moves as needed
>> before each is overwritten). Thus the scheduling window of this first
>> reg-move is within (Def-ii, Def).
>>
>> In any case, each use requires a specific number of reg-moves, which
>> begin either before or after the def; it is always safe (though
>> potentially redundant) to start reg-moving before the def -- uses that
>> do not need the live-in value will ignore it by feeding from a later
>> reg-move.
>
> Right. The distance on the Def->Use dependency is effectively transferred
> to the dependency between the Def and first move.
>
> And we can potentially have both kinds of Use at the same time.
> We then end up scheduling two moves, one in each window, but require
> them to occupy the same row and column. It feels more convenient to
> schedule the earlier of the two moves.
>
>> Q: if we generate reg-moves starting from before the def, would
>> redundant reg-moves be eliminated if no use needs access to live-in
>> value? If so, would that simplify their generation? (If not, it may be
>> interesting to understand how to improve DCE to catch it.)
>
> To be clear, the new version of the patch (unlike the old one)
> doesn't emit reg-moves before the def if all uses are distance 0.
> We only do it where some or all uses are distance 1. The first move
> before the def is always needed in that case.
>
Understood. The question was whether it would simplify things if we
always emit reg-moves before the def. And rely on DCE to hopefully
eliminate redundancies.
> So redudant moves are confined to the case where (a) we have more
> than one move, (b) we have both distance 0 and distance 1 uses, and
> (c) one of the two distance sets requires more moves than the other.
> If the distance 0 dependencies require more moves than the distance
> 1 dependencies, we will have a redudant move in the prologue.
> If the distance 1 dependencies require more moves than the distance
> 0 dependencies, we will have a redudant move in the epilogue.
> But I believe that this is already handled by dce.
>
> (For avoidance of doubt, the new code is more accurate than
> current trunk in this respect.)
>
>> The issue of assigning stages to reg-moves is mostly relevant for
>> prolog and epilog generation, which requires and receives special
>> attention -- handled very nicely by ps_num_consecutive_stages! Note
>> that currently a simple boolean indicator for (the exceptional case
>> of) double stages would suffice, instead of generalizing to arbitrary
>> nums of consecutive stages (see any potential use for them?).
>
> Not in the immediate term. But I think having a boolean indicator
> would be inconsistent. If the distance field is an int (even though
> we only expect distance-0 and distance-1 register dependencies)
> then I think the number of stages should be too.
>
> I did wonder originally about using a boolean, but IMO, it makes
> the code less readable rather than more. Instead of a simple
> range check like:
>
> if (first_stage_for_insn <= last_stage_in_range
> && last_stage_for_insn >= first_stage_in_range)
>
> we end up with the equivalent of:
>
> if (first_stage_for_insn <= last_stage_in_range
> && (double_stage_move_p (...)
> ? first_stage_for_insn + 1 >= first_stage_in_range
> : first_stage_for_insn >= first_stage_in_range))
>
> with no corresponding simplification elsewhere.
>
Sure. But setting the range can be done by consulting an simple
indicator, rather than generalizing to arbitrary stage numbers; e.g.:
+ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
+{
+ if (id >= ps->g->num_nodes && ps_reg_move (ps, id)->double_stages)
+ return 2;
+ else
+ return 1;
+}
or
- last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
+ if (...double_stages) last_u = first_u + 1;
+ else last_u = first_u;
>>> However...
>>>
>>>> Also note that if moves are assigned absolute cycles, it becomes clear
>>>> to which stage each belongs (just like any other instruction),
>>>> regulating their generation in prolog and epilog.
>>>
>>> ...I have to concede that it makes the stage calculation easier,
>>> and that that tips the balance. (Although again, a move can belong
>>> to two stages simultanouesly.)
>>>
>>> Anyway, here's an updated patch that uses cycle times. I've also
>>> dropped the code that tried to allow windows to start and end on
>>> the same row (and therefore schedule "either side" of that row).
>>> I thought it might help with some VLIWy DFAs, but it was always
>>> going to be of minor benefit, and we don't try that elsewhere,
>>> so let's avoid the complication.
>>>
>>
>> ok. Such windows seem rare, as they imply zero move (or def->move)
>> latencies, iinm. Leaving a note behind would be good.
>
> OK. Since this same restriction applies to the amin scheduling
> window code, I suppose the natural place would be in the algorithm
> description itself:
>
> /* The SMS scheduling algorithm itself
> -----------------------------------
> Input: 'O' an ordered list of insns of a loop.
> Output: A scheduling of the loop - kernel, prolog, and epilogue.
> ...
> 42. compute epilogue & prologue
> 43. finish - succeeded to schedule
> */
>
> E.g. adding something like this at the end:
>
> ??? The algorithm restricts the scheduling window to II cycles.
> In rare cases, it may be better to allow windows of II+1 cycles.
> The window would then start and end on the same row, but with
> different "must precede" and "must follow" requirements.
>
> Let me know what you think and I'll add it as a follow-on patch.
>
great, thanks.
>>> Bootstrapped & regression-tested on powerpc64-linux-gnu with
>>> -fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default.
>>> Also retested on the ARM benchmarks. OK to install?
>>>
>>
>> Yes, OK.
>> Some points to consider raised above, and some comments added below.
>
> Thanks, applied with the changes below.
>
>>> @@ -466,21 +506,167 @@ print_node_sched_params (FILE *file, int
>>> for (i = 0; i < num_nodes; i++)
>>> {
>>> node_sched_params_ptr nsp = SCHED_PARAMS (i);
>>> - int j;
>>>
>>> fprintf (file, "Node = %d; INSN = %d\n", i,
>>> INSN_UID (ps_rtl_insn (ps, i)));
>>> fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>>> fprintf (file, " time = %d:\n", nsp->time);
>>> - fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
>>> - for (j = 0; j < nsp->nreg_moves; j++)
>>> - {
>>> - ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move +
>>> j);
>>
>> Is another way provided to print a summary of the regmoves? Or does
>> the info dumped while scheduling each one suffice for practical
>> tracing and debugging.
>
> Yeah, the new info should be more complete. It gives the pattern
> (which is what we dump here), but also lists the producers and
> consumers of each move (which we don't print here).
>
>>> + fprintf (file, " stage = %d:\n", nsp->stage);
>>> + }
>>> +}
>>> +
>>> +/* Set SCHED_COLUMN for each instruction in row ROW of PS. */
>>> +static void
>>> +set_columns_for_row (partial_schedule_ptr ps, int row)
>>> +{
>>> + ps_insn_ptr cur_insn;
>>> + int column;
>>> +
>>> + column = 0;
>>> + for (cur_insn = ps->rows[row]; cur_insn; cur_insn =
>>> cur_insn->next_in_row)
>>> + SCHED_COLUMN (cur_insn->id) = column++;
>>> +}
>>> +
>>> +/* Set SCHED_COLUMN for each instruction in PS. */
>>> +static void
>>> +set_columns_for_ps (partial_schedule_ptr ps)
>>> +{
>>> + int row;
>>> +
>>> + for (row = 0; row < ps->ii; row++)
>>> + set_columns_for_row (ps, row);
>>> +}
>>> +
>>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>>> + Its predecessors and successors have already been scheduled.
>>
>> well, except for some (preceeding or succeeding?) moves that have not
>> yet been scheduled, right?
>
> Ah, yeah, fixed to:
>
> /* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> Its single predecessor has already been scheduled, as has its
> ddg node successors. (The move may have also another move as its
> successor, in which case that successor will be scheduled later.)
>
perfect
>>> +
>>> + The move is part of a chain that satisfies register dependencies
>>> + between a producing ddg node and various consuming ddg nodes.
>>> + If some of these dependencies cross a loop iteration (that is,
>>> + have a distance of 1) then DISTANCE1_USES is nonnull and contains
>>> + the set of uses with distance-1 dependencies. DISTANCE1_USES
>>> + is null otherwise.
>>> +
>>
>> Maybe clarify that they are upwards-exposed or live-in uses.
>
> OK, changed to:
>
> The move is part of a chain that satisfies register dependencies
> between a producing ddg node and various consuming ddg nodes.
> If some of these dependencies have a distance of 1 (meaning that
> the use is upward-exposoed) then DISTANCE1_USES is nonnull and
exposed (typo)
> contains the set of uses with distance-1 dependencies.
> DISTANCE1_USES is null otherwise.
>
>>> + MUST_FOLLOW is a scratch bitmap that is big enough to hold
>>> + all current ps_insn ids.
>>> +
>>> + Return true on success. */
>>> +static bool
>>> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
>>> + sbitmap distance1_uses, sbitmap must_follow)
>>> +{
>>> + unsigned int u;
>>> + int this_time, this_distance, this_start, this_end, this_latency;
>>> + int start, end, c, ii;
>>> + sbitmap_iterator sbi;
>>> + ps_reg_move_info *move;
>>> + rtx this_insn;
>>> + ps_insn_ptr psi;
>>> +
>>> + move = ps_reg_move (ps, i_reg_move);
>>> + ii = ps->ii;
>>> + if (dump_file)
>>> + {
>>> + fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
>>> + ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
>>> + PS_MIN_CYCLE (ps));
>>> + print_rtl_single (dump_file, move->insn);
>>> + fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
>>> + fprintf (dump_file, "=========== =========== =====\n");
>>> + }
>>> +
>>> + start = INT_MIN;
>>> + end = INT_MAX;
>>> +
>>> + /* For dependencies of distance 1 between a producer ddg node A
>>> + and consumer ddg node B, we have a chain of dependencies:
>>> +
>>> + A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>>> +
>>> + where Mi is the ith move. For dependencies of distance 0 between
>>> + a producer ddg node A and consumer ddg node C, we have a chain of
>>> + dependencies:
>>> +
>>> + A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>>> +
>>> + where Mi' occupies the same position as Mi but occurs a stage later.
>>> + We can only schedule each move once, so if we have both types of
>>> + chain, we model the second as:
>>> +
>>> + A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>>> +
>>> + First handle the dependencies between the previously-scheduled
>>> + predecessor and the move. */
>>> + this_insn = ps_rtl_insn (ps, move->def);
>>> + this_latency = insn_latency (this_insn, move->insn);
>>> + this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
>>> + this_time = SCHED_TIME (move->def) - this_distance * ii;
>>> + this_start = this_time + this_latency;
>>> + this_end = this_time + ii;
>>> + if (dump_file)
>>> + fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>>> + this_start, this_end, SCHED_TIME (move->def),
>>> + INSN_UID (this_insn), this_latency, this_distance,
>>> + INSN_UID (move->insn));
>>> +
>>> + if (start < this_start)
>>
>> redundant; start=INT_MIN is surely < this_start.
>>
>>> + start = this_start;
>>> + if (end > this_end)
>>
>> redundant; end=INT_MAX is surely > this_end.
>
> I did this deliberately because the order in which we apply the limits
> isn't important. Making them assymetrical by applying this kind of
> micro-optimisation is IMO a bad idea. The compiler will do it for us.
>
> E.g. I originally had an extra limit to make sure that we didn't add
> new stages. If we applied the optimisation by hand, and then someone
> added a new limit like that later, they'd have to be careful about
> where they put it.
>
ok, np.
>>> + end = this_end;
>>> +
>>> + /* Handle the dependencies between the move and previously-scheduled
>>> + successors. */
>>
>> (maybe assert they have indeed all been scheduled)
>
> Hmm, I don't think that'd be useful. We've already read the schedule
> time (and row and column) by this point, and we don't assert the same
> thing elsewhere (in the code that assigns moves to uses, or in
> get_sched_window itself).
>
(ok; only a thought, hence the parentheses)
>>> + EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
>>> + {
>>> + this_insn = ps_rtl_insn (ps, u);
>>> + this_latency = insn_latency (move->insn, this_insn);
>>> + if (distance1_uses && !TEST_BIT (distance1_uses, u))
>>
>> shouldn't this be if (distance1_uses && TEST_BIT (distance1_uses, u))?
>>
>>> + this_distance = -1;
>>> + else
>>> + this_distance = 0;
>
> No, this condition corresponds to:
>
> /* For dependencies of distance 1 between a producer ddg node A
> and consumer ddg node B, we have a chain of dependencies:
>
> A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>
> where Mi is the ith move. For dependencies of distance 0 between
> a producer ddg node A and consumer ddg node C, we have a chain of
> dependencies:
>
> A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>
> where Mi' occupies the same position as Mi but occurs a stage later.
> We can only schedule each move once, so if we have both types of
> chain, we model the second as:
>
> A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>
> First handle the dependencies between the previously-scheduled
> predecessor and the move. */
>
> i.e. we want a distance of -1 when (a) the original definition has uses
> with both distance 0 and distance 1 and (b) the particular use we're
> looking at has distance 0.
>
ah, right
>>> + this_time = SCHED_TIME (u) + this_distance * ii;
>>> + this_start = this_time - ps->ii;
>>
>> use ii instead of ps->ii
>
> Oops, fixed.
>
>>> + this_end = this_time - this_latency;
>>> + if (dump_file)
>>> + fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>>> + this_start, this_end, SCHED_TIME (u), INSN_UID
>>> (move->insn),
>>> + this_latency, this_distance, INSN_UID (this_insn));
>>> +
>>> + if (start < this_start)
>>> + start = this_start;
>>> + if (end > this_end)
>>> + end = this_end;
>>> + }
>>> +
>>> + if (dump_file)
>>> + {
>>> + fprintf (dump_file, "----------- ----------- -----\n");
>>> + fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max,
>>> min)");
>>> + }
>>>
>>> - fprintf (file, " reg_move = ");
>>> - print_rtl_single (file, move->insn);
>>> + sbitmap_zero (must_follow);
>>> + SET_BIT (must_follow, move->def);
>>> +
>>> + start = MAX (start, end - (ii - 1));
>>
>> ok, so this excludes considering both ends of a possible ii-cycled
>> window. Such a window would imply zero move (or def->move) latencies
>> iinm, and more care in setting must_follow/must_precede.
>
> Right.
>
>>> + for (c = end; c >= start; c--)
>>> + {
>>> + psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
>>> + move->uses, must_follow);
>>
>> ok - def must_follow the move, if both are scheduled on same row (as
>> we potentially start from this (end) row moving backwards), but uses
>> should precede the move, assuming that if move and use are on same row
>> the sched distance between them is ii (i.e., non-zero move->use
>> latency)
>
> Right.
>
>>> @@ -1513,6 +1651,19 @@ sms_schedule (void)
>>> continue;
>>> }
>>>
>>> + /* Moves that handle incoming values might have been added
>>> + to a new first stage. It's too late to try any kind of
>>> + rotation, so just bump the stage count. */
>>
>> hmm, one could still apply the rotation optimization at this time if
>> desired, no?
>
> Hmm, maybe :-) I changed the comment to:
>
> /* Moves that handle incoming values might have been added
> to a new first stage. Bump the stage count if so.
>
> ??? Perhaps we could consider rotating the schedule here
> instead? */
>
Great.
Thanks,
Ayal.
> Richard
>
>
> gcc/
> * modulo-sched.c (ps_reg_move_info): Add num_consecutive_stages.
> (SCHED_FIRST_REG_MOVE, SCHED_NREG_MOVES): Delete.
> (node_sched_params): Remove first_reg_move and nreg_moves.
> (ps_num_consecutive_stages, extend_node_sched_params): New functions.
> (update_node_sched_params): Move up file.
> (print_node_sched_params): Print the stage. Don't dump info related
> to first_reg_move and nreg_moves.
> (set_columns_for_row): New function.
> (set_columns_for_ps): Move up file and use set_columns_for_row.
> (schedule_reg_move): New function.
> (schedule_reg_moves): Call extend_node_sched_params and
> schedule_reg_move. Extend size of uses bitmap. Initialize
> num_consecutive_stages. Return false if a move could not be
> scheduled.
> (apply_reg_moves): Don't emit moves here.
> (permute_partial_schedule): Handle register moves.
> (duplicate_insns_of_cycles): Remove for_prolog. Emit moves according
> to the same stage-count test as ddg nodes.
> (generate_prolog_epilog): Update calls accordingly.
> (sms_schedule): Allow move-scheduling to add a new first stage.
>
> Index: gcc/modulo-sched.c
> ===================================================================
> --- gcc/modulo-sched.c 2011-10-10 11:03:23.278273801 +0100
> +++ gcc/modulo-sched.c 2011-10-10 12:24:44.539932692 +0100
> @@ -153,6 +153,9 @@ struct ps_reg_move_info
> rtx old_reg;
> rtx new_reg;
>
> + /* The number of consecutive stages that the move occupies. */
> + int num_consecutive_stages;
> +
> /* An instruction that sets NEW_REG to the correct value. The first
> move associated with DEF will have an rhs of OLD_REG; later moves
> use the result of the previous move. */
> @@ -218,8 +221,6 @@ static partial_schedule_ptr sms_schedule
> static void permute_partial_schedule (partial_schedule_ptr, rtx);
> static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
> rtx, rtx);
> -static void duplicate_insns_of_cycles (partial_schedule_ptr,
> - int, int, int, rtx);
> static int calculate_stage_count (partial_schedule_ptr, int);
> static void calculate_must_precede_follow (ddg_node_ptr, int, int,
> int, int, sbitmap, sbitmap,
> sbitmap);
> @@ -233,8 +234,6 @@ #define NODE_ASAP(node) ((node)->aux.cou
>
> #define SCHED_PARAMS(x) VEC_index (node_sched_params, node_sched_param_vec,
> x)
> #define SCHED_TIME(x) (SCHED_PARAMS (x)->time)
> -#define SCHED_FIRST_REG_MOVE(x) (SCHED_PARAMS (x)->first_reg_move)
> -#define SCHED_NREG_MOVES(x) (SCHED_PARAMS (x)->nreg_moves)
> #define SCHED_ROW(x) (SCHED_PARAMS (x)->row)
> #define SCHED_STAGE(x) (SCHED_PARAMS (x)->stage)
> #define SCHED_COLUMN(x) (SCHED_PARAMS (x)->column)
> @@ -244,15 +243,6 @@ typedef struct node_sched_params
> {
> int time; /* The absolute scheduling cycle. */
>
> - /* The following field (first_reg_move) is the ps_insn id of the first
> - register-move instruction added to handle the modulo-variable-expansion
> - of the register defined by this node. This register-move copies the
> - original register defined by the node. */
> - int first_reg_move;
> -
> - /* The number of register-move instructions added. */
> - int nreg_moves;
> -
> int row; /* Holds time % ii. */
> int stage; /* Holds time / ii. */
>
> @@ -344,6 +334,17 @@ ps_first_note (partial_schedule_ptr ps,
> return ps->g->nodes[id].first_note;
> }
>
> +/* Return the number of consecutive stages that are occupied by
> + partial schedule instruction ID in PS. */
> +static int
> +ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
> +{
> + if (id < ps->g->num_nodes)
> + return 1;
> + else
> + return ps_reg_move (ps, id)->num_consecutive_stages;
> +}
> +
> /* Given HEAD and TAIL which are the first and last insns in a loop;
> return the register which controls the loop. Return zero if it has
> more than one occurrence in the loop besides the control part or the
> @@ -456,6 +457,45 @@ set_node_sched_params (ddg_ptr g)
> node_sched_param_vec, g->num_nodes);
> }
>
> +/* Make sure that node_sched_param_vec has an entry for every move in PS. */
> +static void
> +extend_node_sched_params (partial_schedule_ptr ps)
> +{
> + VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
> + ps->g->num_nodes + VEC_length (ps_reg_move_info,
> + ps->reg_moves));
> +}
> +
> +/* Update the sched_params (time, row and stage) for node U using the II,
> + the CYCLE of U and MIN_CYCLE.
> + We're not simply taking the following
> + SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> + because the stages may not be aligned on cycle 0. */
> +static void
> +update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> +{
> + int sc_until_cycle_zero;
> + int stage;
> +
> + SCHED_TIME (u) = cycle;
> + SCHED_ROW (u) = SMODULO (cycle, ii);
> +
> + /* The calculation of stage count is done adding the number
> + of stages before cycle zero and after cycle zero. */
> + sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> +
> + if (SCHED_TIME (u) < 0)
> + {
> + stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> + SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> + }
> + else
> + {
> + stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> + SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> + }
> +}
> +
> static void
> print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
> {
> @@ -466,21 +506,169 @@ print_node_sched_params (FILE *file, int
> for (i = 0; i < num_nodes; i++)
> {
> node_sched_params_ptr nsp = SCHED_PARAMS (i);
> - int j;
>
> fprintf (file, "Node = %d; INSN = %d\n", i,
> INSN_UID (ps_rtl_insn (ps, i)));
> fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
> fprintf (file, " time = %d:\n", nsp->time);
> - fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> - for (j = 0; j < nsp->nreg_moves; j++)
> - {
> - ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
> + fprintf (file, " stage = %d:\n", nsp->stage);
> + }
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in row ROW of PS. */
> +static void
> +set_columns_for_row (partial_schedule_ptr ps, int row)
> +{
> + ps_insn_ptr cur_insn;
> + int column;
> +
> + column = 0;
> + for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
> + SCHED_COLUMN (cur_insn->id) = column++;
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in PS. */
> +static void
> +set_columns_for_ps (partial_schedule_ptr ps)
> +{
> + int row;
> +
> + for (row = 0; row < ps->ii; row++)
> + set_columns_for_row (ps, row);
> +}
> +
> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> + Its single predecessor has already been scheduled, as has its
> + ddg node successors. (The move may have also another move as its
> + successor, in which case that successor will be scheduled later.)
>
> - fprintf (file, " reg_move = ");
> - print_rtl_single (file, move->insn);
> + The move is part of a chain that satisfies register dependencies
> + between a producing ddg node and various consuming ddg nodes.
> + If some of these dependencies have a distance of 1 (meaning that
> + the use is upward-exposoed) then DISTANCE1_USES is nonnull and
> + contains the set of uses with distance-1 dependencies.
> + DISTANCE1_USES is null otherwise.
> +
> + MUST_FOLLOW is a scratch bitmap that is big enough to hold
> + all current ps_insn ids.
> +
> + Return true on success. */
> +static bool
> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
> + sbitmap distance1_uses, sbitmap must_follow)
> +{
> + unsigned int u;
> + int this_time, this_distance, this_start, this_end, this_latency;
> + int start, end, c, ii;
> + sbitmap_iterator sbi;
> + ps_reg_move_info *move;
> + rtx this_insn;
> + ps_insn_ptr psi;
> +
> + move = ps_reg_move (ps, i_reg_move);
> + ii = ps->ii;
> + if (dump_file)
> + {
> + fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
> + ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
> + PS_MIN_CYCLE (ps));
> + print_rtl_single (dump_file, move->insn);
> + fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
> + fprintf (dump_file, "=========== =========== =====\n");
> + }
> +
> + start = INT_MIN;
> + end = INT_MAX;
> +
> + /* For dependencies of distance 1 between a producer ddg node A
> + and consumer ddg node B, we have a chain of dependencies:
> +
> + A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
> +
> + where Mi is the ith move. For dependencies of distance 0 between
> + a producer ddg node A and consumer ddg node C, we have a chain of
> + dependencies:
> +
> + A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
> +
> + where Mi' occupies the same position as Mi but occurs a stage later.
> + We can only schedule each move once, so if we have both types of
> + chain, we model the second as:
> +
> + A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
> +
> + First handle the dependencies between the previously-scheduled
> + predecessor and the move. */
> + this_insn = ps_rtl_insn (ps, move->def);
> + this_latency = insn_latency (this_insn, move->insn);
> + this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
> + this_time = SCHED_TIME (move->def) - this_distance * ii;
> + this_start = this_time + this_latency;
> + this_end = this_time + ii;
> + if (dump_file)
> + fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> + this_start, this_end, SCHED_TIME (move->def),
> + INSN_UID (this_insn), this_latency, this_distance,
> + INSN_UID (move->insn));
> +
> + if (start < this_start)
> + start = this_start;
> + if (end > this_end)
> + end = this_end;
> +
> + /* Handle the dependencies between the move and previously-scheduled
> + successors. */
> + EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
> + {
> + this_insn = ps_rtl_insn (ps, u);
> + this_latency = insn_latency (move->insn, this_insn);
> + if (distance1_uses && !TEST_BIT (distance1_uses, u))
> + this_distance = -1;
> + else
> + this_distance = 0;
> + this_time = SCHED_TIME (u) + this_distance * ii;
> + this_start = this_time - ii;
> + this_end = this_time - this_latency;
> + if (dump_file)
> + fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> + this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
> + this_latency, this_distance, INSN_UID (this_insn));
> +
> + if (start < this_start)
> + start = this_start;
> + if (end > this_end)
> + end = this_end;
> + }
> +
> + if (dump_file)
> + {
> + fprintf (dump_file, "----------- ----------- -----\n");
> + fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max,
> min)");
> + }
> +
> + sbitmap_zero (must_follow);
> + SET_BIT (must_follow, move->def);
> +
> + start = MAX (start, end - (ii - 1));
> + for (c = end; c >= start; c--)
> + {
> + psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
> + move->uses, must_follow);
> + if (psi)
> + {
> + update_node_sched_params (i_reg_move, ii, c, PS_MIN_CYCLE (ps));
> + if (dump_file)
> + fprintf (dump_file, "\nScheduled register move INSN %d at"
> + " time %d, row %d\n\n", INSN_UID (move->insn), c,
> + SCHED_ROW (i_reg_move));
> + return true;
> }
> }
> +
> + if (dump_file)
> + fprintf (dump_file, "\nNo available slot\n\n");
> +
> + return false;
> }
>
> /*
> @@ -508,6 +696,9 @@ schedule_reg_moves (partial_schedule_ptr
> int nreg_moves = 0, i_reg_move;
> rtx prev_reg, old_reg;
> int first_move;
> + int distances[2];
> + sbitmap must_follow;
> + sbitmap distance1_uses;
> rtx set = single_set (u->insn);
>
> /* Skip instructions that do not set a register. */
> @@ -516,6 +707,7 @@ schedule_reg_moves (partial_schedule_ptr
>
> /* Compute the number of reg_moves needed for u, by looking at life
> ranges started at u (excluding self-loops). */
> + distances[0] = distances[1] = false;
> for (e = u->out; e; e = e->next_out)
> if (e->type == TRUE_DEP && e->dest != e->src)
> {
> @@ -546,6 +738,11 @@ schedule_reg_moves (partial_schedule_ptr
> gcc_assert (!autoinc_var_is_used_p (u->insn, e->dest->insn));
> }
>
> + if (nreg_moves4e)
> + {
> + gcc_assert (e->distance < 2);
> + distances[e->distance] = true;
> + }
> nreg_moves = MAX (nreg_moves, nreg_moves4e);
> }
>
> @@ -556,11 +753,10 @@ schedule_reg_moves (partial_schedule_ptr
> first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
> VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
> first_move + nreg_moves);
> + extend_node_sched_params (ps);
>
> /* Record the moves associated with this node. */
> first_move += ps->g->num_nodes;
> - SCHED_FIRST_REG_MOVE (i) = first_move;
> - SCHED_NREG_MOVES (i) = nreg_moves;
>
> /* Generate each move. */
> old_reg = prev_reg = SET_DEST (single_set (u->insn));
> @@ -569,15 +765,18 @@ schedule_reg_moves (partial_schedule_ptr
> ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
>
> move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
> - move->uses = sbitmap_alloc (g->num_nodes);
> + move->uses = sbitmap_alloc (first_move + nreg_moves);
> move->old_reg = old_reg;
> move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
> + move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
> move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
> sbitmap_zero (move->uses);
>
> prev_reg = move->new_reg;
> }
>
> + distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;
> +
> /* Every use of the register defined by node may require a different
> copy of this register, depending on the time the use is scheduled.
> Record which uses require which move results. */
> @@ -601,8 +800,21 @@ schedule_reg_moves (partial_schedule_ptr
>
> move = ps_reg_move (ps, first_move + dest_copy - 1);
> SET_BIT (move->uses, e->dest->cuid);
> + if (e->distance == 1)
> + SET_BIT (distance1_uses, e->dest->cuid);
> }
> }
> +
> + must_follow = sbitmap_alloc (first_move + nreg_moves);
> + for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
> + if (!schedule_reg_move (ps, first_move + i_reg_move,
> + distance1_uses, must_follow))
> + break;
> + sbitmap_free (must_follow);
> + if (distance1_uses)
> + sbitmap_free (distance1_uses);
> + if (i_reg_move < nreg_moves)
> + return false;
> }
> return true;
> }
> @@ -626,39 +838,6 @@ apply_reg_moves (partial_schedule_ptr ps
> df_insn_rescan (ps->g->nodes[i_use].insn);
> }
> }
> -
> - FOR_EACH_VEC_ELT (ps_reg_move_info, ps->reg_moves, i, move)
> - add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
> -}
> -
> -/* Update the sched_params (time, row and stage) for node U using the II,
> - the CYCLE of U and MIN_CYCLE.
> - We're not simply taking the following
> - SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> - because the stages may not be aligned on cycle 0. */
> -static void
> -update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> -{
> - int sc_until_cycle_zero;
> - int stage;
> -
> - SCHED_TIME (u) = cycle;
> - SCHED_ROW (u) = SMODULO (cycle, ii);
> -
> - /* The calculation of stage count is done adding the number
> - of stages before cycle zero and after cycle zero. */
> - sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> -
> - if (SCHED_TIME (u) < 0)
> - {
> - stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> - SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> - }
> - else
> - {
> - stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> - SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> - }
> }
>
> /* Bump the SCHED_TIMEs of all nodes by AMOUNT. Set the values of
> @@ -699,22 +878,6 @@ reset_sched_times (partial_schedule_ptr
> }
> }
>
> -/* Set SCHED_COLUMN of each node according to its position in PS. */
> -static void
> -set_columns_for_ps (partial_schedule_ptr ps)
> -{
> - int row;
> -
> - for (row = 0; row < ps->ii; row++)
> - {
> - ps_insn_ptr cur_insn = ps->rows[row];
> - int column = 0;
> -
> - for (; cur_insn; cur_insn = cur_insn->next_in_row)
> - SCHED_COLUMN (cur_insn->id) = column++;
> - }
> -}
> -
> /* Permute the insns according to their order in PS, from row 0 to
> row ii-1, and position them right before LAST. This schedules
> the insns of the loop kernel. */
> @@ -731,8 +894,13 @@ permute_partial_schedule (partial_schedu
> rtx insn = ps_rtl_insn (ps, ps_ij->id);
>
> if (PREV_INSN (last) != insn)
> - reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> - PREV_INSN (last));
> + {
> + if (ps_ij->id < ps->g->num_nodes)
> + reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> + PREV_INSN (last));
> + else
> + add_insn_before (insn, last, NULL);
> + }
> }
> }
>
> @@ -935,7 +1103,7 @@ optimize_sc (partial_schedule_ptr ps, dd
>
> static void
> duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
> - int to_stage, int for_prolog, rtx count_reg)
> + int to_stage, rtx count_reg)
> {
> int row;
> ps_insn_ptr ps_ij;
> @@ -944,7 +1112,7 @@ duplicate_insns_of_cycles (partial_sched
> for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
> {
> int u = ps_ij->id;
> - int j, i_reg_moves, i_reg_move;
> + int first_u, last_u;
> rtx u_insn;
>
> /* Do not duplicate any insn which refers to count_reg as it
> @@ -958,42 +1126,15 @@ duplicate_insns_of_cycles (partial_sched
> || JUMP_P (u_insn))
> continue;
>
> - if (for_prolog)
> + first_u = SCHED_STAGE (u);
> + last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
> + if (from_stage <= last_u && to_stage >= first_u)
> {
> - /* SCHED_STAGE (u) >= from_stage == 0. Generate increasing
> - number of reg_moves starting with the second occurrence of
> - u, which is generated if its SCHED_STAGE <= to_stage. */
> - i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
> - i_reg_moves = MAX (i_reg_moves, 0);
> - i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> - /* The reg_moves start from the *first* reg_move backwards. */
> - i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
> - }
> - else /* It's for the epilog. */
> - {
> - /* SCHED_STAGE (u) <= to_stage. Generate all reg_moves,
> - starting to decrease one stage after u no longer occurs;
> - that is, generate all reg_moves until
> - SCHED_STAGE (u) == from_stage - 1. */
> - i_reg_moves = (SCHED_NREG_MOVES (u)
> - - (from_stage - SCHED_STAGE (u) - 1));
> - i_reg_moves = MAX (i_reg_moves, 0);
> - i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> - /* The reg_moves start from the *last* reg_move forwards. */
> - i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) -
> 1);
> + if (u < ps->g->num_nodes)
> + duplicate_insn_chain (ps_first_note (ps, u), u_insn);
> + else
> + emit_insn (copy_rtx (PATTERN (u_insn)));
> }
> -
> - for (j = 0; j < i_reg_moves; j++)
> - {
> - ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
> -
> - emit_insn (copy_rtx (PATTERN (move->insn)));
> - }
> - if (SCHED_STAGE (u) >= from_stage
> - && SCHED_STAGE (u) <= to_stage)
> - duplicate_insn_chain (ps_first_note (ps, u), u_insn);
> }
> }
>
> @@ -1027,7 +1168,7 @@ generate_prolog_epilog (partial_schedule
> }
>
> for (i = 0; i < last_stage; i++)
> - duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
> + duplicate_insns_of_cycles (ps, 0, i, count_reg);
>
> /* Put the prolog on the entry edge. */
> e = loop_preheader_edge (loop);
> @@ -1039,7 +1180,7 @@ generate_prolog_epilog (partial_schedule
> start_sequence ();
>
> for (i = 0; i < last_stage; i++)
> - duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
> + duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
>
> /* Put the epilogue on the exit edge. */
> gcc_assert (single_exit (loop));
> @@ -1375,8 +1516,7 @@ sms_schedule (void)
> {
> rtx head, tail;
> rtx count_reg, count_init;
> - int mii, rec_mii;
> - unsigned stage_count;
> + int mii, rec_mii, stage_count, min_cycle;
> HOST_WIDEST_INT loop_count = 0;
> bool opt_sc_p;
>
> @@ -1486,13 +1626,12 @@ sms_schedule (void)
> }
>
> gcc_assert (stage_count >= 1);
> - PS_STAGE_COUNT (ps) = stage_count;
> }
>
> /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
> 1 means that there is no interleaving between iterations thus
> we let the scheduling passes do the job in this case. */
> - if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
> + if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
> || (count_init && (loop_count <= stage_count))
> || (flag_branch_probabilities && (trip_count <= stage_count)))
> {
> @@ -1520,6 +1659,7 @@ sms_schedule (void)
>
> set_columns_for_ps (ps);
>
> + min_cycle = PS_MIN_CYCLE (ps) - SMODULO (PS_MIN_CYCLE (ps), ps->ii);
> if (!schedule_reg_moves (ps))
> {
> mii = ps->ii + 1;
> @@ -1527,6 +1667,21 @@ sms_schedule (void)
> continue;
> }
>
> + /* Moves that handle incoming values might have been added
> + to a new first stage. Bump the stage count if so.
> +
> + ??? Perhaps we could consider rotating the schedule here
> + instead? */
> + if (PS_MIN_CYCLE (ps) < min_cycle)
> + {
> + reset_sched_times (ps, 0);
> + stage_count++;
> + }
> +
> + /* The stage count should now be correct without rotation. */
> + gcc_checking_assert (stage_count == calculate_stage_count (ps, 0));
> + PS_STAGE_COUNT (ps) = stage_count;
> +
> canon_loop (loop);
>
> if (dump_file)
>