Re: Char shifts promoted to int. Why?

2006-12-21 Thread Ayal Zaks

>Something along these lines may be useful to do in the vectorizer when we
>get code like this:
>  > ((char)x) = ((char)( ((int)((char)x)) << ((int)c) ) )
>and don't feel like doing all the unpacking of chars to ints and then
>packing the ints to chars after the shift. An alternative could be to
>transform the above pattern to:
>  char_x1 = 0
>  char_x2 = char_x << c
>  char_x = ((int)c < size_of_char) ? char_x2 : char_x1
>and vectorize that (since we already know how to vectorize selects).

Alternatively, do
  char_c2 = (c < size_of_char ? c : 0)
  char_x2 = char_x << char_c2
which is like saturating the shift amount.

You could also try
  char_c2 = min(c, size_of_char)   /* And mask off a bit perhaps.  */
  char_x2 = char_x << char_c2
if you don't have general selects but do have min.
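To make the lane semantics concrete, here is a minimal scalar sketch of
the min-based saturation (my own illustration, not vectorizer output),
assuming size_of_char is the element width in bits (8 here) and that the
truncating cast back to char makes a full-width shift yield 0:

  /* One 8-bit lane of the vector shift, with a saturated amount.  */
  unsigned char
  shift_lane (unsigned char x, unsigned int c)
  {
    unsigned int c2 = c < 8 ? c : 8;   /* char_c2 = min (c, size_of_char) */
    return (unsigned char) (x << c2);  /* (x << 8) truncates to 0 */
  }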

Ayal.



Fw: Scheduling an early complete loop unrolling pass?

2007-02-07 Thread Ayal Zaks

...

>Ah, right... I wonder if we can keep the loop structure in place, even
>after completely unrolling the loop - I mean the 'struct loop' in
>'current_loops' (not the actual CFG), so that the "SLP in loops" would
>have a chance to at least consider vectorizing this "loop".

Having a "loop" structure for a piece of CFG that is not a loop, was used
in some other compiler we worked with - the notion of 'region' was such
that it corresponded to loops, and in addition the entire function belonged
to a "universal" region (Peter - please correct if I'm wrong). But I think
you were looking for some marking of a basic block saying "this used to be
a loop but got completely unrolled". I wonder how much such a "dummy" loop
structure can really help the vectorizer, except for (convenience of)
keeping intact the driver that traverses all such structures or the hanging
of additional data off of them.

Ayal.



Re: RFC: vectorizer cost model

2007-03-08 Thread Ayal Zaks

> > > "Linthicum, Tony" <[EMAIL PROTECTED]> writes:
...
> > >   * What is the best way to access target level cost information?
> >
> > I'm sure you know that the loop code does this by generating RTL and
> > asking for the cost of it (computation_cost in tree-ssa-loop-ivopts.c).
>
> Which should be removed (I have some patches for that, but I somehow
> forgot about them because of other issues).
>
> Zdenek
...
> > > ... determine the minimum trip count needed for the vector loop to be
> > > profitable.

So, if we know the actual trip count to be less than this minimum trip
count (mtc), we'll abandon the effort to vectorize. And if we don't know
the actual trip count, we can ask at runtime if it is larger than mtc. We
already ask at runtime if the actual trip count is larger than the
vectorization factor, if we don't know the actual trip count at compile
time. In such (predominant?) cases the vectorized loop gets versioned,
keeping its original scalar version and introducing another, vector
version, with this runtime guard choosing between the two. (This is also
used to check if potentially aliased pointers overlap, or to check
alignment constraints.) If we version the loop, we can update the threshold
of this guard at a later stage, when the cost of each version becomes more
apparent (RTL).
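
In source terms, such versioning looks roughly like the sketch below (my
own illustration, not actual vectorizer output; MTC stands for the
hypothetical minimum-trip-count threshold, i.e. the value a later RTL
stage could update in place):

  #define MTC 16   /* hypothetical profitability threshold */

  void
  add_arrays (float *a, float *b, int n)
  {
    /* Runtime guard: trip count vs. threshold, plus a crude no-overlap
       check standing in for the aliasing/alignment checks above.  */
    if (n >= MTC && (a + n <= b || b + n <= a))
      for (int i = 0; i < n; i++)   /* vector version (vectorized loop) */
        a[i] += b[i];
    else
      for (int i = 0; i < n; i++)   /* original scalar version */
        a[i] += b[i];
  }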

So if we don't want to generate RTL just for the sake of computing its
cost, we can wait until it is eventually generated (and optimized) and plug
its cost back where we need it.

One disadvantage is that we're not saving compile-time (and probably
producing worse code) if the mtc turns out to be greater than a known trip
count, or greater than any conceivable trip count. There are ways to try
and cope with this disadvantage, and there are additional disadvantages...

BTW, we had a simple cost model for the modulo scheduler, which turned out
to be overly conservative (fourth in-progress item on
http://gcc.gnu.org/wiki/SwingModuloScheduling).

Ayal.



Re: Improvements of the haifa scheduler

2007-03-09 Thread Ayal Zaks
Maxim Kuvyrkov <[EMAIL PROTECTED]> wrote on 04/03/2007 11:53:47:

> Hi.
>
> I want to share some of my thoughts and doings on improving / cleaning
> up current GCC instruction scheduler (Haifa) - most of them are just
> small obvious improvements.
>
> I have semi-ready patches for about a half of them and would appreciate
> any early suggestion or comments on the following draft plan:
>
> 1. Remove compute_forward_dependencies ().  [Done]
>
> Since the new representation of dependency lists was checked in, we no
> longer need to compute forward dependencies separately.  It would be
> natural to add forward links at the same time as we generate backward
> ones.
>
> 2. Use alloc_pools instead of obstacks for dep_nodes and deps_lists.
> [In progress]
>
> As pointed out by Jan Hubicka, the scheduler peaks at +100M on a testcase for
> PR28071 after my patch for new dependencies lists was checked in.
> Though alloc_pools should have been my first choice while writing that
> patch, I decided to mimic as close as possible the original rtx
> instruction lists with their scheme of deallocation at the very end.  So
> the next step would be to define proper lifetime of dependency lists
> and use alloc_pools to reuse nodes and lists from the previous regions.

It might be possible to shrink such large DDGs by removing edges that are
redundant wrt scheduling, due to transitivity. Not sure how much we'll gain
by such a (non-trivial, dangerous) effort.
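
As a rough illustration of the idea (my own sketch, not DDG code): an
edge u->v is redundant for scheduling when some two-edge path u->w->v
already enforces at least the same latency. A toy O(N^3) pass over a
latency matrix:

  /* lat[u][v] > 0: dependence edge with that latency; lat[u][v] <= 0:
     no edge.  Toy sketch assuming an acyclic DDG with positive
     latencies, so a dropped edge stays enforced by the surviving path;
     a production version must avoid mutually-justifying drops.  */
  void
  prune_transitive_deps (int n, int lat[n][n])
  {
    for (int w = 0; w < n; w++)
      for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
          if (lat[u][w] > 0 && lat[w][v] > 0 && lat[u][v] > 0
              && lat[u][w] + lat[w][v] >= lat[u][v])
            lat[u][v] = 0;   /* implied by the path u -> w -> v */
  }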

>
> Which brings us to ...
>
> 3. Define clear interface for manipulating dependencies.  [In progress]
>
> This one popped up when I began to debug <2> and understood that the
> scheduler uses and changes dependency lists in ways it shouldn't.
> Lists are being copied, edited and deleted directly without interaction
> with sched-deps.c.  What the scheduler really needs is the following
> set of primitives:
>o FOR_EACH_DEP (insn, which_list, iterator, dep) - walk through
> insn's which_list (one of {backward, resolved_backward, forward,
> resolved_forward}) and provide the user with the dep.  Ayal Zaks
> suggested this type of macro weeks ago but at that time I didn't agree
> with him.
>o dep_t find_dep_between (producer, consumer) - find dependency
> between two instructions.  Currently we walk through the list looking
> for what we need.  A better way would be to first check dependency
> caches and then, if we can't determine that there is no dependency, walk
> through the shorter list given the choice of two: producer's forward
> list and consumer's backward list.
>o void add_dep (dep_t) - Add a dependency.
>o void remove_dep (iterator) - Remove dependency pointed out by
> iterator.
>o void resolve_dep (iterator) - Resolve dependency pointed out by
> iterator.
>o int get_list_size (insn, which_list) - Get the size of the insn's
> which_list.
>o bool list_has_only_speculative_deps (insn, which_list) - Return
> true if all insn's dependencies can be overcome with some sort of
> speculation.

It would probably be cleaner to somehow separate or designate the
speculative part from the core DDG part (in terms of data and interfaces).

>o void {create, delete}_dependency_lists (insn) - Create / delete
> dependency lists for insn.
>
> As you can see, the scheduler doesn't need to know the internal
> representation of the deps_list / dep_node.
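
To illustrate the shape of such an interface, here is a self-contained
toy sketch; all names are illustrative stand-ins, not the actual
sched-deps.c API. The point is that the scheduler only walks lists
through FOR_EACH_DEP and mutates them through add_dep-style primitives,
so the representation (here a singly-linked list) stays private:

  #include <stdio.h>
  #include <stdlib.h>

  typedef struct dep
  {
    int producer, consumer;      /* toy stand-ins for insn pointers */
    struct dep *next;
  } *dep_t;

  typedef struct { dep_t head; } deps_list;

  #define FOR_EACH_DEP(LIST, DEP) \
    for ((DEP) = (LIST)->head; (DEP) != NULL; (DEP) = (DEP)->next)

  static void
  add_dep (deps_list *list, int producer, int consumer)
  {
    dep_t d = malloc (sizeof *d);
    d->producer = producer;
    d->consumer = consumer;
    d->next = list->head;
    list->head = d;
  }

  int
  main (void)
  {
    deps_list back_deps = { NULL };
    dep_t dep;

    add_dep (&back_deps, 1, 3);
    add_dep (&back_deps, 2, 3);
    FOR_EACH_DEP (&back_deps, dep)
      printf ("insn %d depends on insn %d\n", dep->consumer, dep->producer);
    return 0;
  }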
>
> 4. Support speculative loads into subregs.  [Planned]
>
> As noted by Jim Wilson current support for ia64 speculation doesn't
> handle subregs though that would be easy to fix.
>
> 5. Make sched-deps.c mark only those dependencies as speculative which
> can actually be overcome with the speculation types currently in use.
> [Planned]
>
> At the moment we first generate speculative dependencies, and only at the
> moment of adding an instruction to the ready list do we check if we can
> (or whether it is worth it to) overcome each of its dependencies.
>
> 6. Make ds_t a structure.  [Planned]
>
> ds_t is the type for representing the status of a dependency.  It contains
> information about the types of the dependency (true, output, anti) and
> probabilities of speculation success (begin_data, be_in_data,
> begin_control, be_in_control) - that makes three bits and four integers
> coded in a single int.  Historical reasons forced this inelegant
> approach but now the reasons are nonexistent and the problem can be
> solved in a natural way.
>
> 7. Use cse_lib in sched-rgn.c .  [In progress]
>
> At the moment cse_lib works to improve alias analysis only during
> sched-ebb scheduling.  It is trivial that we can also enable it when
> scheduling single block regions in sched-rgn.  The patch for this is
> a one-liner which w

Re: Ayal Zaks appointed Modulo Scheduler maintainer

2007-03-23 Thread Ayal Zaks
David Edelsohn <[EMAIL PROTECTED]> wrote on 23/03/2007 19:04:57:

>I am pleased to announce that the GCC Steering Committee has
> appointed Ayal Zaks as Modulo Scheduler maintainer.
>
>Please join me in congratulating Ayal on his new role.
> Ayal, please update your listings in the MAINTAINERS file.

Thanks!  I will look into the modulo-scheduler patch backlog shortly, and
I've updated the MAINTAINERS file with the attached patch.

Ayal.


Index: ChangeLog
===
--- ChangeLog   (revision 123172)
+++ ChangeLog   (working copy)
@@ -1,3 +1,7 @@
+2007-03-24  Ayal Zaks  <[EMAIL PROTECTED]>
+
+   * MAINTAINERS (Modulo Scheduler): Add myself.
+
 2007-03-23  Brooks Moses  <[EMAIL PROTECTED]>

* MAINTAINERS (fortran 95 front end): Add myself.
Index: MAINTAINERS
===
--- MAINTAINERS (revision 117617)
+++ MAINTAINERS (working copy)
@@ -158,6 +158,7 @@
 scheduler (+ haifa)Michael Meissner[EMAIL PROTECTED]
 scheduler (+ haifa)Jeff Law[EMAIL PROTECTED]
 scheduler (+ haifa)Vladimir Makarov[EMAIL PROTECTED]
+modulo-scheduler   Ayal Zaks   [EMAIL PROTECTED]
 reorg  Jeff Law[EMAIL PROTECTED]
 caller-save.c  Jeff Law[EMAIL PROTECTED]
 callgraph  Jan Hubicka [EMAIL PROTECTED]



Re: SoC Project: Propagating array data dependencies from Tree-SSA to RTL

2007-03-25 Thread Ayal Zaks
"Daniel Berlin" <[EMAIL PROTECTED]> wrote on 25/03/2007 05:19:41:

> On 3/23/07, Alexander Monakov <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> >
> > I would be pleased to see Ayal Zaks as my mentor, because the proposed
> > improvement is primarily targeted at modulo scheduling. In
> > case this is not possible, I will seek guidance from Maxim Kuvyrkov.
> >
> Ayal has not signed up to be a mentor (as of yet). If he doesn't, i'd
> be happy to mentor you here, since i wrote part of tree-data-ref.c

Sorry, I fear I may have too little time to devote to this; plus, it would
be very useful to start with a good understanding of tree-data-ref, the
basis from which to propagate the dep info. Vladimir Yanovsky and I will be
able to help when it comes to what/how to feed the modulo scheduler.

Thanks, Ayal.



Fw: Store scheduling with DFA scheduler

2005-05-01 Thread Ayal Zaks




As you noted, when the scheduler decides between stores and adds it always
prefers the adds (first at t = 5), due to its critical path heuristic. In
Jon's example, stores seem "costly" as one cannot issue a load or store
immediately following a store. Perhaps the scheduler could take the
(resource-related) "cost" of candidates into consideration (in addition to
their latency-related critical path) when making such decisions. Still
heuristically speaking, of course.
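
A toy sketch of the suggestion (invented names, not haifa-sched
internals): fold a resource-related cost into the choice among ready
candidates, instead of ranking by critical path alone:

  /* Toy comparator for two ready candidates.  Critical path still
     dominates, but a resource-related cost -- e.g. "issuing this store
     blocks the next load/store for a cycle" -- breaks ties instead of
     being ignored.  */
  static int
  better_candidate_p (int path_a, int cost_a, int path_b, int cost_b)
  {
    if (path_a != path_b)
      return path_a > path_b;   /* longer critical path first */
    return cost_a < cost_b;     /* then cheaper resource cost */
  }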

Ayal.



Re: Software pipelining capabilities

2005-06-15 Thread Ayal Zaks




> Vasanth <[EMAIL PROTECTED]>

> I am using powerpc-eabi-gcc (3.4.1) and trying to retarget it for a
> fully pipelined FPU. I have a DFA model for the FPU. I am looking at
> the code produced for a simple FIR algorithm (a loop iterating over an
> array, with a multiply-add operation per iteration). (I am not using
> the fused-madd)
>
> for (i = 0; i < 64; i++)
>  accum += z[i] * h[i];
>
> I have the FIR loop partially unrolled, yet am not seeing the multiply
> from say, iteration i+1, overlapping with the multiply from iteration
> i. From the scheduling dumps, I do see that the compiler knows that
> each use of the multiply is incurring the full latency of the multiply
> instead of having reduced latency by pipelining in software. The adds
> are also completely linked by data flow and the compiler does not seem
> to be using temporary registers to be able to exploit executing some
> of the adds in parallel. Hence, each add is stalled on the previous
> add.
>
> fadds   f5,f0,f8
> fadds   f4,f5,f6
> fadds   f2,f4,f11
> fadds   f1,f2,f3
> fadds   f11,f1,f13
>
> The register pressure is not very high. Registers f15-f31 are not used at
all.

To break the linkage between the adds, try to keep the original loop
(instead of partially unrolling it yourself) and use -funroll-loops
-fvariable-expansion-in-unroller --param
max-variable-expansions-in-unroller=8 (or some other number greater than 1
but small enough to avoid spills).

(see http://gcc.gnu.org/ml/gcc/2004-09/msg01554.html)

This too was introduced in GCC 4.0.
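
By way of illustration, a hand-written equivalent of what accumulator
expansion achieves (a sketch, not compiler output): the single
accumulator is expanded into several, so the fadds no longer form one
serial chain and can execute in parallel:

  float
  fir (const float *z, const float *h)
  {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int i = 0; i < 64; i += 4)
      {
        a0 += z[i]     * h[i];
        a1 += z[i + 1] * h[i + 1];
        a2 += z[i + 2] * h[i + 2];
        a3 += z[i + 3] * h[i + 3];
      }
    /* Combine the partial sums only after the loop.  */
    return (a0 + a1) + (a2 + a3);
  }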
Ayal.


>
> My question is, am I expecting the wrong version of GCC to be doing
> this. I saw the following thread about SMS.
>
> http://gcc.gnu.org/ml/gcc/2003-09/msg00954.html
>
> that seems relevant. Would GCC 4.x be a better version for my
> requirement? If not, any ideas would be greatly appreciated.
>
> thanks in advance,
> Vasanth



Re: Question about SMS scheduling windows

2011-07-27 Thread Ayal Zaks
(sorry for replicated submissions, had to convert to plain text)

>2011/7/27 Revital1 Eres 
>
>Hello Richard,
>
>
>> I ask because in the final range:
>>
>>   start = early_start;
>>   end = MIN (end, early_start + ii);
>>   /* Schedule the node close to it's predecessors.  */
>>   step = 1;
>>
>> END is an exclusive bound.  It seems like we might be double-counting
>> here, and effectively limiting the schedule to SCHED_TIME (v_node) + ii - 2.
>
>
>Yes, I think it indeed should be fixed. Thanks for reporting on this.
>
>Revital

Agreed;

  if (e->data_type == MEM_DEP)
    end = MIN (end, SCHED_TIME (v_node) + ii - 1);

should be replaced with

  if (e->data_type == MEM_DEP)
    end = MIN (end, p_st + ii);

also for the (3rd) case when there are both previously-scheduled
predecessors and previously-scheduled successors. The range is inclusive
of start and exclusive of end: for (c = start; c != end; c += step)...
With an exclusive end, MIN (end, p_st + ii) still admits cycle
p_st + ii - 1, whereas the old bound wrongly stopped at p_st + ii - 2.


>This doesn't seem to be in the paper, and the comment suggests
>"count_succs > count_preds" rather than "count_succs >= count_preds".
>Is the ">=" vs ">" important?

I think not: in both cases you'll be trying to minimize the same
number of live ranges. But indeed it's good to be consistent with the
comment.

Thanks,
Ayal.


Re: Question about SMS scheduling windows

2011-08-04 Thread Ayal Zaks
Hi Richard,

2011/8/4 Richard Sandiford 
>
> Hi Ayal,
>
> Thanks to you and Revital for the replies.  The reason I asked is that
> I wanted to rewrite gen_sched_window so that it has only one loop over
> the PSPs and one loop over the PSSs.


This rewrite makes perfect sense regardless of any follow-up patches,
as it clarifies and reuses code.

>
> I have a follow-up patch to use
> iv analysis to reduce the number of memory dependencies (or at least
> increase the distance between them), and it was easier to do that
> after this change.


Reducing memory dependencies is definitely important.

>
> Ayal Zaks  writes:
> >>> I ask because in the final range:
> >>>
> >>>       start = early_start;
> >>>       end = MIN (end, early_start + ii);
> >>>       /* Schedule the node close to it's predecessors.  */
> >>>       step = 1;
> >>>
> >>> END is an exclusive bound.  It seems like we might be double-counting
> > here,
> >>> and effectively limiting the schedule to SCHED_TIME (v_node) + ii - 2.
> >>
> >>
> >>Yes, I think it indeed should be fixed. Thanks for reporting on this.
> >>
> >>Revital
> >
> > Agreed;
> >
> >                       if (e->data_type == MEM_DEP)
> >                                 end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> >
> > should be replaced with
> >
> >                       if (e->data_type == MEM_DEP)
> >                                 end = MIN (end, p_st + ii);
> >
> > also for the (3rd) case when there are both previously-scheduled
> > predessors and previously-scheduled successors. The range is inclusive
> > of start and exclusive of end: for (c = start; c != end; c += step)...
>
> OK.  For the follow-on iv patch, it seemed easier to keep both bounds
> inclusive for the loop, then make the "end" exclusive when setting the
> out parameters.  (Note that there shouldn't be any problem with overflow
> when making the bound exclusive, because the size of the range has been
> restricted to "ii" by that stage.  The follow-on patch does use
> saturating maths for all ops though.)


Sure, no problem having "end" inclusive, as it simplifies things. We
can even keep it inclusive all the way through, iterating over: for (c
= start; c != end+step; c += step)...
>
> >>This doesn't seem to be in the paper, and the comment suggests
> >>"count_succs > count_preds" rather than "count_succs >= count_preds".
> >>Is the ">=" vs ">" important?
> >
> > I think not: in both cases you'll be trying to minimize the same
> > number of live ranges. But indeed it's good to be consistent with the
> > comment.
>
> OK.  For no particular reason other than cowardice, I ended up keeping
> this as:
>
> !   if (count_succs && count_succs >= count_preds)
>
> The reason for asking was that:
>
> !   if (count_succs > count_preds)
>
> seemed more natural, and would match the existing comment.  I'm happy
> to test that instead if you prefer.
>
I wouldn't worry about this tie breaker, unless there's a reason (in
which case the reason should hopefully provide a secondary criterion).

>
> I don't have powerpc hardware that I can do meaningful performance
> testing on, but I did run it through a Popular* Embedded Benchmark
> on an ARM Cortex-A8 board with -O3 -fmodulo-sched
> -fmodulo-sched-allow-regmoves.  There were no changes.  (And this is
> a benchmark that does benefit from modulo scheduling, in some cases
> by a significant amount.)
>
Ok, the patch should be neutral performance-wise. One could check btw
if SMS produces exactly the same output in both cases, even w/o a
powerpc (or any other) HW.

>
> Bootstrapped & regression-tested on powerpc-ibm-aix5.3 with the
> additional patch:
>
> Index: gcc/opts.c
> ===
> --- gcc/opts.c  2011-08-03 10:56:50.0 +0100
> +++ gcc/opts.c  2011-08-03 10:56:50.0 +0100
> @@ -449,6 +449,8 @@ static const struct default_options defa
>     { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 },
>     { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 },
>     { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 },
> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched, NULL, 1 },
> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched_allow_regmoves, NULL, 1 },
>
>     /* -O2 optimizations.  */
>     { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 },
>
> applied for both the "

Re: Question about SMS scheduling windows

2011-08-08 Thread Ayal Zaks
2011/8/8 Richard Sandiford 
>
> Ayal Zaks  writes:
> >> OK.  For the follow-on iv patch, it seemed easier to keep both bounds
> >> inclusive for the loop, then make the "end" exclusive when setting the
> >> out parameters.  (Note that there shouldn't be any problem with overflow
> >> when making the bound exclusive, because the size of the range has been
> >> restricted to "ii" by that stage.  The follow-on patch does use
> >> saturating maths for all ops though.)
> >
> > Sure, no problem having "end" inclusive, as it simplifies things. We
> > can even keep it inclusive all the way through, iterating over: for (c
> > = start; c != end+step; c += step)...
>
> Yeah, I'd prefer that too.  I might do it once Revital's and
> Roman's changes are in.
>
> >> I don't have powerpc hardware that I can do meaningful performance
> >> testing on, but I did run it through a Popular* Embedded Benchmark
> >> on an ARM Cortex-A8 board with -O3 -fmodulo-sched
> >> -fmodulo-sched-allow-regmoves.  There were no changes.  (And this is
> >> a benchmark that does benefit from modulo scheduling, in some cases
> >> by a significant amount.)
> >>
> > Ok, the patch should be neutral performance-wise. One could check btw
> > if SMS produces exactly the same output in both cases, even w/o a
> > powerpc (or any other) HW.
>
> OK.  The patch does change the output a little because of the MEM_DEP
> range thing.


Ahh, right.

>
>  To compensate for that, I tried compiling libav on ARM with:
>
> Index: gcc/modulo-sched.c
> ===
> --- gcc/modulo-sched.c  2011-08-05 11:23:16.0 +0100
> +++ gcc/modulo-sched.c  2011-08-05 11:27:57.0 +0100
> @@ -1680,7 +1680,7 @@ get_sched_window (partial_schedule_ptr p
>                          p_st, early_start, e->latency);
>
>              if (e->data_type == MEM_DEP)
> -               end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> +               end = MIN (end, SCHED_TIME (v_node) + ii);
>            }
>          else if (dump_file)
>             fprintf (dump_file, "the node is not scheduled\n");
> @@ -1729,7 +1729,7 @@ get_sched_window (partial_schedule_ptr p
>                          s_st, late_start, e->latency);
>
>              if (e->data_type == MEM_DEP)
> -               end = MAX (end, SCHED_TIME (v_node) - ii + 1);
> +               end = MAX (end, SCHED_TIME (v_node) - ii);
>              if (dump_file)
>                  fprintf (dump_file, "end = %d\n", end);
>
> @@ -1791,7 +1791,7 @@ get_sched_window (partial_schedule_ptr p
>                 count_preds++;
>
>              if (e->data_type == MEM_DEP)
> -               end = MIN (end, SCHED_TIME (v_node) + ii - 1);
> +               end = MIN (end, SCHED_TIME (v_node) + ii);
>            }
>           else if (dump_file)
>             fprintf (dump_file, "the node is not scheduled\n");
>
> applied, then tried the same thing with the revised patch below.
> The output was the same.


ok, that's a good sanity check.

>
> (FWIW, libav did show up extra differences when using the patch
> that I'd originally submitted.  They were due to the count_preds
> and count_succs thing that you picked up in your review.)


(These differences had no noticeable consequences performance-wise, right?)

>
> >> Bootstrapped & regression-tested on powerpc-ibm-aix5.3 with the
> >> additional patch:
> >>
> >> Index: gcc/opts.c
> >> ===
> >> --- gcc/opts.c  2011-08-03 10:56:50.0 +0100
> >> +++ gcc/opts.c  2011-08-03 10:56:50.0 +0100
> >> @@ -449,6 +449,8 @@ static const struct default_options defa
> >>     { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 },
> >>     { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 },
> >>     { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 },
> >> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched, NULL, 1 },
> >> +    { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched_allow_regmoves, NULL, 1 },
> >>
> >>     /* -O2 optimizations.  */
> >>     { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 },
> >>
> >> applied for both the "before" and "after" runs.  OK to install?
> >
> >
> > Yes, with just a couple of minor non-mandatory comments:
> > 1. Setting "count_preds = psp_not_empty;" and "count_succs =
> > pss_not_empty;" suggest

Re: A question about sched_analyze_insn in sched-deps.c

2011-08-15 Thread Ayal Zaks
>AFAIK SMS will not do speculative memory access.

Right, SMS does no speculative memory access. Though that might not be
a bad idea...
Ayal.


2011/8/11 Revital Eres 
>
> Hello,
>
> >> I'd appreciate an explanation regarding the following piece of code in
> >> the sched_analyze_insn function (sched-deps.c): when handling a jump instruction,
> >> dependence edges are created between the jump instruction and memory
> >> writes and volatile reads and I'm not quite sure the reason why.
> >
> > Jump instructions can be conditional.  Note the check for whether the
> > next instruction is a barrier.
>
> Thanks for the answer. I'm asking that in the context of SMS --- I'm
> not sure if this dependence is needed when SMS is applied. AFAIK SMS
> will not do speculative memory access.  If that's indeed the case I'll
> submit the following patch.
>
> Thanks,
> Revital
>
> Index: sched-deps.c
> ===
> --- sched-deps.c        (revision 177556)
> +++ sched-deps.c        (working copy)
> @@ -2777,32 +2777,36 @@ sched_analyze_insn (struct deps_desc *de
>             }
>
>          /* All memory writes and volatile reads must happen before the
> -            jump.  Non-volatile reads must happen before the jump iff
> -            the result is needed by the above register used mask.  */
> +            jump unless the analysis is done for the SMS pass.
> +            Non-volatile reads must happen before the jump iff the
> +            result is needed by the above register used mask.  */
>
> -         pending = deps->pending_write_insns;
> -         pending_mem = deps->pending_write_mems;
> -         while (pending)
> +         if (common_sched_info->sched_pass_id != SCHED_SMS_PASS)
>            {
> -             if (! sched_insns_conditions_mutex_p (insn, XEXP (pending, 0)))
> -               add_dependence (insn, XEXP (pending, 0), REG_DEP_OUTPUT);
> -             pending = XEXP (pending, 1);
> -             pending_mem = XEXP (pending_mem, 1);
> -           }
> -
> -         pending = deps->pending_read_insns;
> -         pending_mem = deps->pending_read_mems;
> -         while (pending)
> -           {
> -             if (MEM_VOLATILE_P (XEXP (pending_mem, 0))
> -                 && ! sched_insns_conditions_mutex_p (insn, XEXP (pending, 0)))
> -               add_dependence (insn, XEXP (pending, 0), REG_DEP_OUTPUT);
> -             pending = XEXP (pending, 1);
> -             pending_mem = XEXP (pending_mem, 1);
> +             pending = deps->pending_write_insns;
> +             pending_mem = deps->pending_write_mems;
> +             while (pending)
> +               {
> +                 if (! sched_insns_conditions_mutex_p (insn, XEXP (pending, 0)))
> +                   add_dependence (insn, XEXP (pending, 0), REG_DEP_OUTPUT);
> +                 pending = XEXP (pending, 1);
> +                 pending_mem = XEXP (pending_mem, 1);
> +               }
> +
> +             pending = deps->pending_read_insns;
> +             pending_mem = deps->pending_read_mems;
> +             while (pending)
> +               {
> +                 if (MEM_VOLATILE_P (XEXP (pending_mem, 0))
> +                     && ! sched_insns_conditions_mutex_p (insn, XEXP (pending, 0)))
> +                   add_dependence (insn, XEXP (pending, 0), REG_DEP_OUTPUT);
> +                 pending = XEXP (pending, 1);
> +                 pending_mem = XEXP (pending_mem, 1);
> +               }
> +
> +             add_dependence_list (insn, deps->last_pending_memory_flush, 1,
> +                                  REG_DEP_ANTI);
>            }
> -
> -         add_dependence_list (insn, deps->last_pending_memory_flush, 1,
> -                              REG_DEP_ANTI);
>        }
>     }


Re: Solve transitive closure issue in modulo scheduling

2009-02-01 Thread Ayal Zaks
"Bingfeng Mei"  wrote on 30/01/2009 14:44:01:

> Hello,
> I try to make modulo scheduling work more efficiently for our VLIW target.
> I found that one serious issue preventing the current SMS algorithm from
> achieving high IPC is the so-called "transitive closure" problem, where the
> scheduling window is only calculated using direct predecessors and
> successors. Because SMS is not an iterative algorithm, this may cause
> failures in finding a valid schedule.

Agreed.

> Without splitting rows, some simple loops just cannot be scheduled, no
> matter how big the II is. With splitting rows, a schedule can be found,
> but only at a bigger II.

It may happen that even splitting rows will not help, e.g. when we
repeatedly end up with a node having a negative scheduling window.

> GCC wiki (http://gcc.gnu.org/wiki/SwingModuloScheduling) lists this
> as a TODO. Is there any work going on about this issue

No, not to my knowledge. We had some testcase where this showed up, hence
its appearance in the TODO, but afaicr some change caused it to disappear.

> (the last wiki update was one year ago)? If no one is working on it, I
> plan to do it. My idea is to use the MinDist algorithm described in
> B. Rau's classic paper "Iterative Modulo Scheduling"
> (http://www.hpl.hp.com/techreports/94/HPL-94-115.html). The same
> algorithm can also be used to compute a better RecMII. The biggest
> concern is the complexity of computing the MinDist matrix, which is
> O(N^3), where N is the number of nodes in the loop. I remember the GCC
> coding guide somewhere says "never write a quadratic algorithm" :-) Is
> this an absolute requirement? If yes, I will keep it as our
> target-specific code (we are less concerned about compilation time).
> Otherwise, I will try to make it more generic to see if it can make it
> into mainline in 4.5. Any comments?
>

The problem appears only when the DDG is cyclic, and for every cycle in the
DDG, the problem may arise when trying to schedule the last node of the
cycle, which has both predecessors and successors already scheduled. So you
might try to connect only each such predecessor to every such successor
with a transitive arc, to ensure that this last node will have a non-empty
scheduling window. But this might not suffice; you may eventually need to
wire (nearly) every pair of nodes in the strongly connected component. If
this happens, you'd be better off with a dense graph representation
(incidence matrix) than the current sparse one (adjacency lists).

An example would help see things more clearly. If you have a (small :) DDG
demonstrating the need for transitive arcs, I'd be happy to have a look and
advise what should be done.

Ayal.

> Cheers,
> Bingfeng Mei
>
> Broadcom UK
>
>



RE: Solve transitive closure issue in modulo scheduling

2009-02-17 Thread Ayal Zaks
"Bingfeng Mei"  wrote on 05/02/2009 12:45:20:

> Ayal,
> OOPs, your mail skipped my inbox and I missed it for several days.
>

I'm chasing after my mail as well..

> Using testsuite/gcc.dg/sms-6.c as an example and compiling it for PowerPC,
> node 18 (see attachment) is in an SCC and cannot be scheduled until
> splitting twice. The MII = 20 and the schedule can only be found at II = 24.

Yes, I see. This example raises a couple of issues:

o The first row split (from II=20 to II=21) is miscalculated; it should be
row 20 (= 0 mod II) instead of 19. Splitting row 19 cannot help schedule
node 18, and indeed we immediately split another row. We're now checking a
small patch to fix this, which should save one cycle of II in the above example.

o The swinging scheduling procedure tends to create sets of densely
scheduled instructions with holes between them. For example, the set of
nodes 3, 7, 11, 15 is first scheduled adjacently before (without leaving room
for) node 18 (in cycles 6 (=26 mod 20), 25, 24, 23 respectively). So it may
be possible to delete some empty row from such a hole (it suffices to check a
single row from each hole) after we insert an empty row to accommodate an
instruction, thereby maintaining the II. This should save at least one
additional cycle of II in the above example.

o Transitive closure can indeed prevent the construction of infeasible
partial schedules. Note that you'd be better off with a dense incidence
representation for the DDG, as it will become a complete graph. The
transitive DDG could be pruned according to the scheduling order, as we're
only interested in adding edge u -> v if there is some w on a path u ~> w
~> v where w follows both u and v in the to-be-scheduled order; but for
that you probably need to have transitive edges u -> w and w -> v ...
Plus, leaving a single cycle for w to be scheduled between u and v may not
be enough, as other instructions or conflicts may prevent w from eventually
being scheduled in this cycle. So the split (&remove) row mechanism may
still be useful.

o Another preventive (rather than corrective) action can be to prioritize
register-dependent nodes over memory-dependent ones. Note that in the
example above, the 4 predecessors (7,11,15,18) of node 3 are equally
critical, and we're missing some tie-breaking criteria. It may be better to
prioritize node 18 due to its register dependence (the general motivation
being that nearby nodes imply shorter register live-ranges) over the other
nodes with output memory dependences. It's not easy, however, to extend the
current infrastructure to consider such criteria.

> On our 2-issue VLIW, the MII=10 and a valid schedule can only be found at
> II = 14. It is not great since we want to maximize performance.
>
> I had experience (in the development of another compiler) with this issue,
> constructing the MinDist matrix and using it to calculate the scheduling
> window for each instruction. However, speed was not a real concern then.
> Do you know a better algorithm (better than O(N^3))?

No, I don't. As mentioned above, we might be able to save something by not
calculating the complete closure.
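
For reference, the MinDist closure from Rau's paper is essentially a
longest-path computation; a minimal sketch (my own, with illustrative
names) of the O(N^3) Floyd-Warshall-style loop being discussed:

  #include <limits.h>

  /* dist[u][v]: minimum number of cycles by which v must follow u for
     a candidate II.  Initialize each entry to latency(e) - II *
     distance(e) for an edge e: u -> v, and to INT_MIN ("no path")
     otherwise; the closure then maximizes over all paths.  */
  void
  compute_min_dist (int n, int dist[n][n])
  {
    for (int k = 0; k < n; k++)
      for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
          if (dist[u][k] != INT_MIN && dist[k][v] != INT_MIN
              && dist[u][k] + dist[k][v] > dist[u][v])
            dist[u][v] = dist[u][k] + dist[k][v];
  }

A positive dist[u][u] on the diagonal then means the candidate II is
infeasible, which is also how the same matrix yields a better RecMII.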

> Or do you think it is not so critical here? After all, not many loops are
> candidates for software pipelining. Thanks.
>

I would suggest to check the actual compile-time and memory usage, and try
to reduce them later if needed. Modulo-scheduling is not a compile-time
cheap optimization in general.
Ayal.



> Cheers,
> Bingfeng
>
> > -Original Message-
> > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On
> > Behalf Of Ayal Zaks
> > Sent: 01 February 2009 23:18
> > To: Bingfeng Mei
> > Cc: Adrian Ashley; gcc@gcc.gnu.org
> > Subject: Re: Solve transitive closure issue in modulo scheduling
> >
> > "Bingfeng Mei"  wrote on 30/01/2009 14:44:01:
> >
> > > Hello,
> > > I try to make modulo scheduling work more efficiently for our VLIW
> > target. I
> > > found one serious issue that prevents current SMS algorithm from
> > achieving
> > > high IPC is so-called "transitive closure" problem, where scheduling
> > window is
> > > only calculated using direct predecessors and successors.
> > Because SMS is
> > not
> > > an iterative algorithm, this may cause failures in finding a valid
> > schedule.
> >
> > Agreed.
> >
> > > Without splitting rows, some simple loops just cannot be
> > scheduled not
> > matter
> > > how big the II is. With splitting rows, schedule can be
> > found, but only
> > at
> > > bigger II.
> >
> > It may happen that even splitting rows will not help, e.g. when we
> > repeatedly end up with a node having negative sched wind

Re: [PATCH] SMS - Pass the actual schedulable rows to compute_split_row

2009-03-13 Thread Ayal Zaks
Revital1 Eres/Haifa/IBM wrote on 13/03/2009 07:50:08:

> Hello,
>
> > > Using testsuite/gcc.dg/sms-6.c as an example and compiling it for
> > > PowerPC, node 18 (see attachment) is in an SCC and cannot be scheduled
> > > until splitting twice. The MII = 20 and the schedule can only be found
> > > at II = 24.
> >
> > Yes, I see. This example raises a couple of issues:
> >
> > o The first row split (from II=20 to II=21) is miscalculated; it should
> > be row 20 (= 0 mod II) instead of 19. Splitting row 19 cannot help
> > schedule node 18, and indeed we immediately split another row. We're now
> > checking a small patch to fix this, which should save one cycle of II in
> > the above example.
>
> Here is the patch, on behalf of Ayal.
> Passed bootstrap + regtest with SMS flags on ppc64 and bootstrap +
> regtest x86.
>
> I'll commit it later today to trunk if that's OK.
>

OK for trunk, when it's reopened for general patches (it's currently in
stage 4, allowing only regression fixes; this is a missed-optimization fix,
as splitting a row outside the window should not cause any problems except
for inefficiency). Please add a short comment for this short fix,
documenting the fact that the scheduling window is exclusive of 'end'
whereas compute_split_row() expects an inclusive, ordered range.

Bingfeng, you're welcome to try it out.

Thanks, Ayal.


> Thanks,
> Revital
>
> * modulo-sched.c (sms_schedule_by_order): Pass the actual
> schedulable rows to compute_split_row.
>
> [attachment "patch_sms_12_3.txt" deleted by Ayal Zaks/Haifa/IBM]



Re: Line insn notes in modulo-sched

2006-03-14 Thread Ayal Zaks
Steven Bosscher <[EMAIL PROTECTED]> wrote on 14/03/2006 01:32:09:

> Hi Ayal,
>
> The SMS implementation in GCC, in modulo-sched.c, uses line notes
> to find insn locations, see find_line_note.  Why are you using
> line notes instead of insn locators?  Line notes are on the list
> of Things That Should Not Be, and insn locators replace them.  Is
> there a reason for modulo-sched to rely on loop notes, or is this
> just an oversight?

The idea was to reorder/duplicate each insn together with its line note (or
rather, reorder/duplicate an interval of insns starting from
node->first_note until the node->insn itself), as compared to haifa-sched's
scheme of saving the notes, reordering the insns, then restoring the notes.
The line notes are not used to find insn locations -- we carry them along
because we had to. If we no longer need to worry about keeping each line
note adjacent to its insn during scheduling, that would simplify things.
Please advise.
Line notes are also used to dump SMS statistics, but for that any locator
will do.
Not sure about "loop notes"?

Ayal.

>
> Gr.
> Steven



Re: Fw: Status of SEE and Autovectorization patches?

2006-05-03 Thread Ayal Zaks
> Roger Sayle <[EMAIL PROTECTED]>
> Sent by: [EMAIL PROTECTED]
>
> 03/05/2006 05:03 PM
>
> To
>
> Mark Mitchell <[EMAIL PROTECTED]>
>
> cc
>
> Richard Henderson <[EMAIL PROTECTED]>, gcc mailing list,
> Leehod Baruch <[EMAIL PROTECTED]>, Mircea Namolaru/Haifa/[EMAIL PROTECTED],
> <[EMAIL PROTECTED]>
>
> Subject
>
> Re: Status of SEE and Autovectorization patches?
>
>
> Hi Mark,
>
> On Tue, 2 May 2006, Mark Mitchell wrote:
> > Roger, I know that you reviewed the SEE patches.  Is there anything
> > more than needs to be done to get them committed, in your view?
>
> As far as I'm aware, we're still just waiting for the Haifa folks to
> commit them to mainline.  There are a number of harmless code generation
> changes on some platforms, but as far as we know no known performance
> regressions.  And the Intel folks are keen to submit follow-up patches
> for x86_64 to extend/enhance this framework once it's in the tree.
> And importantly there's a single command-line option/variable that
> can enable/disable this pass should there be any adverse events.
>
> However, my growing concern at the moment is the response time of
> the Haifa folks.  If it's going to go into the tree in stage 3, I'd
> feel more comfortable that the patch submitters/contributors looked
> like they could respond to queries/issues with 24 or 48 hours.  The
> current prolonged silence/inaction doesn't inspire great confidence.
>
> Perhaps you could make the executive decision, that if it isn't in
> the tree by June (or some date of your choosing), that it'll just
> have to wait for 4.3?  My apologies for taking so long to review
> these changes, and I and the GCC maintainership should take some of
> the responsibility for the delays and current inconvenience.
>
>
> It looks from the e-mail addresses that possibly the primary developer
> of SEE, Leehod, is no longer at IBM but has returned to academia, with
> Mircea pinging the patches.  This could explain things, but is pure
> speculation.  Maybe someone could volunteer to update the posted patches
> for mainline and make the minor style corrections requested in the final
> review/approval?
>
> I've Cc'd Leehod and Mircea on this e-mail, so perhaps they'll be
> able to better explain the current status themselves.

Until they do (we're having our national Memorial-Independence ceremonies
now, ending tomorrow), I can update that a patch has been prepared and
tested, and is expected to be committed very shortly. Leehod is indeed the
primary developer of SEE and has left IBM to continue his studies (we
couldn't hold on to him until the patch was reviewed ;-), but was inspired
by and worked very closely on it with Mircea who is taking care of
following-up and submitting the patch. My apologies for our prolonged
response time, some of which is attributed to preparations for the upcoming
gcc summit and other matters that have been resolved by now.

Ayal.

>
> Roger
> --
>



Re: Fw: GCC 4.2 Status Report (2006-06-04)

2006-06-06 Thread Ayal Zaks

> This status report has been a long time coming, for which I apologize.
>
> As fwprop is no longer on the table for 4.2, and as the vectorizer
> improvements seem to have stalled due to a combination of lack of review
> and Dorit's leave,

That is unfortunate. Dorit did make a sincere effort to prepare her patches
long ago (mid February) well before leaving, and Victor has been constantly
pinging for reviews, ready to address them.

Ayal.


> I think it's time to declare 4.2 feature-complete.
>
> That leaves us looking at regressions.  There are lots; presently 56
> P1s.  About half of those are new in 4.2.  So, we're not ready to create
> a 4.2 branch.
>
> Therefore, we need to make the mainline open for regression-fixes only
> to force ourselves to attack the open bugs.  Please consider the
> mainline operating under release-branch rules as of 12:01 AM Wednesday,
> California time.  That will give everyone a few days to check in any
> in-progress bug-fixes that are not regressions.
>
> At this time, I don't think it makes sense to set a 4.2 target branch
> date.  We have to see how fast the bug-fixing goes.
>
> --
> Mark Mitchell
> CodeSourcery
> [EMAIL PROTECTED]
> (650) 331-3385 x713



Re: Fw: GCC 4.2 Status Report (2006-06-04)

2006-06-07 Thread Ayal Zaks
"Richard Guenther" <[EMAIL PROTECTED]> wrote on 06/06/2006
16:58:27:

> On 6/6/06, Ayal Zaks <[EMAIL PROTECTED]> wrote:
> >
> > > This status report has been a long time coming, for which I apologize.
> > >
> > > As fwprop is no longer on the table for 4.2, and as the vectorizer
> > > improvements seem to have stalled due to a combination of lack of
> > > review and Dorit's leave,
> >
> > That is unfortunate. Dorit did make a sincere effort to prepare her
> > patches long ago (mid February) well before leaving, and Victor has been
> > constantly pinging for reviews, ready to address them.
>
> 
> Maybe you can look at some of the regressions of the vectorizer first,
> instead of
> adding new features without first addressing regressions?
>
> This could build up some trust that the newly added code will actually
> be maintained
> in the future.
> 
>
> For a quick bugzilla search, use the 4.2 regressions link on the
> gcc.gnu.org page
> and modify it to include "-ftree-vectorize" in the bug description.  I
> count 3 P1 and
> 3 P2 regressions.
>
> Richard.


This is not entirely fair.

We regretfully have been neglecting PRs recently, partly because of Dorit's
leave, but in the past we have always addressed PRs and maintained our
code.  During the time that PRs weren't being addressed by us we also
haven't submitted any new features to mainline.  In fact, if you check,
you'll find that when the features in question were submitted (around
February), we also gave a lot of attention to PRs.  Having said that, we
should and are looking into current PRs; we definitely have a strong
interest in seeing them resolved.

Ayal.



Workshop on GCC for Research in Embedded and Parallel Systems

2007-06-26 Thread Ayal Zaks

Acronymed GREPS (... just what you were looking for) is to be held on
September 16 in Brasov, Romania, co-located with PACT. We'd like to bring
this workshop to your attention; the submission site is now open until July
24, after the upcoming GCC Developers' Summit. For more details see
http://sysrun.haifa.il.ibm.com/hrl/greps2007/

Thanks, Albert and Ayal.



Re: Does unrolling prevents doloop optimizations?

2007-06-30 Thread Ayal Zaks
Zdenek Dvorak <[EMAIL PROTECTED]> wrote on 30/06/2007 09:55:45:

> Hello,
>
> > By "this change" I mean just commenting out the check in
> > doloop_condition_get. After applying the patch that introduced DOLOOP
> > patterns for SPU
(http://gcc.gnu.org/ml/gcc-patches/2007-01/msg01470.html)
> > we needed this hack in order to be able to use the doloop_condition_get
to
> > return the register decremented by the branch instruction for any
unrolled
> > loop (The unroller changed originally GE loops to EQ ). Where can this
check
> > be required? Note that we did not touched the similar check in
> > doloop_modify. We tested this on our SPU branch and saw no regressions.
>
> hmmm I see now that modulo-sched.c:doloop_register_get uses
> doloop_condition_get, which is why this may affect something.  Anyway,
> changing doloop_condition_get is wrong.  Just teach modulo-sched to use
> number of iterations analysis from loop-iv.c instead of the
> doloop_register_get hack,
>

Here's my understanding.
When the doloop pass looks at a loop, its doloop_condition_get makes sure
that the closing branch is of a certain form that doloop_modify can handle
(with GE or NE condition only). Changing doloop_condition_get without a
similar change to doloop_modify is expected to trigger some assert in
doloop_modify, or specifically gcc_unreachable in case it sees an EQ
condition.

(A few passes later,) when the sms pass looks at a loop, it apparently sees
this loop closed by an EQ branch (whose target leaves the loop, with the
fall-through being a goto to the beginning of the loop?). This reversal is
presumably related to unrolled loops. Vladimir, it could help if you show
what the EQ closing branch and its target/fallthrough blocks in question
look like.

It doesn't seem that the number of iterations analysis from loop-iv.c deals
with EQ closing branches. One option is for sms to use
doloop_condition_get/loop-iv analysis in their current form, and if failed
check (on our own) for reversed doloop patterns. Another is to try and
reverse the closing branch before calling doloop_condition_get/loop-iv
analysis; any suitable branch reversal facility available?

Ayal.



Re: Does unrolling prevents doloop optimizations?

2007-06-30 Thread Ayal Zaks
Zdenek Dvorak <[EMAIL PROTECTED]> wrote on 30/06/2007 10:52:32:

> Hello,
>
> > It doesn't seem that the number of iterations analysis from loop-iv.c
> > deals with EQ closing branches.
>
> loop-iv works just fine for EQ closing branches.
>

Thanks for the clarification (I didn't see EQ in iv_number_of_iterations's
switch (cond)). In that case, sms should definitely try to use it.

Ayal.


> Zdenek
>
> > One option is for sms to use
> > doloop_condition_get/loop-iv analysis in their current form, and if
failed
> > check (on our own) for reversed doloop patterns. Another is to try and
> > reverse the closing branch before calling doloop_condition_get/loop-iv
> > analysis; any suitable branch reversal facility available?
> >
> > Ayal.



Re: Does unrolling prevents doloop optimizations?

2007-06-30 Thread Ayal Zaks
Zdenek Dvorak <[EMAIL PROTECTED]> wrote on 30/06/2007 20:36:12:

> Hello,
>
> > > > It doesn't seem that the number of iterations analysis from
> > > > loop-iv.c deals with EQ closing branches.
> > >
> > > loop-iv works just fine for EQ closing branches.
> > >
> >
> > Thanks for the clarification (I didn't see EQ in
> > iv_number_of_iterations's switch (cond)).
>
> that is because we canonicalize the condition before that to a NE
> condition.
>

Ah, I see now that check_simple_exit does this reversal right before
calling iv_number_of_iterations. This is the branch reversal facility I
asked about ("any suitable branch reversal facility available?"), so sms
should definitely try to use get_simple_loop_desc instead of
doloop_condition_get.

Thanks, Ayal.



Fw: Failing matrix tests

2007-07-14 Thread Ayal Zaks

> Is anyone else seeing failures on the gcc.dg/matrix tests?  I am getting
> the following failures on IA64 HP-UX:
>

Hope Razya can help spot the cause, after she returns from vacation later
this week.
Ayal.


> FAIL: gcc.dg/matrix/matrix-1.c scan-ipa-dump-times Flattened 3 dimensions 1
> FAIL: gcc.dg/matrix/matrix-2.c scan-ipa-dump-times Flattened 2 dimensions 1
> FAIL: gcc.dg/matrix/matrix-3.c scan-ipa-dump-times Flattened 2 dimensions 1
> FAIL: gcc.dg/matrix/matrix-6.c scan-ipa-dump-times Flattened 2 dimensions 1
> FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times Flattened 3 dimensions 1
> FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times Transposed 3
> FAIL: gcc.dg/matrix/transpose-2.c scan-ipa-dump-times Flattened 3 dimensions 1
> FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times Flattened 2 dimensions 1
> FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times Transposed 2
>
> On my HPPA platforms I also get:
>
> FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times Flattened 3 dimensions 1
> FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times Transposed 2
> FAIL: gcc.dg/matrix/transpose-5.c scan-ipa-dump-times Flattened 3 dimensions 1
> FAIL: gcc.dg/matrix/transpose-6.c scan-ipa-dump-times Flattened 3 dimensions 1
>
> The main difference between IA64 HP-UX and IA64 Linux (where I do not
> see these failures) is big-endian vs. little-endian so I was wondering
> if any other big-endian platforms in particular are seeing these
> failures?  When I look at the matrix-1.c.038i.matrix-reorg file output
> on HP-UX and Linux, the Linux one starts with some comments about
> putting symbols into SSA form and doing an incremental SSA update.  That
> is missing from the HP-UX version, HP-UX starts with:
>
>Matrix "vel"; Escaping Level: 1, Num Dims: 3, Malloc Dims: 1,
>mem_init ()
>{
>
> Linux (after the SSA comments) has:
>
>Matrix "vel"; Escaping Level: 3, Num Dims: 3, Malloc Dims: 3,
>Flattened 3 dimensions
>mem_init ()
>{
>
> Andrew, I cc'ed you because of this email
> , it mentions some
> matrix failures on MIPS when testing on the PTR-PLUS branch.  I don't
> actually know if my failures are related to PTR-PLUS or not.
>
> Steve Ellcey
> [EMAIL PROTECTED]



Re: Workshop on GCC for Research in Embedded and Parallel Systems

2007-08-01 Thread Ayal Zaks
==
  CALL FOR PAPERS - One Final Week Extension
  GREPS '07
Workshop on GCC for Research in Embedded and Parallel Systems
   Brasov, Romania, September 16, 2007
  http://sysrun.haifa.il.ibm.com/hrl/greps2007/
 in conjunction with
   PACT '07
 http://pactconf.org
==

We are honored to have two prominent keynote speakers:

Paul H J Kelly
  Imperial College, London
  Title: GCC in software performance research: just plug in

Benoit Dupont de Dinechin
  ST Microelectronics, Grenoble, France
  Title: GCC for Embedded VLIW Processors: Why Not?

A final extension of one week has been granted: submissions are due
August 7, 2007; they are to be 6-12 pages long, for review purposes.
The submission site is http://papers.haifa.il.ibm.com/greps2007/
For more details see http://sysrun.haifa.il.ibm.com/hrl/greps2007/

Early notification of the intention to participate (submit and/or attend)
would be helpful.

Important Dates
Papers due: August 7, 2007
Acceptance notices: August 13, 2007
Workshop date: September 16, 2007



Ayal Zaks/Haifa/IBM wrote on 26/06/2007 18:08:43:

> Acronymed GREPS (... just what you were looking for) is to be held on
> September 16 in Brasov, Romania, co-located with PACT. We'd like to bring
> this workshop to your attention; the submission site is now open until
> July 24, after the upcoming GCC Developers' Summit. For more details see
> http://sysrun.haifa.il.ibm.com/hrl/greps2007/
>
> Thanks, Albert and Ayal.



Re: [RFC] Improve Tree-SSA if-conversion - convergence of efforts

2007-08-01 Thread Ayal Zaks
"Daniel Berlin" <[EMAIL PROTECTED]> wrote on 01/08/2007 18:27:35:

> On 8/1/07, Tehila Meyzels <[EMAIL PROTECTED]> wrote:
> > "Daniel Berlin" <[EMAIL PROTECTED]> wrote on 31/07/2007 18:00:57:
> >
> > >
> > > I agree with you for conditional stores/loads.
> >
> > Great!
> >
> > >
> > > The unconditional store/load stuff, however, is exactly what
> > > tree-ssa-sink was meant to do, and belongs there (this is #3 above).
> > > I'm certainly going to fight tooth and nail against trying to
> > > shoehorn unconditional store sinking into if-conv.
> >
> > Sometimes, store-sinking can cause performance degradations.
> > One reason for that is increased register pressure, due to extending
> > the live ranges of registers.
> >
> > In addition, in case we have a store followed by a branch, the store
> > sinking result will be a branch followed by a store.
> > On some architectures, the former can be executed in parallel, as
> > opposed to the latter.
> > Thus, in this case, it is worth executing store-sinking only when it
> > helps the if-conversion to get rid of the branch.
> >
>
> > How do you suggest to solve this problem, in case store-sinking will be
> > part of the tree-sink pass?
> >
> Store sinking already *is* part of the tree-sink pass. It just only
> sinks a small number of stores.
> The solution to the problem that "sometimes you make things harder for
> the target" is to fix that in the backend.  In this case, the
> scheduler will take care of it.
>
> All of our middle end optimizations will sometimes have bad effects
> unless the backend fixes it up.  Trying to guess what is going to
> happen 55 passes down the line is a bad idea unless you happen to be a
> very good psychic.
>
> As a general rule of thumb, we are happy to make the backend as target
> specific, and ask as many target questions, as you like.  The middle
> end, not so much.  There are very few passes in the middle end that
> can/should/do ask anything about the target.  Store sinking is not one
> of them, and I see no good reason it should be.
>
> > Another point, what about (unconditional) load hoisting:
> > It's surely not related to sink pass, right?
> >
> PRE already will hoist unconditional loads out of loops, and in places
> where it will eliminate redundancy.
>
> It could also hoist loads in non-redundancy situations, it is simply
> the case that its current heuristic does not think this is a good
> idea.
>

Hoisting a non-redundant load speculatively above an if may indeed be a bad
idea, unless that if gets converted as a result (and possibly even then
...).  Are we in agreement then that unconditional load/store motion for
the sake of redundancy elimination continues to belong to PRE/tree-sink,
and that conditional load/store motion for the sake of conditional-branch
elimination better be coordinated by if-cvt?

Ayal.

> Thus, if you wanted to do unconditional load hoisting, the thing to do
> is to make a function like do_regular_insertion in tree-ssa-pre.c, and
> call it from insert_aux.
>
> We already have another heuristic for partially antic fully available
> expressions, see do_partial_partial_insertion



Re: [RFC][modulo-sched] Fix scheduling order within a cycle

2007-11-13 Thread Ayal Zaks
Revital1 Eres/Haifa/IBM wrote on 13/11/2007 15:13:25:

> Hello,
>
> Following our off-line discussion; attached is the dump file of the
> problematic testcase. (sms-4.c)
>
> Here are some relevant details:
>
> insn 61(write) and 58(read) were swapped, so sms tries to generate some
> kind of reg-move and fails.
>
> [CYCLE 4 ]: 61, 58,
>

Yes, this is wrong, clearly breaking the 58->(A,0,0)->61 dependence.


> insn 61 is the only already scheduled instruction which insn 58 depends
> on.
>
> The window of insn 58 is (4 .. -1), calculated as follows:
>
> Processing edge: [61 -(T,14,1)-> 58]
> Scheduling 8 (58) in psp_pss_not_empty, checking p 11 (61): pred st = 4;
> early_start = 0; latency: 14
> Processing edge: [58 -(A,0,0)-> 61]
> Scheduling 8 (58) in psp_pss_not_empty, checking s 11 (61): succ st = 4;
> late_start = 4; latency: 0
> Processing edge: [58 -(T,2,0)-> 59]
> Scheduling 8 (58) in psp_pss_not_empty, checking s 9 (59): the node is
> not scheduled
>
> Trying to schedule node 8 INSN = 58 in (4 .. -1) step -1
> Scheduled w/o split in 4
>
> insn 61 only must_precede insn 58 because the latency between 61->58 is
> 14, which causes the end boundary of the window to be 0.
>
> Thanks,
> Revital
>
> [attachment "test.c.172r.sms" deleted by Ayal Zaks/Haifa/IBM]

Indeed the latency between 61->58 is 14, causing the end boundary of the
window to be 0: insn 61 is scheduled in cycle 4 and II = 18, so
61->(T,14,1)->58 implies that insn 58 can be scheduled no earlier than
4 - 18 + 14 = 0. But it is not the case that insn 61 must_precede insn 58,
because insn 61 is scheduled in cycle 4, which is not equal to 0 (modulo 18).

However, insn 61 must_follow insn 58 when attempting to place insn 58 in
cycle 4.


When scheduling insn 58, we calculate a window of possible cycles according
to already scheduled predecessors and successors. This window looks like a
parallelogram in general rather than a rectangle: in the first cycle there
may be predecessors (already scheduled in the first cycle, or a multiple of
II cycles away) which must_precede insn 58 (having a tight dependence with
insn 58 if it is placed in the first cycle). So insn 58 can be placed in
'rightmost' slots of the first cycle only. Similarly, in the last cycle,
insn 58 might be placed in 'leftmost' slots only, due to successors which
must_follow insn 58. Inside internal cycles (strictly between first and
last cycles), insn 58 can be placed in any vacant slot.

Now if (as in the above case) an already scheduled insn 61 is both a
successor and a predecessor of insn 58, it may be that (not in the above
case) insn 61 must_precede insn 58 (when looking for available slots for
insn 58 in the first cycle) and must_follow insn 58 (when looking for
available slots in the last cycle).

Currently we apply the must_precede and must_follow restrictions to all
cycles of the window. This is overly conservative (i.e., it should not
produce the above wrong code!). One way to improve this is to split the window into
the first cycle (with must_precede), the internal part (with neither
must_precede nor must_follow), and the last cycle (with must_follow). And,
of course, if first cycle == last cycle, apply both must_precede and
must_follow for it.


Finally, note that in the above case we traverse the window backwards with
step -1, so 'start' is the last cycle 4, and 'end' is one past the first
cycle 0 (i.e. -1).

Ayal.



Re: [RFC][modulo-sched] Fix scheduling order within a cycle

2007-11-15 Thread Ayal Zaks
Revital1 Eres/Haifa/IBM wrote on 14/11/2007 18:46:14:

>
> >
> > When scheduling insn 58, we calculate a window of possible cycles
> > according to already scheduled predecessors and successors. This window
> > looks like a parallelogram in general rather than a rectangle: in the
> > first cycle there may be predecessors (already scheduled in the first
> > cycle, or a multiple of II cycles away) which must_precede insn 58
> > (having a tight dependence with insn 58 if it is placed in the first
> > cycle). So insn 58 can be placed in 'rightmost' slots of the first cycle
> > only. Similarly, in the last cycle, insn 58 might be placed in
> > 'leftmost' slots only, due to successors which must_follow insn 58.
> > Inside internal cycles (strictly between first and last cycles), insn 58
> > can be placed in any vacant slot.
> >
> > Now if (as in the above case) an already scheduled insn 61 is both a
> > successor and a predecessor of insn 58, it may be that (not in the above
> > case) insn 61 must_precede insn 58 (when looking for available slots for
> > insn 58 in the first cycle) and must_follow insn 58 (when looking for
> > available slots in the last cycle).
> >
> > Currently we apply the must_precede and must_follow restrictions to all
> > cycles of the window. This is overly conservative (i.e., it should not
> > produce the above wrong code!). One way to improve this is to split the
> > window into the first cycle (with must_precede), the internal part (with
> > neither must_precede nor must_follow), and the last cycle (with
> > must_follow). And, of course, if first cycle == last cycle, apply both
> > must_precede and must_follow for it.
> >
> > Finally, note that in the above case we traverse the window backwards
> > with step -1, so 'start' is the last cycle 4, and 'end' is one past the
> > first cycle 0 (i.e. -1).
>
> Thanks for the explanation!  Is it reasonable to conclude that for the
> must_follow/must_precede calculation we should not rely on start/end
> rows (as they are influenced by the direction of the window), but
> use a win_boundary_close_to_preds row and a win_boundary_close_to_succs
> row, calculated from the start and end rows depending on the direction
> of the window (if step is 1 then win_boundary_close_to_preds
> = start; if step is -1 then win_boundary_close_to_preds = end, etc.)?
> win_boundary_close_to_preds will be used only for the must_precede
> calculation and win_boundary_close_to_succs will be used only for
> must_follow, as you described above.
>

One way to do this is inside the

for (c = start; c != end; c += step)
  {
    verify_partial_schedule (ps, sched_nodes);

    psi = ps_add_node_check_conflicts (ps, u_node, c,
                                       must_precede,
                                       must_follow);
    ...

loop, set:

tmp_precede = NULL;  /* i.e., an empty "zero bitmap": no restriction  */
tmp_follow = NULL;
if (c == start)
  {
    if (step == 1)
      tmp_precede = must_precede;
    else /* step == -1.  */
      tmp_follow = must_follow;
  }
if (c == end - step)
  {
    if (step == 1)
      tmp_follow = must_follow;
    else /* step == -1.  */
      tmp_precede = must_precede;
  }

and then call
psi = ps_add_node_check_conflicts (ps, u_node, c,
                                   tmp_precede,
                                   tmp_follow);


Ayal.

> Thanks,
> Revital