LOOP_HEADER tree code?

2006-10-23 Thread Zdenek Dvorak
Hello,

for the project http://gcc.gnu.org/wiki/PreservingLoops, I am considering
introducing a tree LOOP_HEADER with a single argument N (the number of
iterations of the loop) that would be present in the IL at the beginning of
the header of each loop.  I have the following motivations:

1) One of the goals of the project is to avoid recomputing the number of
   iterations of loops, and to keep this information (analysing the # of
   iterations is fairly expensive, and we recompute this information
   quite a lot at the moment).  To keep the information valid, we need
   to prevent optimizations from destroying it (e.g., if the number
   is n_1 = n_2 - 1, and this is the last use of n_1, we do not want
   DCE to remove it); this is easy to achieve if n_1 were the
   argument of LOOP_HEADER.  Without this tree node in the IL, we would need
   to handle this specially in DCE, and possibly also in other optimizers.
2) This statement might make it simpler to ensure that various
   optimizers update the loops correctly.

I am not quite sure whether this is a good idea, though.  Are there any
reasons not to do this, or any other problems with the idea?

Zdenek


Re: LOOP_HEADER tree code?

2006-10-24 Thread Zdenek Dvorak
Hello,

> On 10/23/06, Zdenek Dvorak <[EMAIL PROTECTED]> wrote:
> >Hello,
> >
> >for project http://gcc.gnu.org/wiki/PreservingLoops, I am considering
> >introducing a tree LOOP_HEADER with single argument N (number of
> >iterations of the loop), that would be present in IL at the beginning of
> >header of each loop.  I have following motivations:
> >
> >1) One of the goals of the project is to avoid recomputing number of
> >   iterations of loops, and to keep this information (analysing # of
> >   iterations is fairly expensive, and we recompute this information
> >   quite a lot at the moment).  To keep the information valid, we need
> >   to prevent optimizations from destroying it (e.g., if the number
> >   is n_1 = n_2 - 1, and this is the last use of n_1, we do not want
> >   DCE to remove it); this is easy to achieve if n_1 would be the
> >   argument of LOOP_HEADER.  Without this tree node in IL, we would need
> >   to handle this specially in DCE, and possibly also other optimizers.
> >2) This statement might make it simpler to ensure that various
> >   optimizers update the loops correctly.
> >
> >I am not quite sure whether this is a good idea, though.  Are there some
> >reasons why not to do this, or any other problems with the idea?
> >
> 
> Why not keeping this information outside the IL

I am not sure what you mean by that.  IMHO, this is what I propose the
LOOP_HEADER tree node for.

> and making sure that code
> transformers are updating that information if they modify the symbols 
> involved
> in that representation?
> 
> The same problems will also occur when transmitting to the backend
> other informations attached to loop structures.  I'm not sure that
> we want to create a new node for data dependences as well...

for data dependences, you only need to transfer the dependence
vectors/distances, i.e., you do not need to worry about symbols.
So I do not see a reason for using a node in the IL for data dependences.

Zdenek


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> >> >   quite a lot at the moment).  To keep the information valid, we need
> >> >   to prevent optimizations from destroying it (e.g., if the number
> >> >   is n_1 = n_2 - 1, and this is the last use of n_1, we do not want
> >> >   DCE to remove it); this is easy to achieve if n_1 would be the
> >> >   argument of LOOP_HEADER.
> 
> You are proposing to complete the ssa representation such that
> foreach_ssa_uses also iterates over the niter information (a bit like vrp
> modifies the ssa chains with its extra assert information).  Wouldn't it
> be possible to not insert this niter information in the representation of 
> the
> program (ie. not inserting a new tree node) but just modify the ssa 
> iterators
> to also return the expressions that we store on the side?

yes, it would be possible; however, it would likely be quite error-prone.

> >> >   Without this tree node in IL, we would need
> >> >   to handle this specially in DCE, and possibly also other optimizers.
> 
> What are the other transformations that remove definitions that are not 
> used?

ivopts removes the bivs that it replaces, and I suspect that the vectorizer
does something similar as well.  I do not know about other optimizers.

Zdenek


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> >> You are proposing to complete the ssa representation such that
> >> foreach_ssa_uses also iterates over the niter information (a bit like vrp
> >> modifies the ssa chains with its extra assert information).  Wouldn't it
> >> be possible to not insert this niter information in the representation of
> >> the
> >> program (ie. not inserting a new tree node) but just modify the ssa
> >> iterators
> >> to also return the expressions that we store on the side?
> >
> >yes, it would be possible, however, possibly quite error-prone.
> >
> >> What are the other transformations that remove definitions that are not
> >> used?
> >
> >ivopts remove bivs that they replace, and I suspect that vectorizer
> >does something similar as well.  I do not know about other optimizers.
> >
> 
> The scev info has the same problem as the niter info, as we store it on
> the side, making it vulnerable to any code transform.  Now if we modify
> the var renaming, removal, ... aware of the information that we store
> on the side, we could avoid all these rather costly re-computations.

I did that once.  Updating scev info turned out to be more expensive than
recomputing it.

Zdenek


Maintainer(s) for loop optimizer(s)

2006-10-25 Thread Zdenek Dvorak
Hello,

at the moment, the RTL-level loop optimizers and most of the tree-level loop
optimizers do not have specific maintainers assigned.  I think this is
clearly becoming a significant problem -- many of the
loop-optimizer-related patches in the 4.2 timeframe remained unreviewed (the
vectorizer improvement changes, some of my patches for ivopts cleanup,
etc.).  In the 4.3 and 4.4 timeframe, there are several projects related to
loop optimizations (http://gcc.gnu.org/wiki/AutovectBranchOptimizations,
http://gcc.gnu.org/wiki/PredictiveCommoning,
http://gcc.gnu.org/wiki/AutomaticParallelization,
http://gcc.gnu.org/wiki/PrefetchingImprovements,
http://gcc.gnu.org/wiki/PreservingLoops, Graphite), that I fear might
face delays because of the same problem (see e.g. the (lack of) results
of my attempt to find reviewers for some of them,
http://gcc.gnu.org/ml/gcc/2006-09/msg00477.html).

And, of course, there are other situations where the lack of an
official maintainer is a problem (when looking for someone able to
fix a bug in one of the optimizers, or when someone has questions
regarding them).

At the GCC summit, I discussed this with several active loop optimization
developers (Daniel Berlin, Sebastian Pop), and we agreed that one
possible solution would be to have Daniel Berlin and myself
co-maintain the loop optimizers.  This was proposed to the steering
committee several months ago; however, since there has been no progress in
the meantime, I assume the proposal was rejected.

Therefore, I would like to ask the steering committee to propose an
alternative solution.  There does not seem to be any easy way to
address mail just to the steering committee, so I am sending this to
the gcc mailing list in the hope that it will reach the right recipients.

Zdenek


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> >for project http://gcc.gnu.org/wiki/PreservingLoops, I am considering
> >introducing a tree LOOP_HEADER with single argument N (number of
> >iterations of the loop), that would be present in IL at the beginning of
> >header of each loop.  I have following motivations:
> >
> >1) One of the goals of the project is to avoid recomputing number of
> >  iterations of loops, and to keep this information (analysing # of
> >  iterations is fairly expensive, and we recompute this information
> >  quite a lot at the moment).  To keep the information valid, we need
> >  to prevent optimizations from destroying it (e.g., if the number
> >  is n_1 = n_2 - 1, and this is the last use of n_1, we do not want
> >  DCE to remove it); this is easy to achieve if n_1 would be the
> >  argument of LOOP_HEADER.  Without this tree node in IL, we would need
> >  to handle this specially in DCE, and possibly also other optimizers.
> >2) This statement might make it simpler to ensure that various
> >  optimizers update the loops correctly.
> 
> However, various optimizer needs to know about this special tree node.

not really (not any more than they know about other tree codes that are
not interesting for them).

Zdenek

> Or am I missing something?
> 
> >
> >I am not quite sure whether this is a good idea, though.  Are there some
> >reasons why not to do this, or any other problems with the idea?
> >
> >Zdenek
> 
> 
> -
> Devang


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> > > for project http://gcc.gnu.org/wiki/PreservingLoops, I am considering
> > > introducing a tree LOOP_HEADER with single argument N (number of
> > > iterations of the loop), that would be present in IL at the beginning of
> > > header of each loop.  I have following motivations:
> > >
> > >1) One of the goals of the project is to avoid recomputing number of
> > >   iterations of loops, and to keep this information (analysing # of
> > >   iterations is fairly expensive, and we recompute this information
> > >   quite a lot at the moment).  To keep the information valid, we need
> > >   to prevent optimizations from destroying it (e.g., if the number
> > >   is n_1 = n_2 - 1, and this is the last use of n_1, we do not want
> > >   DCE to remove it); this is easy to achieve if n_1 would be the
> > >   argument of LOOP_HEADER.  Without this tree node in IL, we would need
> > >   to handle this specially in DCE, and possibly also other optimizers.
> > >2) This statement might make it simpler to ensure that various
> > >   optimizers update the loops correctly.
> > 
> > However, various optimizer needs to know about this special tree node.
> > Or am I missing something?
> Correct.  Take for example jump threading -- it can peel off a loop
> iteration which would invalidate the LOOP_HEADER node.

of course, if it does that, it must update the information attached to the
loop (regardless of whether we use the LOOP_HEADER node for it or not).

> That seems
> like a recipe for disaster.

No, that seems like a good way to detect that jump threading is
doing something it should not, or forgetting to update something in the
process.

> Code motion passes would have to be aware of this special node as well.
> And we've seen how well that works in the RTL backend with its loop
> notes -- it was a major PITA.

The problem was that loop notes were something special, different from
other RTL statements.  Of course, we can live without the LOOP_HEADER tree
node; however, that means more optimizers will have to be aware of the
information stored in the loop structures, which to me looks more like the
situation with the loop notes.

Zdenek

> I'd rather not repeat the mistakes we've already made.
> 
> Jeff
> 


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> >So, the passes that manipulate the loop structure need to know about
> >LOOP_HEADER and others do not need to worry about LOOP_HEADER.
> 
> More accurately, the passes that manipulate the cfg. Right now most of
> these passes don't even know they modify the loop structure.
> 
> >Now, focusing on the passes that manipulate the loop structure: are these
> >passes responsible for fixing loop info, or is it the responsibility of a
> >cleanup pass?
> 
> It seems to me that a cleanup pass would defeat the purpose of keeping
> loop info up to date. Your cleanup pass would probably end up just
> recomputing everything.
> 
> That said, I don't really see what a LOOP_HEADER node would give you
> that you can't get by making the cfg-modifying passes actually
> loop-aware, or perhaps by using cfghooks to update the loop
> information on the fly when a pass changes the CFG. It would be
> helpful if Zdenek could give an example where a LOOP_HEADER node is
> really the only way to help keep loop info accurate.

it definitely is not the only way, and given people's reactions,
I probably won't use it.  My main reason for considering the
tree node was the possibility of making the number of iterations
of the loop its operand, so that I would not need to worry about
keeping it alive through DCE, copy/constant propagation, etc. (without
a statement carrying it in the IL, I do not see a solution that would not
just be asking for bugs and accidental breakage).

Zdenek


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> On 10/25/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
> >
> >So you would mark n_1 with TREE_USED, and never let it be removed?
> >What would happen if e.g. the entire loop turns out to be dead code?
> >Or if the loop is rewritten (e.g. vectorized) in a way that changes
> >the number of iterations of the loop? Then the assignment to n_1 would
> >be _really_ dead, but there wouldn't be any way to tell.
> 
> When it is really dead, you'll have to remove LOOP_HEADER node
> anyway.

if the loop becomes unreachable, the LOOP_HEADER statement gets removed
like any other statement.

If there are changes to the structure of the loop (the loop is
cancelled by some optimizer, or the number of iterations changes), the
optimizer is required to update the loop structures, and the LOOP_HEADER
node is removed/updated in the process.  These updates need to be done in
very few places (basically only in the routines provided to manipulate the
loop structures).

Quite importantly, it is possible to verify the correctness of the code
(e.g. from verify_flow_info), and thus quickly catch any problems.

> Right ? Why not instead reset the TREE_USED bit and let DCE do
> its job ?

Many optimizers would need to be taught about TREE_USED (or any
other bit used for that purpose).  We do not have this type of
restriction for any other ssa names, so this would be brand-new
functionality (on the other hand, most optimizers would not need to know
anything special about LOOP_HEADER statements).

This is definitely not the way I would take if I decide not to use
LOOP_HEADER.  I would probably add some sort of "implicit" uses placed
on some basic blocks (or in loops).  It would be a bit more difficult
to teach the relevant optimizers about these uses, but at least
I would not be introducing a new brand of ssa names that behave differently
from all the other ssa names.

Zdenek


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> On Wed, 2006-10-25 at 13:01 -0700, Devang Patel wrote:
> > > > However, various optimizer needs to know about this special tree node.
> > >
> > > not really (not any more than they know about other tree codes that are
> > > not interesting for them).
> > 
> > If we take an example of Jump Threading pass then it needs to  know
> > about this tree node and update it properly.
> That could get awful ugly.

actually, that will be trivial once jump threading updates loops properly
(I have a patch for that).

> > So, the passes that maniuplate loop structure need to know about
> > LOOP_HEADER and others do not need to worry about LOOP_HEADER.
> Passes which do code motions may need to know about it -- they don't
> need to update its contents, but they may need to be careful about
> how statements are moved around in the presence of a LOOP_HEADER note.

Which is IMHO good, as it will at least ensure that they preserve loops,
in case that is really relevant for them.  Just now, I do not see how
code motion could affect loop structures (unless it changes the CFG).  Do
you have a concrete example of a transformation that might be made
more problematic by the existence of the LOOP_HEADER node?

Zdenek


Re: LOOP_HEADER tree code?

2006-10-25 Thread Zdenek Dvorak
Hello,

> On Thu, 2006-10-26 at 00:45 +0200, Zdenek Dvorak wrote:
> 
> > actually, that will be trivial once jump threading updates loop properly
> > (I have a patch for that).
> I'd like to see that.
> 
> I recall poking at having threading update things like loop exit
> points and gave up.  The problem is you could thread to a loop 
> exit, which in turn might move the loop exit edge to a new
> point in the CFG.  Furthermore, all the statements dominated
> by the new loop exit edge are no longer in the loop (when they
> previously were part of the loop).
> 
> It's possible these problems could be solved by querying the
> dominator structures, but IIRC they aren't accurate at this
> point.

What my patch does is keep the loops up to date.
To do that, I had to disable some of the threadings (mostly those
that made us create new subloops of the existing loops); on the other
hand, some threadings that were previously forbidden turned out to be
valid, and overall I think none of the important cases is affected.

I will update and submit the patch sometime later this week.

> Your updates may be simpler/easier since you're updating
> something different.
> 
> [ This was a couple years ago, so if I've got details wrong,
>   please forgive me. ] 
> 
> 
> > Which is IMHO good, as it will at least ensure that they preserve loops,
> > in case it is really relevant for them -- just now, I do not see how
> > code motion could affect loop structures (unless it changes CFG) -- do
> > you have some concrete example of transformation that might be made
> > more problematic by the existence of LOOP_HEADER node?
> It wasn't a matter of changing the loop structure -- the RTL loop
> optimizer had 3 or 4 notes IIRC and the location of each note
> in relation to code labels and other insns was important.  Code
> motion would sometimes place an insn between the note and a
> label.  That in turn caused the loop optimizer to barf.  There
> were also problems with edge splitting doing similar kinds of
> things.

The main source of these problems with the loop notes was that their
validity was never verified (except in the loop optimizer itself),
so you did not know you had done something wrong with them until
some problem surfaced in the loop optimizer.  This should not be
the case with LOOP_HEADER statements, as their validity would be
verified (in verify_flow_info or verify_loop_structures).

What made the problems with the loop notes worse was that loop discovery
itself was based on them.  This meant that it was quite easy to confuse
the loop optimizer if you changed the positions of the notes.
Since loop discovery is nowadays based purely on the CFG, this would not
be a problem.

For these reasons, I do not think that making the optimizers keep LOOP_HEADER
statements and loops in sync would be too difficult (at least not more
difficult than making them keep the CFG and loop structures in
sync).

Zdenek


Re: Induction variable optimization

2006-11-05 Thread Zdenek Dvorak
Hello,

> I am an M.E. Computer Science student doing a project on induction variable
> optimization.
> 
> Therefore I am reading the file tree-ssa-loop-ivopts.c of gcc-4.0.2 to learn
> what has been implemented in it.
>
> Is there any other way to find out what has been implemented so far in
> gcc-4.0.2?  Can I know the algorithm?

the overview of the algorithm is in the comment at the start of
tree-ssa-loop-ivopts.c.

Zdenek


Re: Zdenek Dvorak and Daniel Berlin appointed loop optimizer maintainers

2006-11-15 Thread Zdenek Dvorak
Hello,

>   I am pleased to announce that the GCC Steering Committee has
> appointed Zdenek Dvorak and Daniel Berlin as non-algorithmic maintainers
> of the RTL and Tree loop optimizer infrastructure in GCC.

thank you.  What exactly does "non-algorithmic" mean in this context?

Zdenek

>   Please join me in congratulating Zdenek and Daniel on their new
> role.  Zdenek and Daniel, please update your listings in the MAINTAINERS
> file.
> 
> Happy hacking!
> David


Re: Zdenek Dvorak and Daniel Berlin appointed loop optimizer maintainers

2006-11-16 Thread Zdenek Dvorak
Hello,

>   I am pleased to announce that the GCC Steering Committee has
> appointed Zdenek Dvorak and Daniel Berlin as non-algorithmic maintainers
> of the RTL and Tree loop optimizer infrastructure in GCC.
> 
>   Please join me in congratulating Zdenek and Daniel on their new
> role.  Zdenek and Daniel, please update your listings in the MAINTAINERS
> file.

done.  I am not sure whether it would be useful to somehow indicate
the non-algorithmic status in the MAINTAINERS file?

Zdenek


Re: Why does flow_loops_find modify the CFG, again?

2006-11-19 Thread Zdenek Dvorak
Hello,

> I'm running into some troubles with an if-conversion pass that runs
> after reload, where we have to avoid lifting insns across a loop
> exit edge into a loop.  ifcvt.c uses flow_loops_find to find loops
> and mark all loop exit edges:
> 
>   if ((! targetm.cannot_modify_jumps_p ())
>   && (!flag_reorder_blocks_and_partition || !no_new_pseudos
>   || !targetm.have_named_sections))
> {
>   struct loops loops;
> 
>   flow_loops_find (&loops);
>   mark_loop_exit_edges (&loops);
>   flow_loops_free (&loops);
>   free_dominance_info (CDI_DOMINATORS);
> }
> 
> I was wondering why we would sometimes *not* mark exit edges, but then
> I remembered that for some reason flow_loops_find modifies the CFG,
> which may lead to problems that we have to work around here.
> 
> But if we do not mark loop exit edges, we can sometimes end up doing
> unprofitable if-conversions!
> 
> It seems to me that a function called "flow_loops_find" is supposed to
> do *just* analysis, and not transformations.  Apparently it now first
> transforms all loops into some canonical form, but that is completely
> inappropriate and unnecessary for some users of this loops analysis.
> 
> Is this something that could be easily fixed?  E.g. can we make it
> that flow_loops_find only performs transformations if asked to (by
> adding a function argument for that)?

yes.  However, you then won't be able to use most of the functions for
working with loops.  mark_loop_exit_edges should work, though.

Zdenek


Re: GCC optimizes integer overflow: bug or feature?

2006-12-19 Thread Zdenek Dvorak
Hello,

> >Now, if your argument is that following the LIA-1 standard will prevent 
> >optimizations that could otherwise be made if one followed only the C 
> >standard, that's a reasonable argument, but it should not be couched as 
> >if it implies that preventing the optimizations would not be following 
> >the C standard.
> 
> I continue to suspect that the gains from this optimization are minimal
> to non-existent in practice.

the majority of loop optimizations work only under the assumption that
induction variables do not overflow.  This assumption can sometimes be
verified using quite expensive analyses of the loop code, or in the
majority of cases easily derived if the induction variables are either
signed or pointers (where the standard allows us to assume that the
arithmetic does not overflow).

Note that there was even a serious discussion about enabling
-funsafe-loop-optimizations by default at some optimization level, thus
making us assert that *no* induction variables (regardless of
signedness) overflow; which IIRC is what some other compilers do.

IMHO, a loop relying on the overflow behavior of an
induction variable (*) is an ugly hack, and whoever writes such code
does not deserve to have their program work.

Zdenek

(*) especially without bothering to verify what the language standard
says about that


Re: GCC optimizes integer overflow: bug or feature?

2006-12-20 Thread Zdenek Dvorak
Hello,

> > Paul Brook wrote:
> > >> Compiler can optimize it any way it wants,
> > >> as long as result is the same as unoptimized one.
> > > 
> > > We have an option for that. It's called -O0.
> > > 
> > > Pretty much all optimization will change the behavior of your program.
> > 
> > Now that's a bit TOO strong a statement, critical optimizations like
> > register allocation and instruction scheduling will generally not change
> > the behavior of the program (though the basic decision to put something
> > in a register will, and *surely* no one suggests avoiding this critical
> > optimization).
> 
> Actually they will with multi threaded program, since you can have a case
> where it works and now it is broken because one thread has speed up so much it
> writes to a variable which had a copy on another thread's stack.  So the 
> argument
> about it being too strong is wrong because timming matters now a days.  
> Instruction
> scheduling can cause the same issue as it forces a write too early for 
> another thread
> to act on.

actually, you do not even need (invalid) multithreaded programs to
realize that register allocation may change the behavior of a program.
If the size of the stack is bounded, register allocation may
cause or prevent a program from running out of stack, turning a
crashing program into a non-crashing one or vice versa.
Zdenek


Re: SSA_NAMES: should there be an unused, un-free limbo?

2006-12-21 Thread Zdenek Dvorak
Hello,

> > > > The situation is that some SSA_NAMEs are disused (removed from the
> > > > code) without being released onto the free list by
> > > > release_ssa_name().
> > > >
> > > Yes, it happens if a name is put into the set of names to be updated by
> > > update_ssa.
> > >
> > > After update_ssa, it should be true that every SSA name with no
> > > SSA_NAME_DEF_STMT is in the free list.
> > 
> > In this case this isn't true, because we have code that orphans ssa
> > names without releasing them.
> > I'm sure Robert will explain further details in a few moments :)
> True.  But remember, the stated purpose of the SSA_NAME recycling
> code was _not_ to track every SSA_NAME that went "dead" and recycle
> it, but instead to get the majority of them (and to ultimately save
> memory by recycling them).  Orphan SSA_NAME were always expected.
> 
> If we wanted to add a new policy that all dead SSA_NAMEs must be
> released to the recycling code, then that's fine, I'm not going
> to object.

I think this might be a good idea.  I think it requires
a lot of changes (basically going through all uses of bsi_remove
and remove_phi_node and checking them), but it would be cleaner
than the current situation.

It should, however, be easy to add verification to verify_ssa, so
this project should be fairly straightforward (I hope).

Zdenek


Re: SSA_NAMES: should there be an unused, un-free limbo?

2006-12-22 Thread Zdenek Dvorak
Hello,

> On Thu, 2006-12-21 at 20:18 +0100, Zdenek Dvorak wrote:
> 
> > I think this might be a good idea.  I think that this requires
> > a lot of changes (basically going through all uses of bsi_remove
> > and remove_phi_node and checking them), but it would be cleaner
> > than the current situation.
> Agreed.  Tedious work, but it shouldn't be terribly difficult (famous
> last words).

I will give it a try (i.e., if it can be done in one afternoon, I will
send a patch tomorrow :-).

Zdenek


Re: SSA_NAMES: should there be an unused, un-free limbo?

2006-12-24 Thread Zdenek Dvorak
Hello,

> > > I think this might be a good idea.  I think that this requires
> > > a lot of changes (basically going through all uses of bsi_remove
> > > and remove_phi_node and checking them), but it would be cleaner
> > > than the current situation.
> > Agreed.  Tedious work, but it shouldn't be terribly difficult (famous
> > last words).
> 
> I will give it a try (i.e., if it can be done in one afternoon, I will
> send a patch tomorrow :-).

here is a patch.  It passes bootstrap & regtesting except for one
testcase (g++.dg/opt/reg-stack2.C); I do not have time to work further
on it now.

As expected, more complications appeared than I anticipated.  The changes
to bsi_remove and release_defs would be basically sufficient for ssa
names of real operands; however, we are losing ssa names for virtual
operands everywhere, and in several places we rely on their not being
released.

One problem with the patch is that it appears to increase
compile times significantly (by some 4% when compiling preprocessed gcc
sources); I did not investigate where this slowdown comes from.
Zdenek

Index: tree-ssa-loop-im.c
===
*** tree-ssa-loop-im.c  (revision 120177)
--- tree-ssa-loop-im.c  (working copy)
*** free_mem_ref_locs (struct mem_ref_loc *m
*** 897,912 
  static void
  rewrite_mem_refs (tree tmp_var, struct mem_ref_loc *mem_refs)
  {
-   tree var;
-   ssa_op_iter iter;
- 
for (; mem_refs; mem_refs = mem_refs->next)
  {
!   FOR_EACH_SSA_TREE_OPERAND (var, mem_refs->stmt, iter, SSA_OP_ALL_VIRTUALS)
!   mark_sym_for_renaming (SSA_NAME_VAR (var));
! 
*mem_refs->ref = tmp_var;
!   update_stmt (mem_refs->stmt);
  }
  }
  
--- 897,907 
  static void
  rewrite_mem_refs (tree tmp_var, struct mem_ref_loc *mem_refs)
  {
for (; mem_refs; mem_refs = mem_refs->next)
  {
!   push_stmt_changes (&mem_refs->stmt);
*mem_refs->ref = tmp_var;
!   pop_stmt_changes (&mem_refs->stmt);
  }
  }
  
Index: tree-complex.c
===
*** tree-complex.c  (revision 120177)
--- tree-complex.c  (working copy)
*** expand_complex_move (block_stmt_iterator
*** 776,781 
--- 776,783 
x = build2_gimple (GIMPLE_MODIFY_STMT, x, r);
bsi_insert_before (bsi, x, BSI_SAME_STMT);
  
+   push_stmt_changes (&stmt);
+ 
if (stmt == bsi_stmt (*bsi))
{
  x = build1 (IMAGPART_EXPR, inner_type, unshare_expr (lhs));
*** expand_complex_move (block_stmt_iterator
*** 794,800 
}
  
update_all_vops (stmt);
!   update_stmt (stmt);
  }
  }
  
--- 796,802 
}
  
update_all_vops (stmt);
!   pop_stmt_changes (&stmt);
  }
  }
  
Index: tree-ssa-loop-manip.c
===
*** tree-ssa-loop-manip.c   (revision 120177)
--- tree-ssa-loop-manip.c   (working copy)
*** verify_loop_closed_ssa (void)
*** 423,429 
if (current_loops == NULL)
  return;
  
!   verify_ssa (false);
  
FOR_EACH_BB (bb)
  {
--- 423,429 
if (current_loops == NULL)
  return;
  
!   verify_ssa (false, false);
  
FOR_EACH_BB (bb)
  {
Index: tree-tailcall.c
===
*** tree-tailcall.c (revision 120177)
--- tree-tailcall.c (working copy)
*** eliminate_tail_call (struct tailcall *t)
*** 752,758 
break;
  
bsi_remove (&bsi, true);
-   release_defs (t);
  }
  
/* Number of executions of function has reduced by the tailcall.  */
--- 752,757 
*** eliminate_tail_call (struct tailcall *t)
*** 798,804 
  }
  
bsi_remove (&t->call_bsi, true);
-   release_defs (call);
  }
  
  /* Add phi nodes for the virtual operands defined in the function to the
--- 797,802 
Index: tree-ssa-dse.c
===
*** tree-ssa-dse.c  (revision 120177)
--- tree-ssa-dse.c  (working copy)
*** dse_optimize_stmt (struct dom_walk_data 
*** 608,617 
  
  /* Remove the dead store.  */
  bsi_remove (&bsi, true);
- 
- /* And release any SSA_NAMEs set in this statement back to the
-SSA_NAME manager.  */
- release_defs (stmt);
}
  
record_voperand_set (dse_gd->stores, &bd->stores, ann->uid);
--- 608,613 
Index: tree-ssa-alias.c
===
*** tree-ssa-alias.c(revision 120177)
--- tree-ssa-alias.c(working copy)
*** compute_may_aliases (void)
*** 943,953 
{
  block_stmt_iterator bsi;
  basic_block bb;
  FOR_EACH_BB (bb)
{
  for (bsi = bsi_start (bb); !bsi_end_p (bsi); bsi_next (&bsi))
{
! upda

Re: changing "configure" to default to "gcc -g -O2 -fwrapv ..."

2006-12-31 Thread Zdenek Dvorak
Hello,

> >I have been looking into infer_loop_bounds_from_signedness() called
> >from infer_loop_bounds_from_undefined().
> >At some places, nowrap_type_p() is used but this function operates
> >only on types, so there will be too many false positive there; yet we
> >will miss warning through that case.
> 
> I don't know that area too well, but I think we are already issuing a
> warning if we use -funsafe-loop-optimizations, so it might be possible
> to do the same if we use signed wrapping undefinedness.

not quite; -funsafe-loop-optimizations is not used in the scev analysis
itself, only in the number-of-iterations analysis.  At some extra
compile-time cost, you can check whether undefinedness of signed
overflow is used anywhere in the scev analysis, but there is no way to
avoid duplicate warnings, and even making the positioning and contents
of the warnings sensible would be nontrivial.

Zdenek


Re: GCC trunk revision 120704 failed to build spec cpu2k/gcc

2007-01-12 Thread Zdenek Dvorak
Hello,

> > GCC trunk revision 120704 failed to build SPEC cpu2000/gcc on -O1 and 
> > higher optimization level at x86_64-redhat-linux.
> > 
> > reload1.c: In function 'reload':
> > reload1.c:449: error: address taken, but ADDRESSABLE bit not set
> > bad_spill_regs
> > 
> > reload1.c:449: error: address taken, but ADDRESSABLE bit not set
> > bad_spill_regs
> > 
> > reload1.c:449: internal compiler error: verify_stmts failed
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > See <http://gcc.gnu.org/bugs.html> for instructions.
> > 
> > Does anybody see the same?

I can reproduce this on i686.  I will check what the problem is,

Zdenek


Re: GCC trunk revision 120704 failed to build spec cpu2k/gcc

2007-01-12 Thread Zdenek Dvorak
Hello,

> > > GCC trunk revision 120704 failed to build SPEC cpu2000/gcc on -O1 and 
> > > higher optimization level at x86_64-redhat-linux.
> > > 
> > > reload1.c: In function 'reload':
> > > reload1.c:449: error: address taken, but ADDRESSABLE bit not set
> > > bad_spill_regs
> > > 
> > > reload1.c:449: error: address taken, but ADDRESSABLE bit not set
> > > bad_spill_regs
> > > 
> > > reload1.c:449: internal compiler error: verify_stmts failed
> > > Please submit a full bug report,
> > > with preprocessed source if appropriate.
> > > See <http://gcc.gnu.org/bugs.html> for instructions.
> > > 
> > > Does anybody see the same?
> 
> I can reproduce this on i686.  I will check what the problem is,

this is probably due to some of the recent IPA changes.
ipa-reference.c:analyze_function does not traverse phi nodes,
and thus misses that the address of bad_spill_regs is taken inside the
phi node (and clears the addressable flag from it, causing an ICE
later).  I am working on the patch.

Zdenek


Cheatsheet for building gcc

2007-01-15 Thread Zdenek Dvorak
Hello,

with the changes to the build system over the last few months, I am
having trouble doing some of the things I am used to doing during
development.  I found more or less satisfactory procedures for most of
them; however, I would like to know the "official" way to achieve these
effects (and to put the resulting "cheatsheet" on the wiki, so that
people do not have to reinvent the wheel).  All the solutions should
assume gcc configured without any special options (except for
--enable-languages and --target):

Example:

Q1) How to bootstrap the compiler, including all libraries?
A1) "make" from the top directory

Could some of the build system experts please provide the answers
for the following questions?

Q2) How to bootstrap the compiler, but not build any time consuming
libraries (libjava, libstdc++)?

Q3) How to compile just the stage1 compiler + the basic runtime
(libgcc, libmudflap, libgomp)?

Q4) How to compile all the libraries (libjava, ...) using the stage1
compiler?

Q5) How to clean up everything (including the stage1 compiler, but
excluding makefiles, so that I can build everything from scratch,
but configure does not have to be run)?

Q6) How to clean up everything except for the stage1 compiler and
its runtime (libgcc, libmudflap, libgomp)?

And of course, how to do any other useful things that I have forgotten
about...

Zdenek


Re: Cheatsheet for building gcc

2007-01-16 Thread Zdenek Dvorak
Hello,

thanks a lot for the answers!

> >Q4) How to compile all the libraries (libjava, ...) using the stage1
> >compiler?
> 
> A4) Configure with --enable-stage1-languages=all

Is there a way to do this without reconfiguring gcc?  This question
aims at the following situation: I have already built the stage1
compiler with a patch, and I have learned from some external source
that the compiler ICEs during the compilation of libjava.  Since it
is highly unlikely that bootstrapping the compiler would matter,
I would like to be able to just start building libjava using the stage1
compiler.  Reconfiguring gcc and building it from scratch takes
some time, so I would like to avoid that.

> >Q6) How to clean up everything except for the stage1 compiler and
> >its runtime (libgcc, libmudflap, libgomp)?
> 
> make restrap (which will also start a bootstrap)

Would it be possible to add a way to just clean things up, without
restarting the bootstrap (i.e., to move everything to the state I
would have after "make stage1-bubble" if I had started from scratch)?

Zdenek

> >And of course, how to do any other useful things that I have forgotten
> >about...
> 
> I would like to ask if, in your opinion, libssp, libmudflap and libgomp 
> should be bootstrapped like libgcc is.  I would think that, for 
> automatic parallelization, you'd need at least libgomp.
> 
> In this case, the answer to Q3 would be simply "make stage1-bubble".
> 
> Paolo


Re: Cheatsheet for building gcc

2007-01-16 Thread Zdenek Dvorak
Hello,

> >Is there a way to do this without reconfiguring gcc?
> 
> No.  Do you think it would be best to have --enable-stage1-languages=all 
> in non-release branches?  The time taken to compile the stage1 
> (nonoptimized) non-C front-ends is small, and the advantage seems 
> significant.

if you configure with  --enable-stage1-languages=all,

1) will all the libraries be built three times during bootstrap, or just
   once?
2) is it still possible to just build the stage1 compiler without the
   libraries?

> >>>Q6) How to clean up everything except for the stage1 compiler and
> >>>   its runtime (libgcc, libmudflap, libgomp)?
> >>make restrap (which will also start a bootstrap)
> >
> >Would it be possible to add a way to just clean things up, without
> >restarting the bootstrap (i.e., to move everything to the state I
> >would have after "make stage1-bubble" if I had started from scratch)?
> 
> Actually, I was wrong as "make restrap" will also delete the runtime. 
> The closest things to what you want are:
> - "make distclean-stage2"; however it will leave there all the target 
> libraries if they have already been compiled, not only the runtime.
> - "make distclean-stage2 clean-target"; which will remove also the runtime.

thanks, "make distclean-stage2 clean-target" is exactly what I need
(although I specified something different :-)

Zdenek



Re: 2007 GCC Developers Summit

2007-01-29 Thread Zdenek Dvorak
Hello,

> >> One idea I've always pondered is to have brief (perhaps 1-2 hr)
> >> tutorials, given by people in their area of expertise, as a means for
> >> other developers to come up to speed on a topic that interests them.  Is
> >> this something that appeals to others?
> >>
> >Sounds good to me.  For instance, the new java front end, a description
> >of the new build system, etc.
> 
> Although it's not something new,
> I'd be interested in a tutorial on loop optimizations (IV opt and all 
> related).
> Based on my understanding, the topic might be of interest
> to some other people as well.

if sufficiently many people are interested, I (hopefully with the help
of other loop optimizer developers) can prepare something.

Zdenek


Re: Disabling bootstrap

2007-02-02 Thread Zdenek Dvorak
Hello,

> I've changed gcc by adding a new pass, however, when I compile gcc,
> during compilation it calls itself, so I disabled bootstrap but that
> is still happening even during bootstrap. Is there any way to compile
> gcc without the final gcc compiling something?

make stage1-bubble.  See also
http://gcc.gnu.org/wiki/Top-Level_Bootstrap

> Moreover, how can I add a flag to gcc so that it enables/disables a given 
> pass?

Add the flag to common.opt, and test it in the gate function of your
pass.
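
For concreteness, a minimal sketch of what that looks like (the option
name ftree-my-pass and the variable flag_tree_my_pass are hypothetical
placeholders; the exact pass-manager details depend on the GCC sources
you are working against, so treat this as an illustration rather than
the exact API):

```c
/* common.opt fragment (hypothetical option name):

     ftree-my-pass
     Common Report Var(flag_tree_my_pass)
     Enable the hypothetical my-pass transformation.

   The option machinery generates the flag_tree_my_pass variable, which
   the gate function of the pass then tests; the pass manager skips the
   pass whenever its gate returns false.  */

static bool
gate_my_pass (void)
{
  return flag_tree_my_pass != 0;
}
```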

Zdenek


Re: Scheduling an early complete loop unrolling pass?

2007-02-05 Thread Zdenek Dvorak
Hello,

> >As we also only vectorize innermost loops I believe doing a
> >complete unrolling pass early will help in general (I pushed
> >for this some time ago).
> >
> >Thoughts?
> 
> It might also hurt, though, since we don't have a basic block 
> vectorizer.  IIUC the vectorizer is able to turn
> 
>   for (i = 0; i < 4; i++)
> v[i] = 0.0;
> 
> into
> 
>   *(vector double *)v = (vector double){0.0, 0.0, 0.0, 0.0};

I intentionally put the cunroll pass after the vectorizer to enable this.
I guess we might instead choose to unroll the loop, and leave creation of
the vector operations to some kind of later combine/straight-line code
vectorization pass.

Zdenek


Re: Scheduling an early complete loop unrolling pass?

2007-02-06 Thread Zdenek Dvorak
Hello,

> Ira Rosen/Haifa/IBM wrote on 06/02/2007 11:49:17:
> 
> > Dorit Nuzman/Haifa/IBM wrote on 05/02/2007 21:13:40:
> >
> > > Richard Guenther <[EMAIL PROTECTED]> wrote on 05/02/2007 17:59:00:
> > >
> ...
> > >
> > > That's going to change once this project goes in: "(3.2) Straight-
> > > line code vectorization" from http://gcc.gnu.
> > > org/wiki/AutovectBranchOptimizations. In fact, I think in autovect-
> > > branch, if you unroll the above loop it should get vectorized
> > > already. Ira - is that really the case?
> >
> > The completely unrolled loop will not get vectorized because the
> > code will not be inside any loop (and our SLP implementation will
> > focus, at least as a first step, on loops).
> 
> Ah, right... I wonder if we can keep the loop structure in place, even
> after completely unrolling the loop  - I mean the 'struct loop' in
> 'current_loops' (not the actual CFG), so that the "SLP in loops" would have
> a chance to at least consider vectorizing this "loop". Zdenek - what do you
> say?

I do not think this is a good idea -- making the structures inconsistent
just to "fix" a pass that can easily be fixed in another way.

Zdenek

> thanks,
> dorit
> 
> > The following will get vectorized (without permutation on autovect
> > branch, and with redundant permutation on mainline):
> >
> > for (i = 0; i < n; i++)
> >   {
> > v[4*i] = 0.0;
> > v[4*i + 1] = 0.0;
> > v[4*i + 2] = 0.0;
> > v[4*i + 3] = 0.0;
> >   }
> >
> > The original completely unrolled loop will get vectorized if it is
> > encapsulated in an outer-loop, like so:
> >
> > for (j=0; j<n; j++)
> >   {
> >  for (i = 0; i < 4; i++)
> >  v[i] = 0.0;
> >  v += 4;
> >   }
> >
> > Ira


Re: RFC: vectorizer cost model

2007-02-16 Thread Zdenek Dvorak
Hello,

> "Linthicum, Tony" <[EMAIL PROTECTED]> writes:
> 
> >   * Would using a tree-level API like estimate_num_insns be superior
> > to using a simple cost of 1 per statment?
> 
> For this sort of calculation, I would guess not.  I would imagine that
> on most processors the cost of a single vector instruction is
> comparable to the cost of the same operation on scalar operands.  So
> simply counting instructions should give you a good indication of
> whether vectorization is profitable, even though each instruction will
> of course have a different cost.
> 
> If this turns out not to be the case, then I think the next step would
> be to have a target defined multiplier for vector instructions,
> indicating that for that target vector instructions were more or less
> expensive than scalar instructions according to some ratio.
> 
> I doubt that using the full fledged cost infrastructure would give you
> better results than that in practice.

on the other hand, using it does not cost you anything, and would avoid
the risk of creating yet more code duplication.

> >   * What is the best way to access target level cost information?
> 
> I'm sure you know that the loop code does this by generating RTL and
> asking for the cost of it (computation_cost in tree-ssa-loop-ivopts.c).

Which should be removed (I have some patches for that, but I somehow
forgot about them because of other issues).

Zdenek


Re: Preserving alias analysis information

2007-02-19 Thread Zdenek Dvorak
Hello,

> For instance, let's consider the following struct definition (taken from
> gcc.dg/struct-alias-1.c):
> 
> struct S {
>int a[3];
>int x;
> };
> 
> This is the original code in GIMPLE pseudo-C dump representation:
> 
>   s.x = 0;
>   i.0 = i;
>   s.a[i.0] = 1;
>   D.1416 = s.x;
>   if (D.1416 != 0) goto ; else goto ;
> :;
>   link_error ();
> 
> This is the code after the CLI-specific array simplification:
> 
>   s.x = 0;
>   i.0 = i;
>   cilsimp.1 = &s.a[0];
>   D.1423 = i.0 * 4;
>   D.1424 = D.1423 + cilsimp.1;
>   *D.1424 = 1;
>   D.1416 = s.x;
>   if (D.1416 != 0) goto ; else goto ;
> :;
>   link_error ();
> 
> In the original code, GCC alias analysis understands that accesses to 
> s.x and s.a do not alias; therefore, it understands that the "then" 
> condition of the if statement is never taken.
> In the modified code, the alias analysis conclude that access to s.x and 
> pointer D.1424 may alias.
> My question is: is this unavoidable because of the memory access pattern 
> in the modified code, or was there additional information the 
> transformation pass could have attached to D.1424 (or to something else) 
> that would have have excluded such a memory alias?

you might try turning the references into TARGET_MEM_REFs, and copying
the alias information to them using copy_ref_info.  I am not sure how
that would interact with the transformations you want to do, but we do a
lot of magic to keep the virtual operands for TARGET_MEM_REFs the same
as before the transformation (unless that got broken in the last few
months, which unfortunately is pretty likely).

Zdenek


Re: Preserving alias analysis information

2007-02-19 Thread Zdenek Dvorak
Hello,

> >you might try turning the references into TARGET_MEM_REFs, and copying
> >the alias information to them using copy_ref_info.  I am not sure how
> >that would interact with the transformations you want to do, but we do a
> >lot of magic to keep the virtual operands for TARGET_MEM_REFs the same
> >as before the transformation (unless that got broken in the last few
> >months, which unfortunately is pretty likely).
> 
> It would be better to annotate things with better alias information
> than transform into target specific trees, which none of the other
> transformations actually know how to deal with.

well, I had the impression that what he is doing is some target-specific
transformation, so this idea did not seem all that out of place to me.

Zdenek


Re: Preserving alias analysis information

2007-02-20 Thread Zdenek Dvorak
Hello,

> >>>you might try turning the references into TARGET_MEM_REFs, and copying
> >>>the alias information to them using copy_ref_info.  I am not sure how
> >>>that would interact with the transformations you want to do, but we do a
> >>>lot of magic to keep the virtual operands for TARGET_MEM_REFs the same
> >>>as before the transformation (unless that got broken in the last few
> >>>months, which unfortunately is pretty likely).
> >>It would be better to annotate things with better alias information
> >>than transform into target specific trees, which none of the other
> >>transformations actually know how to deal with.
> >
> >well, I had the impression that what he is doing is some target-specific
> >transformation, so this idea did not seem all that out of place to me.
> 
> Thanks Daniel and Zdenek for your suggestions!
> 
> In principle, I don't see anything forbidding Zdenek's idea.
> Unfortunately, what I avoided to mention is that TARGET_MEM_REF nodes 
> are also transformed into pointer arithmetics
> and the equivalent 
> INDIRECT_REF memory access... therefore, this is not an option even only 
> because of that.

hmm... why do you do that?  Could you please describe more precisely what
you are trying to achieve?

> The first time this CLI-specific transformation is performed is before 
> GIMPLE code enters SSA form

This looks like the wrong place; I guess later optimizations will in
general try to revert the trees to their original form (at the moment,
we do not have a tree-combine pass, but if we had one, it definitely
would).  IMHO, it would make more sense to do this kind of
target-specific transformation as late as possible.

Anyway, if that is the case, using TMRs is not a good idea.

Zdenek


Re: Preserving alias analysis information

2007-02-20 Thread Zdenek Dvorak
Hello,

> >>In principle, I don't see anything forbidding Zdenek's idea.
> >>Unfortunately, what I avoided to mention is that TARGET_MEM_REF nodes 
> >>are also transformed into pointer arithmetics
> >>and the equivalent 
> >>INDIRECT_REF memory access... therefore, this is not an option even only 
> >>because of that.
> >
> >hmm... why do you do that?  Could you please describe more precisely what
> >you are trying to achieve?
> 
> Sure!
> The short answer is that, though most GIMPLE tree codes closely match 
> what is representable in the CIL bytecode, some do not; hence, such 
> codes are "lowered" into equivalent expressions that directly match what 
> is representable in the bytecode.

this seems quite close to what TARGET_MEM_REFs were designed for.  IMHO,
the best way would be to lower the memory references to TARGET_MEM_REFs
(*) just once, somewhere quite late in the optimization pipeline (just
after the loop optimizations, for example), so that the high-level
optimizers and alias analysis see easy-to-understand code, while at
least the essential cleanups are still performed on the lower-level code.

Zdenek

(*) or just the pointer arithmetic like you do now, if you have some
reason for avoiding TMRs, although then you have to rerun the pass once
more later to get rid of invalid forms that may be created by the
optimizers.

> The long answer is in "CIL simplification pass" section at 
> http://gcc.gnu.org/projects/cli.html :-)
> 
> >>The first time this CLI-specific transformation is performed is before 
> >>GIMPLE code enters SSA form
> >
> >This looks like the wrong place; I guess later optimizations will in
> >general try to revert the trees to their original form (at the moment,
> >we do not have a tree-combine pass, but if we had one, it definitely
> >would).  IMHO, it would make more sense to do this kind of
> >target-specific transformation as late as possible.
> 
> Yes, this is precisely the reason behind running this transformation 
> pass twice.
> The first pass (before SSA) simplifies GIMPLE expressions not atomically 
> representable in the bytecode, in the hope that GCC middle-end passes 
> optimize them further.
> The second pass makes sure to re-simplify expressions that may have been 
> reverted to the original form. In practice this happens quite 
> rarely; still, it's a possibility that must be taken into account.
> The first pass is optional, the second is strictly required.
> 
> My experiments also show that the bytecode generated by running the CIL 
> simplification twice as opposed to running the final pass only is 
> significantly better. For multimedia codecs with heavy array accesses 
> (which, by the way, is the kind of code ST is particularly interested 
> in), the recorded difference in performance is up to 40%.
> 
> >Anyway, if that is the case, using TMRs is not a good idea.
> >
> >Zdenek
> 
> Cheers,
> Roberto


Re: Preserving alias analysis information

2007-02-20 Thread Zdenek Dvorak
Hello,

> >this seems quite close to what TARGET_MEM_REFs were designed for.  IMHO,
> >the best way would be to lower the memory references to TARGET_MEM_REFs
> >(*) just once, somewhere quite late in the optimization pipeline (just
> >after the loop optimizations, for example), so that the high-level
> >optimizers and alias analysis see easy-to-understand code, while at
> >least the essential cleanups are still performed on the lower-level code.
> 
> but the only kind of TARGET_MEM_REFs we would really allow would be 
> those with a base and no symbol, no index, no step, no offset... which 
> look to me like INDIRECT_REFs!

well, if that accurately describes the capabilities of the target, yes.
This way, you will prevent the optimizers from recombining the addresses
into more complex ones again.

Zdenek


Re: GCC 4.3 Stage 1 project survey

2007-02-26 Thread Zdenek Dvorak
Hello,

> I've been trying to keep the GCC_4.3_Release_Planning wiki page up to
> date, and I'd like to be sure I haven't missed anything going in.  I've
> moved several projects to the Completed section, and if I've done
> anything in error there, please correct it.
>
> So here's a survey of what's left:
>
> * PredictiveCommoning
> 
>   I can't tell for sure, as Zdenek commits so many patches ;-)  It seems
>   that this one is not merged yet.

the final version of the patch is at
http://gcc.gnu.org/ml/gcc-patches/2007-02/msg01385.html
waiting for review.

Zdenek


Re: Valid gimple for MEM_REF

2007-03-03 Thread Zdenek Dvorak
Hello,

> I noticed that gcc.dg/tree-ssa/loop-19.c was failing on both
> powerpc-linux-gnu and powerpc64-linux-gnu:
> FAIL: gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times MEM.(base: &|symbol: 
> )a, 2
> FAIL: gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times MEM.(base: &|symbol: 
> )c, 2
> 
> The reason why they are failing is because we produce:
> MEM[base: (double *) &c, index: ivtmp.34] = MEM[base: (double *) &a,
> index: ivtmp.34];
> Which does not match the regex as there is a cast there.
> Now the real question comes down to, is the following valid gimple
> that IV-OPTS produces:
> MEM[base: (double *) &a, index: ivtmp.34_12];
> 
> base is now a non gimple invariant but instead is a full expression.
> If we decide to do any other optimizations with MEM_REF, we might run
> into more of these issues?
> 
> So what are the constraints on MEM_REF's base argument, is it a simple
> expression (SSA_name or invariant) or can it be a complex expression?

only gimple_vals (name or invariant).  However, the expressions are
matched in final_cleanup dump (after out-of-ssa and ter), so this no
longer is the case.  I think just the regular expressions need to be
updated.

Zdenek


Re: Valid gimple for MEM_REF

2007-03-04 Thread Zdenek Dvorak
Hello,

> > only gimple_vals (name or invariant).  However, the expressions are
> >matched in final_cleanup dump (after out-of-ssa and ter), so this no
> >longer is the case.  I think just the regular expressions need to be
> >updated.
> 
> Then IV-OPTs has an issue too but IV-OPTs dump gives:
>  D.1604_5 = MEM[base: (double *) &a, index: ivtmp.34_12];
>  MEM[base: (double *) &c, index: ivtmp.34_12] = D.1604_5;
> 
> the expression matching in final_cleanup was just a symptom of the issue.

OK, I will have a look at that.

Zdenek


Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

at the moment, any pass that needs to process memory references is
complicated (or restricted to handling just a limited set of cases) by
the need to interpret the quite complex representation of memory
references that we have in gimple.  For example, there are about 1000
lines of quite convoluted code in tree-data-ref.c and about 500 lines in
tree-ssa-loop-ivopts.c dealing with parsing and analysing memory
references.

I would like to propose (and possibly also work on implementing) using a
more uniform representation, described in more details below.  The
proposal is based on the previous discussions
(http://gcc.gnu.org/ml/gcc/2006-06/msg00295.html) and on what I learned
about the way memory references are represented in ICC.
It also subsumes the patches of Daniel to make p[i] (where p is pointer)
use ARRAY_REFs/MEM_REFs.

I am not sure whether someone is not already working on something
similar, e.g. with respect to LTO (where the mentioned discussion
started), or on the gimple-tuples branch?

Proposal:

For each memory reference, we remember the following information:

-- base of the reference
-- constant offset
-- vector of indices
-- type of the accessed location
-- original tree of the memory reference (or another summary of the
  structure of the access, for aliasing purposes)
-- flags

for each index, we remember
-- lower and upper bound
-- step
-- value of the index

The address of the reference can be computed as

base + offset + sum_{idx} offsetof(idx)

where offsetof(idx) = (value - lower) * step

For example, a.x[i][j].y  would be represented as
base = &a
offset = offsetof (x) + offsetof (y);
indices:
  lower = 0 upper = ? step = sizeof (a.x[i]) value = i
  lower = 0 upper = ? step = sizeof (a.x[i][j]) value = j

p->x would be represented as
base = p;
offset = offsetof (x);
indices: empty

p[i] as
base = p;
offset = 0
indices: 
  lower = 0 upper = ? step = sizeof (p[i]) value = i

Remarks:
-- it would be guaranteed that the indices of each memory reference are
   independent, i.e., that &ref[idx1][idx2] == &ref[idx1'][idx2'] only
   if idx1 == idx1' and idx2 == idx2'; this is important for dependency
   analysis (and for this reason we also need to remember the list of
   indices, rather than trying to reconstruct them from the address).
-- it would be guaranteed that the computations of the address do not
   overflow.
-- possibly there could be a canonicalization pass that, given

   for (p = q; p < q + 100; p++)
 p->x = ...  {represented the way described above}

   would transform it to 

   for (p = q, i = 0; p < q + 100; p++, i++)
 {base = q, offset = offsetof(x), indices: lower = 0 upper = ? step = 
sizeof (*p) value = i}

   so that dependence analysis and friends do not have to distinguish
   between accesses through pointers and arrays

Zdenek



Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> This looks like a very complicated (though very generic) way of
> specifying a memory
> reference.  Last time we discussed this I proposed to just have BASE, OFFSET
> and accessed TYPE (and an alias tag of the memory reference).  I realize 
> this
> doesn't cover accesses to multi-dimensional arrays, but using the
> full-blown scheme
> in place of a simple INDIRECT_REF makes memory usage for the trees go up
> noticable.

as was pointed out before, you cannot avoid remembering the indices of
the memory reference in some way if you want to be able to do dependency
analysis.  Also, many high-level optimizations work with
multidimensional arrays, so the representation should make this
possible.

Regarding the memory consumption, let us set aside the part of the
information kept for alias analysis (for which the proposal, for
simplicity, preserves the current representation; that can be changed).
Then, the proposed representation is still cheaper than the current one
(COMPONENT_REFs are collapsed into a single offset field; only the
necessary information is recorded for the indices, whereas now we have
redundant information stored in ARRAY_REFs).

For simple indirect_refs, there are no indices, so the overhead of
the proposal over the base+offset one is basically one pointer.

Zdenek


Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> This looks like a very complicated (though very generic) way of
> specifying a memory
> reference.  Last time we discussed this I proposed to just have BASE, OFFSET
> and accessed TYPE (and an alias tag of the memory reference).  I realize 
> this
> doesn't cover accesses to multi-dimensional arrays, but using the
> full-blown scheme
> in place of a simple INDIRECT_REF makes memory usage for the trees go up
> noticable.

by the way, does your claim that "using the full-blown scheme... makes
memory usage go up noticeably" mean that you have experimented with some
similar representation?  If so, could you please post your
patches/results?  Or do you at least have some data on how large a part
of the tree memory consumption comes from memory references?

Zdenek


Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> >> This looks like a very complicated (though very generic) way of
> >> specifying a memory
> >> reference.  Last time we discussed this I proposed to just have BASE, 
> >OFFSET
> >> and accessed TYPE (and an alias tag of the memory reference).  I realize
> >> this
> >> doesn't cover accesses to multi-dimensional arrays, but using the
> >> full-blown scheme
> >> in place of a simple INDIRECT_REF makes memory usage for the trees go up
> >> noticable.
> >
> >by the way, does your claim that "using the full-blown scheme... makes
> >memory usage go up noticeably" mean that you have experimented with some
> >similar representation?  If so, could you please post your
> >patches/results?  Or do you at least have some data on how large a part
> >of the tree memory consumption comes from memory references?
> 
> No, I didn't do patches or measurements.  I just extrapolated that if you 
> want
> to handle things like a.x[i].y.z[j] you need to store not only two indices 
> but
> either two strides or two types (to extract the stride).  So you'd
> have (assuming
> a flat tree) for example two VECs in the memory reference tree.  Together
> with the (possibly non-constant) offset and the base these are four pointers
> compared to one in a simple INDIRECT_REF.

to make this comparison fair, you also need to take into account that
the computations need to be done somewhere.  So for the "simple
INDIRECT_REF", you in fact have the following statements in the program
(assuming that the lower bound of each array is 0; otherwise you would
also have the subtractions):

off1 = step1 * i;
off2 = step2 * j;
off = off1 + off2;
addr = base + off;
*addr

Given all the temporary variables, expression and assignment nodes
needed for this, this more than makes up for any extra space that is
used in the proposed representation.  Of course, in some situations some
of these computations may be shared.

For simple INDIRECT_REF (without any structure), we store base, plus we
have fields for offset (constant 0) and the vector of indices (NULL).
So there are two extra pointers over the current state.  Without
experimentation, I cannot tell whether overall, my proposal would require more 
or less
memory than we use now.

In fact, if this turned out to be a problem, I could live with also
allowing INDIRECT_REF in its current form, to represent references that
do not have any structure.

> Maybe we can incrementally create out of all the existing memory reference
> two reference trees, one simple that does not handle multi-dimensional array
> accesses well and one full-blown one?

Would that not just mean that you would consume even more memory?  If
you mean incremental lowering, then yes, I can imagine we might want to
do that (possibly lowering the references later to the existing
TARGET_MEM_REFs), but it does not really help with the memory
consumption.

Zdenek


Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> >Proposal:
> >
> >For each memory reference, we remember the following information:
> >
> >-- base of the reference
> >-- constant offset
> >-- vector of indices
> >-- type of the accessed location
> >-- original tree of the memory reference (or another summary of the
> >  structure of the access, for aliasing purposes)
> This i don't think we should keep, because i don't believe it's
> necessary for aliasing.  At worst, aliasing could come up with it's
> own summary if it really wanted to.

long term, I totally agree with you.  Short term, I think even
implementing the proposal in its current form will be a lot of work; I
am somewhat afraid of also having to make significant modifications to
alias analysis as part of it (of course, if someone familiar with tree
alias analysis -- i.e., you or Diego -- volunteered to help me with this
part of the change, it would help a lot).

> >-- flags
> >
> >for each index, we remeber
> >-- lower and upper bound
> >-- step
> >-- value of the index
> 
> This seems like a lot; however, since most of it can be derived from the
> types, why are we also keeping it in the references?

The lower bound and step may be SSA_NAMEs, see the current ARRAY_REF,
and as such, they need to be stored somewhere in IR (not just in type).

> That is, unless we could share most of the  index struct (upper,
> lower, step)  among expressions that access them (IE make index be
> immutable, and require unsharing and resharing if you want to modify
> the expression).

That appears a bit dangerous to me, I would rather avoid such tricks.

Zdenek


Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> >> >-- flags
> >> >
> >> >for each index, we remember
> >> >-- lower and upper bound
> >> >-- step
> >> >-- value of the index
> >>
> >> This seems like a lot; however, since most of it can be derived from the
> >> types, why are we also keeping it in the references?
> >
> >The lower bound and step may be SSA_NAMEs, see the current ARRAY_REF,
> >and as such, they need to be stored somewhere in IR (not just in type).
> >
> >> That is, unless we could share most of the  index struct (upper,
> >> lower, step)  among expressions that access them (IE make index be
> >> immutable, and require unsharing and resharing if you want to modify
> >> the expression).
> >
> >That appears a bit dangerous to me, I would rather avoid such tricks.
> 
> Like Richard, I have doubts you are not going to increase memory if
> you store *this* much in every single memory access.

for structured accesses (ARRAY_REFs etc.) we currently use much more memory --
4 pointers, plus the overhead of the common tree fields.  My proposal will
save some memory here.

On unstructured accesses (INDIRECT_REFs), the proposal would indeed
require three extra pointers.  I now think keeping INDIRECT_REFs might
turn out to be necessary, to represent unstructured memory accesses.
Having to deal with INDIRECT_REFs should not complicate the optimizers
significantly, so it would not defeat the purpose of the patch.

All in all, I believe it will need some experimentation, but we should
not consume more memory than we do now, and possibly we could even save
somewhat.

Zdenek


Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> > -- base of the reference
> > -- constant offset
> > -- vector of indices
> > -- type of the accessed location
> > -- original tree of the memory reference (or another summary of the
> >   structure of the access, for aliasing purposes)
> > -- flags
> 
> What do you do with Ada COMPONENT_REFs, at a variable offset?

for concreteness, let us consider a.x{offset: n}

I have considered three possibilities
1) making the offset part of the expression for base, i.e., having
   tmp = &a + n
   and reference with base: tmp
2) allowing the offset in the reference to be non-constant, i.e., having
   reference with
   base: &a, offset: n
3) making this offset into an index, i.e, having
   base: &a, indices: (step:1, value: n)

Out of these, I like 3) the most, although it might be fairly expensive
memory-wise (any idea how common variable offsets are?)

Zdenek


Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> >at the moment, any pass that needs to process memory references are
> >complicated (or restricted to handling just a limited set of cases) by
> >the need to interpret the quite complex representation of memory
> >references that we have in gimple.  For example, there are about 1000 of
> >lines of quite convoluted code in tree-data-ref.c and about 500 lines in
> >tree-ssa-loop-ivopts.c dealing with parsing and analysing memory
> >references.
> 
> I have my doubts changing this is the correct step forward.  Having a
> high level representation is a good thing and having it represented as
> *(base+offset) (plus other info) might just make the other passes that
> like the high level representation get worse.  I also don't see why
> tree-data-ref.c and tree-ssa-loop-ivopts.c could not use
> get_inner_reference which parses the memory references for you.

for many reasons.  Probably the most important one is that you could not
recover the indices of multidimensional arrays this way, making
dependency analysis impossible.  Also, with pointer arithmetic, it is
not sufficient just to call get_inner_reference; you also need to
traverse SSA chains.  Finally, in both time and memory, calling
get_inner_reference is fairly expensive.

Note that the representation I propose is not really low-level.  It
carefully preserves all the information you could get from the
"high-level" representation (indices and their ranges, alias
information).  We already have low-level representation of memory
references (TARGET_MEM_REFs), I do not feel any need to make an
alternative proposal for that at the moment.

> Maybe
> I don't see the benefit in always changing our IR without really
> thinking about the problem and seeing if there are already tools
> (functions) which do the same thing in a common place.

Sorry, but you are completely off the mark here.  I have spent a lot of time
working with the code for dealing with memory references, trying many
different approaches.  Yes, I know about get_inner_reference,
I use it where appropriate, and for many reasons (e.g., those
mentioned above), it does not do the job in general.

Also, I did not just pull the proposal out of thin air.  I have read
previous discussions regarding the topic, and I took the opportunity
to ask how the memory references are represented in ICC (which turns
out to be quite similar to my proposal).

I assume that you are trying to help to get us towards some solution,
however please note that some people could find remarks like this
offensive.

Zdenek


Re: Proposal: changing representation of memory references

2007-04-04 Thread Zdenek Dvorak
Hello,

> >> >> That is, unless we could share most of the  index struct (upper,
> >> >> lower, step)  among expressions that access them (IE make index be
> >> >> immutable, and require unsharing and resharing if you want to modify
> >> >> the expression).
> >> >
> >> >That appears a bit dangerous to me, I would rather avoid such tricks.
> >>
> >> Like Richard, I have doubts you are not going to increase memory if
> >> you store *this* much in every single memory access.
> >
> >for structured accesses (ARRAY_REFs etc.) we now need much more memory --
> >4 pointers, plus overhead of the common tree fields.  My proposal will
> >save some memory here.
> 
> I'm not sure how, since you have more than 4 pointers worth of info in:
> 
> -- base of the reference -> pointer
> -- constant offset -> HOST_WIDE_INT
> -- vector of indices -> embedded vec
> -- type of the accessed location -> Pointer
> -- original tree of the memory reference (or another summary of the
> structure of the access, for aliasing purposes) -> pointer
> -- flags -> No idea
> 
> for each index, we remember
> -- lower and upper bound -> pointers
> -- step -> pointer
> -- value of the index -> pointer
> 
> What have i misunderstood?

consider a[i][j], on 32 bit machine:

Currently, this is two ARRAY_REFs, 40 bytes each, for a total of 80 bytes.
With my proposal, this would be 24 bytes (the items contained in the
base of tree_exp: tree_common, including type+flags; locus, and block),
base, offset, summary, vector (16 bytes), in the vector size and bound
(8 bytes), plus 2 indices (16 bytes each), for a total of 80 bytes.

I.e., we get the same memory consumption for references with at least
two levels.  For each additional index or component ref, the proposed
representation saves 40 - 16 = 24 bytes.

There are some possible improvements:
-- if we used tail array for the indices instead of vector, we could
   save some 8 bytes
-- assuming that the case that the bounds or step of the array are
   varying is infrequent, we can change the representation to require
   -- 8 bytes/index if both bounds and the step are constants,
   -- ~32 bytes otherwise
   by referring to the appropriate array type in the former case
   and a TREE_VEC containing the three values in the latter case,
   or some similar representation

Zdenek


Re: Proposal: changing representation of memory references

2007-04-05 Thread Zdenek Dvorak
Hello,

> >Remarks:
> >-- it would be guaranteed that the indices of each memory reference are
> >   independent, i.e., that &ref[idx1][idx2] == &ref[idx1'][idx2'] only
> >   if idx1 == idx1' and idx2 == idx2'; this is important for dependency
> >   analysis (and for this reason we also need to remember the list of
> >   indices, rather than trying to reconstruct them from the address).
> 
> I didn't completely think through this,
> but how would this property help in the presence of array flattening ?
> e.g. suppose two adjacent loops, both two-nest-deep,
> and only one of them is flattened, then
> one loop will have one dimensional access (hence will have only one index)
> vs the other loop will have two dimensional.
> In other words, &ref[idx1] == &ref[idx1'][idx2'] can happen.

this cannot happen.  Suppose you have a declaration

int ref[100][100];

Then the first reference would be represented as
tmp = (int (*)[10000]) &ref;
tmp[idx1]

It cannot have ref as its base.

Similarly, if one wanted to implement the reverse transformation,
recovering multidimensional arrays from the flat representation,
he would have to transform

int a[10000];
for (i = 0; i < 100; i++)
  for (j =  0; j < 100; j++)
a[100 * i + j] = ...

to

tmp = (int (*)[100][100]) &a;
for (i = 0; i < 100; i++)
  for (j =  0; j < 100; j++)
ref with base: tmp, indices: (i,j) = ...

> Another question is, how would the representation look
> for more complicated address calculations
> e.g. a closed hashtable access like:
> 
> table[hash_value % hash_size]
> 
> and would it help in such cases ?

I am not quite sure what you are asking for here; this would just be
represented as

idx = hash_value % hash_size;

and reference with base: table, indices: (idx)

Zdenek


Re: adding dependence from prefetch to load

2007-04-12 Thread Zdenek Dvorak
Hello,

> 2. Right now I am inserting a __builtin_prefetch(...) call immediately 
> before the actual read, getting something like:
>  D.1117_12 = &A[D.1101_14];
>  __builtin_prefetch (D.1117_12, 0, 1);
>  D.1102_16 = A[D.1101_14];
> 
> However, if I enable the instruction scheduler pass, it doesn't realize 
> there's a dependency between the prefetch and the load, and it actually 
> moves the prefetch after the load, rendering it useless. How can I 
> instruct the scheduler of this dependence?
> 
> My thinking is to also specify a latency for prefetch, so that the 
> scheduler will hopefully place the prefetch somewhere earlier in the 
> code to partially hide this latency. Do you see anything wrong with this 
> approach?

well, it assumes that the scheduler works with a long enough lookahead to
actually be able to move the prefetch far enough; i.e., if the
architecture you work with is relatively slow in comparison with the
memory access times, this might be a feasible approach.  However, on
modern machines, a miss in the L2 cache may take hundreds of cycles, and it is
not clear to me that the scheduler will be able to move the prefetch so far,
or indeed, that it would even be possible (I think often you do not
know the address far enough in advance).  Also, prefetching outside of
loops in general appears not to be all that profitable, since usually
most of the time is spent within loops.

So I would recommend first doing some analysis and measurements (say by
introducing the prefetches by hand) to check whether this project really
has potential to lead to significant speedups.

Zdenek


Re: adding dependence from prefetch to load

2007-04-12 Thread Zdenek Dvorak
Hello,

> Well, the target architecture is actually quite peculiar, it's a 
> parallel SPMD machine. The only similarity with MIPS is the ISA. The 
> latency I'm trying to hide is somewhere around 24 cycles, but because it 
> is a parallel machine, up to 1024 threads have to stall for 24 cycles in 
> the absence of prefetching, which affects overall performance.
> My initial studies show that this latency can be hidden with a properly 
> inserted prefetch instruction, and I think that the scheduler can help 
> with that, if properly guided.
> 
> So my initial question remains: is there any way to tell the scheduler 
> not to place the prefetch instruction after the actual read?
> 
> The prefetch instruction takes an address_operand, and it seems all I 
> need to do is tell the scheduler prefetch will "write" to that address, 
> so it will see a true dependence between the prefetch and the read.

this might also restrict the possibility to move the prefetch, since it
would prevent it from being moved over any other memory operations that
alias with the prefetched one.

Unfortunately, I do not really understand the internals of the gcc
scheduler, so I cannot give you more concrete help; but hopefully
someone else will.

Zdenek

> But 
> I don't know how to do that, and changing the md file to say  "+p" or 
> "+d" for the first operand of the prefetch didn't help.


Re: GCC mini-summit - compiling for a particular architecture

2007-04-20 Thread Zdenek Dvorak
Hello,

> Steve Ellcey wrote:
> 
> >This seems unfortunate.  I was hoping I might be able to turn on loop
> >unrolling for IA64 at -O2 to improve performance.  I have only started
> >looking into this idea but it seems to help performance quite a bit,
> >though it is also increasing size quite a bit too so it may need some
> >modification of the unrolling parameters to make it practical.
> 
> To me it is obvious that optimizations are target dependent. For
> instance loop unrolling is really a totally different optimization
> on the ia64 as a result of the rotating registers.

that we do not use.  Nevertheless, there are still compelling reasons
why unrolling is more useful on ia64 than on other architectures
(the importance of scheduling, insensitivity to code size growth).

Another option would be to consider enabling (e.g.) -funroll-loops
-fprefetch-loop-arrays by default on -O3.  I think it is fairly rare
for these flags to cause performance regressions (although of course
more measurements to support this claim would be necessary).

Zdenek


Re: GCC mini-summit - compiling for a particular architecture

2007-04-22 Thread Zdenek Dvorak
> Look from what we're starting:
> 
> <<
> @item -funroll-loops
> @opindex funroll-loops
> Unroll loops whose number of iterations can be determined at compile
> time or upon entry to the loop.  @option{-funroll-loops} implies
> @option{-frerun-cse-after-loop}.  This option makes code larger,
> and may or may not make it run faster.
> 
> @item -funroll-all-loops
> @opindex funroll-all-loops
> Unroll all loops, even if their number of iterations is uncertain when
> the loop is entered.  This usually makes programs run more slowly.
> @option{-funroll-all-loops} implies the same options as
> @option{-funroll-loops},
> >>
> 
> It could gain a few more paragraphs written by knowledgeable people.
> And expanding documentation doesn't introduce regressions :).

but it also does not make anyone actually use the options.  Nobody reads
the documentation.  Of course, this is a bit of an overstatement, but with a
few exceptions, people in general do not enable non-default flags.

Zdenek


Re: GCC mini-summit - compiling for a particular architecture

2007-04-22 Thread Zdenek Dvorak
Hello,

> On Sun, 2007-04-22 at 14:44 +0200, Richard Guenther wrote:
> > On 4/22/07, Laurent GUERBY <[EMAIL PROTECTED]> wrote:
> > > > > but also does not make anyone actually use the options.  Nobody reads
> > > > > the documention.  Of course, this is a bit overstatement, but with a
> > > > > few exceptions, people in general do not enable non-default flags.
> > > >
> > > > I don't think this is fair.
> > > > Most people don't read the docs because they don't care about
> > > > performance, but most people who develop code that spends a lot of CPU
> > > > cycles actually read the docs at least up to loop unrolling.
> > >
> > > Exactly my experience.
> > >
> > > Unfortunately there's no useful information on this topic in the GCC
> > > manual...
> > 
> > Well, we have too many switches really.  So the default is use -O2.  If you
> > want extra speed, try -O3, or even better use profile feedback.  (Not many
> > people realize that with profile feedback you get faster code than with
> > -O3 and smaller code than with -Os - at least for C++ programs)
> 
> At work we use -O3 since it gives 5% performance gain against -O2.
> profile-feedback has many flags

actually, only two that are really important -- -fprofile-generate
and -fprofile-use.

Zdenek


Re: GCC 4.2.0 Status Report (2007-04-24)

2007-04-25 Thread Zdenek Dvorak
Hello,

> 4. PR 31360: Missed optimization
> 
> I don't generally mark missed optimization bugs as P1, but not hoisting
> loads of zero out of a 4-instruction loop is bad.  Zdenek has fixed this
> on mainline.  Andrew says that patch has a bug.  So, what's the story here?

I found the problem, I will send a patch once it passes regtesting.

Zdenek


Re: A problem with the loop structure

2007-04-28 Thread Zdenek Dvorak
Hello,

> (based on gcc 4.1.1).

now that is a problem; things have changed a lot since then, so I am not
sure how much I will be able to help.

> 1. The problem was unveiled by compiling a testcase with dump turned
> on. The compilation failed while calling function get_loop_body from
> flow_loop_dump on the following assert :
> 
>  else if (loop->latch != loop->header)
>{
>  tv = dfs_enumerate_from (loop->latch, 1, glb_enum_p,
>   tovisit + 1, loop->num_nodes - 1,
>   loop->header) + 1;
> 
> 
>  gcc_assert (tv == loop->num_nodes);
> 
> The compilation exits successfully if compiled without enabling the dump.

this means that there is a problem in some loop transformation, which
forgot to record the membership of some blocks in their loops, or something
like that.

> 2. SMS pass contained a single call to loop_version on the loop to be
> SMSed. This happened before any SMS related stuff was done. Trying to
> call verify_loop_structure(loops) just after the call to loop_version
> failed on the same assert in get_loop_body as in (1). The loop on
> which we fail is neither the versioned loop nor the new loop.

Probably it is their superloop?

> Below
> there are dumps to verify_loop_structure called from different places
> in loop_version:

These dumps are not very useful, loop structures do not have to be
consistent in the middle of the transformation.

> 3.  At the very beginning of the SMS pass we build the loop structure
> using build_loops_structure defined in modulo-sched.c. Just after the
>
> call I tried to print in gdb the loop on which we failed in
> get_loop_body. This failed as well
>
> (gdb)  p print_loop(dumpfile, 0xbabe20, 0)

Do not use print_loop, that only works on gimple.  Use flow_loops_dump
to find which blocks belong to which loops, and possibly debug_bb_n
if you need to see the contents of the blocks.

Zdenek


Re: GCC 4.1 Projects

2005-02-27 Thread Zdenek Dvorak
Hello,

> >Although you have listed it as "stage 2", I wish to commit the finished
> >portion as soon as possible during stage 1.  I have maintainership 
> >authority
> >to do so.  This will not interfere in any way with *any* of the projects
> >approved for stage 1, since it is in a disjoint section of code.  
> 
> If it breaks bootstrap, it will definitely interfere.  If it causes 
> patch conflicts with other changes it will also interfere.  And if it 
> doesn't cause any patch conflicts, then it probably won't be very hard 
> to maintain on a branch.
> 
> > Accordingly, I plan to do so unless I am told not to.
> 
> I would certainly prefer that you hold off until Stage 2, as indicated 
> by the documented I posted.

I must admit I have a very bad feeling about the whole "4.1 Projects"
stuff.  IMHO this over-organizes things.  If people in general disagree
with Nathan's changes, or if there are any reasons to think that
they are not tested enough or whatever when he submits them, of course
that is something else.  But I don't think having just a single person
decide which patches may go in and which must wait, or even just judge
their importance, is a good idea.

Argument "If it breaks bootstrap, ..." -- well, if it is a large or
intrusive patch, it is natural to expect it to be tested even more than
the rules request, including bootstrap on several platforms.  Some
breakage during Stage 1 is probably unavoidable.  I did not notice any
significant problems due to this since I started working on gcc (2001),
however.

Zdenek


Re: GCC 4.1 Projects

2005-02-27 Thread Zdenek Dvorak
Hello,

> >>>>> Zdenek Dvorak writes:
> 
> Zdenek> I must admit I have very bad feeling about the whole "4.1 Projects"
> Zdenek> stuff.  IMHO this over-organizes things.  If people in general 
> disagree
> Zdenek> with the Nathan's changes, or if there are any reasons to think that
> Zdenek> they are not tested enough or whatever when he submits them, of course
> Zdenek> that is something else.  But I don't think having just a single person
> Zdenek> decide which patches may go in and which must wait, or even just 
> judging
> Zdenek> their importance, is a good idea.
> 
>   Given the number of branches and major changes queued up for GCC
> 4.1, we need to coordinate the way that they are merged to try to avoid
> too much breakage and disruption.  Mark is the GCC Release Manager and he
> has the authority to try to herd the cats and manage this process.

at the beginning of stage 1, there is always a lot of major changes
queued up.  It never led to an unmanageable amount of "breakage and
disruption".

>   I would appreciate if you and everyone else who have disagreements
> with individual decisions would stop turning these into complaints about
> the entire process.

This was not my intention.  It may well be that Nathan's case is the
only one where the plan causes significant problems to anyone.  I simply
do not like the idea of someone centrally planning the future of GCC,
and I would complain even if things worked perfectly.

> There is no perfect solution.  I'm sure that we all would be happy to
> consider constructive suggestions, but complaining about the abstract
> concept of someone else having control or making decisions is not
> helpful.

As for constructive suggestions -- why not just forget the whole plan
and let things work out the way they always did?  As you say,
there is no perfect solution, so why complicate people's lives by
trying to get one?

Zdenek


Re: GCC 4.1 Projects

2005-02-27 Thread Zdenek Dvorak
Hello,

> (This time around, the there seemed to be consensus that all proposals 
> be made public upon submission.  Since that seems to be the consensus, 
> I'll implement that in the next release cycle.)

yes, this would definitely make sense.

Zdenek


Re: GCC 4.1 Projects

2005-02-27 Thread Zdenek Dvorak
Hello,

> >at the beginning of stage 1, there is always a lot of major changes
> >queued up.  It never led to an unmanageable amount of "breakage and
> >disruption".
> 
> Other people have a different perception than you do.
> 
> >As for constructive suggestions -- why not just forget on whole plan
> >and let the things work out the way they always did?  As you say,
> >there is no perfect solution, so why to complicate people's lives with
> >trying to get one?
> 
> I can't say I found it incredibly thrilling to try to work out the 
> project ordering, and I certainly have plenty of other work to do.  If 
> there's consensus that this was a bad idea, I'm happy not to do it 
> again.  However, one person has already said that they found it valuable.
> 
> I think it's a little early to start the post-mortem.  Perhaps in Stage 
> 3, we could have a better discussion about how well this worked out. 
> Or, perhaps it will become obvious before then, one way or the other.

agreed.  Anyway, thank you for all the work you have spent trying to
make the whole thing work; while I have my objections against the idea,
it may turn out that most people work in a different way
and will actually prefer it (I admit to enjoying a controlled amount of
chaos :-)

Zdenek


Re: GIV optimizations

2005-03-01 Thread Zdenek Dvorak
Hello,

> Giv optimizations are just features which are not 
> implemented yet in the new loop unroller, so I think 
> putting it in bugzilla is not appropriate.

it most definitely is appropriate.  This is a performance
regression.  Even if it were not, feature requests
can be put into Bugzilla.

The best of course would be if you could create a small testcase
demonstrating what you would like the compiler to achieve.

Zdenek


Re: Strange missing tree constant propagation

2005-03-10 Thread Zdenek Dvorak
Hello,

> Why is D.1476 not being propagated?  IVOPTS introduces it,
> but I don't see any reason why...
> 
> Also, why all the leading zeros?  Is there something special
> about that constant?  The initial RTL gcc produces for the
> assignment to D.1476 is also suboptimal:
> 
> ;; D.1476 = -1
> (insn 21 19 22 (set (reg:SI 64)
(const_int -1 [0xffffffff])) -1 (nil)
> (nil))
> 
> (insn 22 21 0 (set (reg:SI 62 [ D.1476 ])
> (reg:SI 64)) -1 (nil)
> (nil))
> 
> Strange...  Does anyone know a reason for why this happens?

I think the constant has TREE_OVERFLOW set; for mostly historical
reasons, optimizers are very conservative when dealing with such
constants.  There are two things that need to be done.  The first is to
check why ivopts produces the overflowed constant and fix that (ivopts
should not produce unnecessary overflowed constants, since they are
handled very conservatively in fold).  The second is to
remove the restrictions (remove the check for TREE_OVERFLOW from
is_gimple_min_invariant) and deal with the problems it exposes in tree
-> rtl expansion.

I think I will try the latter fix today (mostly because I no longer
remember what exactly the problems were that led me to introduce
the TREE_OVERFLOW check in is_gimple_min_invariant).

Zdenek


Re: Strange missing tree constant propagation

2005-03-10 Thread Zdenek Dvorak
Hello,

> >I think I will try the latter fix today (mostly because I no longer
> >remember what exactly the problems were that led me to introduce
> >the TREE_OVERFLOW check in is_gimple_min_invariant).
> 
> http://gcc.gnu.org/ml/gcc-patches/2003-12/msg00786.html
> And yes the testcase still fails when you remove TREE_OVERFLOW from
> is_gimple_min_invariant, at least it did three weeks ago when I tried
> this.

actually, it does not for me (after fixing a trivial problem in
find_taken_edge_cond_expr).  I am now bootstrapping and regtesting
the patch on every architecture I have access to.

Zdenek


Re: Weird behavior in ivopts code

2005-03-18 Thread Zdenek Dvorak
Hello,

> Which appears to walk down the array and try and choose better IV sets.
> Since it walks down the IV array, which is in SSA_NAME_VERSION order.
> Thus two loops which are equivalent in all ways except that they use
> different SSA_NAME_VERSIONs can get different IV sets.
> 
> Anyway, the instability of the IV opts code when presented with loops
> that differ only in the version #s in the SSA_NAMEs they use is really
> getting in the way of understanding the performance impact of the
> new jump threading code.  I would expect this instability to also make
> it difficult to analyze the IVopts in general.

there is not much to be done about the matter.  The choice of ivs in ivopts is
just a heuristic (and this cannot be changed, for compiler
performance reasons), and as such it is prone to such instabilities.
In fact, both choices of ivs make sense, and they have the same cost
with respect to the cost function used by ivopts.

Anyway, in this particular case, the patch below should make ivopts
to prefer the choice of preserving the variable 'i' (which is better
from the point of preserving debugging information).

Zdenek

Index: tree-ssa-loop-ivopts.c
===
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-ivopts.c,v
retrieving revision 2.21
diff -c -3 -p -r2.21 tree-ssa-loop-ivopts.c
*** tree-ssa-loop-ivopts.c  27 Oct 2004 20:27:20 -  2.21
--- tree-ssa-loop-ivopts.c  27 Oct 2004 20:30:19 -
*** determine_iv_cost (struct ivopts_data *d
*** 3326,3336 ****
  
cand->cost = cost_step + cost_base / AVG_LOOP_NITER (current_loop);
  
!   /* Prefer the original iv unless we may gain something by replacing it.  */
!   if (cand->pos == IP_ORIGINAL)
  cand->cost--;

/* Prefer not to insert statements into latch unless there are some
--- 3327,3337 
  
cand->cost = cost_step + cost_base / AVG_LOOP_NITER (current_loop);
  
!   /* Prefer the original iv unless we may gain something by replacing it;
!  this is not really relevant for artificial ivs created by other
!  passes.  */
!   if (cand->pos == IP_ORIGINAL
!   && !DECL_ARTIFICIAL (SSA_NAME_VAR (cand->var_before)))
  cand->cost--;

/* Prefer not to insert statements into latch unless there are some


Re: Question on tree-ssa-loop-im.c:for_each_index

2005-03-19 Thread Zdenek Dvorak
Hello,

> VIEW_CONVERT_EXPR is tcc_reference, but we can have a statement like:
> 
>   x = VIEW_CONVERT_EXPR<some_type>(22);

what is the semantics of this expression?  Should not this rather be

x =  22

(or just INTEGER_CST:some_type 22)?

Zdenek

> What ends up happening here is that find_interesting_uses_stmt calls
> find_interesting_uses_address, which goes down the references and
> runs into the constant, which it doesn't know how to handle.  I think
> the simplest fix is below.  It's certainly safe in that it converts
> an ICE into not aborting and seems to do the right thing.  The test case
> is in Ada and proprietary and it didn't seem worthwhile to make a small
> one for something this simple.
> 
> Does the following look correct?
> 
> *** tree-ssa-loop-im.c11 Mar 2005 09:05:10 -  2.31
> --- tree-ssa-loop-im.c19 Mar 2005 00:12:02 -
> *** for_each_index (tree *addr_p, bool (*cbc
> *** 195,198 
> --- 195,199 
>   case STRING_CST:
>   case RESULT_DECL:
> + case INTEGER_CST:
> return true;
>   


Re: Question on tree-ssa-loop-im.c:for_each_index

2005-03-20 Thread Zdenek Dvorak
Hello,

> > x = VIEW_CONVERT_EXPR<some_type>(22);
> 
>  what is the semantics of this expression?  Should not this rather be
> 
>  x =  22
> 
>  (or just INTEGER_CST:some_type 22)?
> 
> Depends what the type is.  If it's an array type, then there's no
> simple equivalent expression.

using a CONSTRUCTOR node?

Zdenek


Re: Question on tree-ssa-loop-im.c:for_each_index

2005-03-20 Thread Zdenek Dvorak
Hello,

> > Depends what the type is.  If it's an array type, then there's no
> > simple equivalent expression.
> 
> using CONSTRUCTOR node?
> 
> What I mean by "simple" is something that's easy to derive.  Suppose I have
> a record with numerous fields of various sizes and I unchecked convert a
> constant to it.  Yes, I could write code to convert that into a CONSTRUCTOR
> by figuring out what value should be assigned to each field, but that's a
> pretty large block of code for a pretty marginal case.

however, it avoids a possibly large number of bugs in code that does not
expect this corner case.  I would certainly consider it a much cleaner
solution than adding hacks to for_each_index and possibly other places
that do not expect something as weird.

That being said, I think your patch is safe and will not break any of
the uses of for_each_index.

Zdenek


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-03-27 Thread Zdenek Dvorak
Hello,

> On Sunday 27 March 2005 03:53, Canqun Yang wrote:
> > The last ChangeLog of rtlopt-branch was written in
> > 2003. After more than one year, many improvements in
> > this branch haven't been put into the GCC HEAD. Why?
> 
> Almost all of the rtlopt branch was merged.  Prefetching is one
> of the few things that was not, probably Zdenek knows why.

because I never made it work as well as the current version,
basically.  At least, no matter how much I tried, I never produced
any benchmark numbers that would justify the change.  Then I went
to tree-ssa (and tried to write prefetching there, got stuck on
the same issue, and then simply forgot about it due to loads of other work
:-( ).

Zdenek


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-03-27 Thread Zdenek Dvorak
Hello,

> On Sunday 27 March 2005 04:45, Canqun Yang wrote:
> > Another question is why the new RTL loop-unroller does
> > not support giv splitting.
> 
> Apparently because for most people it is not a problem that it does
> not do it, and while you have indicated earlier that it may be useful
> for you, you have neither tried to implement it yourself, nor provided
> a test case to PR20376.
> 
> FWIW you could try -fweb and see if it does what you want.  And if it
> does, you could write a limited webizer patch that works on just loop
> bodies after unrolling.

from what I understood, change to analyze_iv_to_split_insn that would
detect memory references should suffice.  Also maybe this patch might be
relevant:

http://gcc.gnu.org/ml/gcc-patches/2004-10/msg01176.html

Zdenek


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-04-05 Thread Zdenek Dvorak
Hello,

> > Steven Bosscher <[EMAIL PROTECTED]>:
> > 
> > >
> > > What happens if you use the memory address unrolling 
> > patch, turn on
> > > -fweb, and set the unrolling parameters properly?
> > >
> > 
> > The memory address unrolling patch can't work on IA-
> > 64,
> 
> Ah, no base+offset.  It could be made to work with a little
> work though.  The IV splitting needs to be taught about GIVs,
> and when that is in place, it should work, right?
> (IMHO the current RTL IV analysis is not very nice to work
> with, we could look into improving that first...)

what problems do you have concretely in mind?

> On the other hand, we may not even need it if the RTL unroller
> will be replaced by the tree one.  I don't know what Zdenek's
> plans are about that, or if this is a wise thing to do at all.

I don't think that is the case; as far as I know, it is recommended
to do non-specific unrolling just before or during scheduling.
(By "non-specific" I mean unrolling loops for no reason
other than just getting them unrolled.)

Zdenek


Re: Merging stmt_ann_d into tree_statement_list_node

2005-04-25 Thread Zdenek Dvorak
Hello,

> I am thinking about merging stmt_ann_d into tree_statement_list_node.
>
> Background
> --
> 
> tree_statement_list_node is a linked list node defined like so
> 
> struct tree_statement_list_node {
>   struct tree_statement_list_node *prev;
>   struct tree_statement_list_node *next;
>   tree stmt;
> };
> 
> stmt_ann_d is an annotation for a statement ("stmt" above) that
> includes operand cache, the basic block that the statement belongs to,
> and various flags among other things.

guessing from (*) below, you probably just forgot to mention that it also
involves changing SSA_NAME_DEF_STMT to point to the
tree_statement_list_node rather than the statement itself;
otherwise there is no way to get the
annotation for SSA_NAME_DEF_STMT (name)?

> Justifications
> --
> 
> o We have a 1:1 correspondence between tree_statement_list_node and
>   stmt_ann_d.  Why have separate data structures?
> 
> o Cache win.  Calling bsi_stmt loads tree_statement_list_node into
>   memory, which might in turn bring members of stmt_ann_d into cache.
>   Note that it's very common to call bsi_stmt and then do something
>   with operand cache while walking statements in a basic block.
> 
> o bsi_for_stmt becomes an O(1) operation.

(*)

>   Steven Bosscher says that
>   the CFG inliner will hit bsi_for_stmt hard.

This is just because the CFG inliner is badly written; Honza has already
fixed that in his private tree.

> o Less memory management overhead.
> 
> o One step closer to a flat statement structure (or a tuple form if
>   you like).  It's a bit silly that we have GIMPLE, but we still do
>   not have a flat structure.
> 
> o We can probably remove get_stmt_ann and replace all uses of
>   stmt_ann.  No more lazy stmt_ann allocation.
> 
> Thoughts?

It is definitely a good idea, IMHO.

Zdenek


Re: Merging stmt_ann_d into tree_statement_list_node

2005-04-25 Thread Zdenek Dvorak
Hello,

> > o One step closer to a flat statement structure (or a tuple form if
> >   you like).  It's a bit silly that we have GIMPLE, but we still do
> >   not have a flat structure.

on a side note, I am currently working on cleaning up gimplification
(mostly to enable embedding simple and fast cleanup passes like local
value numbering, copy/constant propagation and unreachable code
removal).  As a side effect, this will cause the gimplifier to directly
produce gimple statements instead of gradually rewriting the function
tree.  I will say more about this project at the GCC summit.

Once this is done, I would like to start working on moving gimple out of
trees to the "flat" structure.

Zdenek


Re: Merging stmt_ann_d into tree_statement_list_node

2005-04-25 Thread Zdenek Dvorak
Hello,

> > Then why not simply shorten this to:
> > 
> > 1) put stmt annoation directly in the tree_statement_list_node
> > 
> > 2) replace stmt_ann_t pointer in stmt with a pointer to its BSI, or the
> > tree_statement_list_node.  This makes bsi_from_stmt O(1) immediately.
> > 
> > 3) all stmts now have access to the stmt_ann directly in the
> > stmt_list_node, nothing changes except get_stmt_ann() returns the
> > address of the stmt ann within the tree_stmt_list_node.
> > 
> > 4) For fake stmts you will have to allocate a fake tree_stmt_list_node
> > which isnt linked with anything, and set the bsi pointer to that... then
> > you have the annotation and access to it from the stmt.
> > 
> > that seems like a quicker more direct approach then phasing it in
> > slowly, with less impact to anyone, and gives you immediate benefits. I
> > would think the patch to do this would be fairly small and
> > non-intrusive.
> 
> This looks like a better approach.  How would we do your step 1?  We
> have var_ann and tree_ann in addition to stmt_ann.  Shall we put a
> type field at the beginning of tree_statement_list_node+stmt_ann_d so
> that an annotation node can identify itself?  (Since all these tree
> annotations already have a field for annotation type, it's more like
> appending tree_statement_list_node to stmt_ann_d.)

I would go just for having

union
{
  struct stmt_list_node *container; /* For gimple statements.  */
  tree_ann_t ann;   /* For everything else.  */
}

(Plus some GTY annotations around that enable the garbage collector to
recognize this case; as far as my memory serves, I needed something
similar once and it was doable).  This way you won't waste space
for the useless tree_statement_list_node "type field".

Zdenek


Re: Merging stmt_ann_d into tree_statement_list_node

2005-04-25 Thread Zdenek Dvorak
Hello,

> > > This looks like a better approach.  How would we do your step 1?  We
> > > have var_ann and tree_ann in addition to stmt_ann.  Shall we put a
> > > type field at the beginning of tree_statement_list_node+stmt_ann_d so
> > > that an annotation node can identify itself?  (Since all these tree
> > > annotations already have a field for annotation type, it's more like
> > > appending tree_statement_list_node to stmt_ann_d.)
> > 
> > I would go just for having
> > 
> > union
> > {
> >   struct stmt_list_node *container; /* For gimple statements.  */
> >   tree_ann_t ann;   /* For everything else.  */
> > }
> 
> Err, I don't see how to tell the garbage collector about this without
> a type field.  We cannot rely on TREE_CODE (stmt) because CALL_EXPR
> may appear by itself or as an rhs of MODIFY_EXPR.

in the latter case, they don't have annotations?  But OK, there may be
some other case I am forgetting now, so perhaps playing it safe and
having a type field is a better idea.

Zdenek


Re: Testcase for loop in try_move_mult_to_index?

2005-04-29 Thread Zdenek Dvorak
Hello,

> I cannot see how fold-const.c:try_move_mult_to_index ever
> successfully will do a transformation if the loop
> 
>   for (;; ref = TREE_OPERAND (ref, 0))
> {
>   if (TREE_CODE (ref) == ARRAY_REF)
> {
> ...
>   break;
> }
> 
>   if (!handled_component_p (ref))
> return NULL_TREE;
> }
> 
> will not break in the first iteration.  What kind of complex
> trees are we trying to match here?

for example 

struct x
{
  int x;
  int y;
} a[1000];


&a[i].x + 8 * j is transformed to &a[i+j].x

Zdenek


Re: Struggle with FOR_EACH_EDGE

2005-05-01 Thread Zdenek Dvorak
Hello,

> >To see what kind of code GCC generates for FOR_EACH_EDGE, I wrote a
> >simple dummy function.
> 
> Kazu, I hope I have time to look in detail at your investigation, however
> one thing that might be causing a problem is that FOR_EACH_EDGE is an 
> iterator
> that works for a non-constant VEC.  This means there's an extra level of
> indirection that the optimizer cannot remove, unless it completely inlines
> the body of the loop (there might be some cases it can do it without 
> inlining,
> but I suspect not).  I wonder if it is worthwhile implementing FOR_EACH_EDGE
> and FOR_EACH_EDGE_VOLATILE (I want the default, shorter name, to be the
> constant iterator, so that you have to choose to use the non-constant one).
> 
> Even with a constant iterator, it might be necessary to help the optimizer
> by copying the vector length to somewhere local that the optimizer can see
> cannot be changed by the body of the loop.  Hmm, I guess that does mean you
> need a proper iterator object, rather than the integer index that 
> VEC_iterate
> employs.

you can always start the index at the number of edges minus 1 and iterate
down to zero.  But a proper iterator might be useful anyway.

Zdenek


Problem with -ftree-ter and loop unrolling

2005-05-18 Thread Zdenek Dvorak
Hello,

I am just playing with loop unrolling on trees (for the purposes of
prefetching), and I encountered the following problem.  Consider loop

sum1 = 0;
sum2 = 0;
for (i = 0; i < n; i++)
  {
x_1 = a[i];
y_1 = b[i];
sum1 += x_1 * y_1;
sum2 += x_1 / y_1;
  }

Now after unrolling we get (with some handling for remaining iterations
of the loop)

sum1 = 0;
sum2 = 0;
for (i = 0; i < n; i+=4)
  {
x_1 = a[i];
y_1 = b[i];
sum1 += x_1 * y_1;
sum2 += x_1 / y_1;
x_2 = a[i+1];
y_2 = b[i+1];
sum1 += x_2 * y_2;
sum2 += x_2 / y_2;
x_3 = a[i+2];
y_3 = b[i+2];
sum1 += x_3 * y_3;
sum2 += x_3 / y_3;
x_4 = a[i+3];
y_4 = b[i+3];
sum1 += x_4 * y_4;
sum2 += x_4 / y_4;
  }

So far OK, but with ter, this becomes

sum1 = 0;
sum2 = 0;
for (i = 0; i < n; i+=4)
  {
x_1 = a[i];
y_1 = b[i];
x_2 = a[i+1];
y_2 = b[i+1];
x_3 = a[i+2];
y_3 = b[i+2];
x_4 = a[i+3];
y_4 = b[i+3];
sum1 += x_1 * y_1 + x_2 * y_2 + x_3 * y_3 + x_4 * y_4;
sum2 += x_1 / y_1 + x_2 / y_2 + x_3 / y_3 + x_4 / y_4;
  }

Now we need some 11 registers for the loop, instead of the original 5
(and the number of registers grows with the unroll factor).

This causes huge regressions (more than 60% on sixtrack from spec2000).
It might perhaps be a good idea to limit the size of the expressions TER
produces, or to restrict in some other way how much it extends the live
ranges of registers?

Zdenek


Re: Problem with -ftree-ter and loop unrolling

2005-05-18 Thread Zdenek Dvorak
Hello,

> > > So far OK, but with ter, this becomes
> > > 
> > > sum1 = 0;
> > > sum2 = 0;
> > > for (i = 0; i < n; i+=4)
> > >   {
> > > x_1 = a[i];
> > > y_1 = b[i];
> > > x_2 = a[i+1];
> > > y_2 = b[i+1];
> > > x_3 = a[i+2];
> > > y_3 = b[i+2];
> > > x_4 = a[i+3];
> > > y_4 = b[i+3];
> > > sum1 += x_1 * y_1 + x_2 * y_2 + x_3 * y_3 + x_4 * y_4;
> > > sum2 += x_1 / y_1 + x_2 / y_2 + x_3 / y_3 + x_4 / y_4;
> > >   }
> > > 
> > > Now we need some 11 registers for the loop, instead of the original 5
> > > (and the number of registers grows with the unroll factor).
> > 
> > The TER hack we settled on for PR17549 was supposed to prevent this kind
> > of thing, but it was already obvious at the time that a better fix is
> > needed in the general case.  You've find a pretty nasty one here.
> 
> Why didn't it trigger?  I can't reproduce it by a bit of simple hacking
> around, have you got a little testcase and options to turn on to produce
> this? 

-O1 suffices.  The (sum? + 1) is needed to work around the hack
introduced to fix PR17549 (and it is very close to what happens in
sixtrack, except that there the operation with the accumulated variable
is a bit more complicated).

Zdenek

int a[200], b[200];

void xxx(void)
{
  int i, sum1 = 0, sum2 = 0, x, y;

  for (i = 0; i < 200; i+=8)
{
  x = a[i];
  y = b[i];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;

  x = a[i+1];
  y = b[i+1];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;

  x = a[i+2];
  y = b[i+2];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;

  x = a[i+3];
  y = b[i+3];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;

  x = a[i+4];
  y = b[i+4];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;

  x = a[i+5];
  y = b[i+5];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;

  x = a[i+6];
  y = b[i+6];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;

  x = a[i+7];
  y = b[i+7];
  sum1 = (sum1 + 1) + x * y;
  sum2 = (sum2 + 1) + x / y;
}

  bla (sum1, sum2);
}



Re: Type mismatch in tree-ssa-loop-ivopts.c:5436 (rewrite_address_base)

2005-05-18 Thread Zdenek Dvorak
Hello,

> Don't know how to fix this - nothing obvious.  But we create at
> 
>   *op = build1 (INDIRECT_REF, TREE_TYPE (*op), with);
> 
> an INDRECT_REF of the form
> 
>   type  size 
> unit size 
> align 8 symtab 1075593216 alias set -1 precision 8 min
>  max 
> pointer_to_this >
> 
> arg 0  type  frame_state>
> public unsigned SI
> size 
> unit size 
> align 32 symtab 0 alias set -1>
> var  def_stmt  0x4066c1f8>
> version 38
> ptr-info 0x4065b8ac>>
> 
> where we confuse type  and
> type .  Testcase is compiling
> with -O2 -g
> 
> /net/alwazn/home/rguenth/src/gcc/cvs/gcc-4.1/gcc/unwind-dw2.c: In function
> '__frame_state_for':
> /net/alwazn/home/rguenth/src/gcc/cvs/gcc-4.1/gcc/unwind-dw2.c:1036
> 
> 
> Please fix / advice.

rewrite_address_base is a piece of hack needed to cheat the alias
representation system.  I would not worry about it much now, once the
TARGET_MEM_REF patch is approved, this whole function will be removed.

Zdenek


Re: tree-ssa-address ICE

2005-06-08 Thread Zdenek Dvorak
Hello,

> On Wednesday 08 June 2005 12:01, Nick Burrett wrote:
> > $ ./cc1 -quiet test.c -mthumb -O2
> > ../../bug.c: In function ?foo?:
> > ../../bug.c:3: internal compiler error: in create_mem_ref, at
> > tree-ssa-address.c:585
> > Please submit a full bug report,
>   ^^^
> ;-)
> 
> 
> And some more information in the mean time:
> 
> Starting program: /home/steven/devel/build-test/gcc/cc1 -mthumb t.c -O
>  foo
> 
> Breakpoint 1, fancy_abort (file=0xa19258 
> "../../mainline/gcc/tree-ssa-address.c", line=585,
> function=0xa192d3 "create_mem_ref") at diagnostic.c:588
> 588   internal_error ("in %s, at %s:%d", function, trim_filename (file), 
> line);
> (gdb) up
> #1  0x005930da in create_mem_ref (bsi=0x7fbfffd370, 
> type=0x2a95896750, addr=0x7fbfffd390)
> at tree-ssa-address.c:585
> 585   gcc_unreachable ();
> (gdb) p debug_generic_stmt(bsi.tsi.ptr.stmt)
> #   VUSE ;
> c1D.1197_10 = *s1D.1192_27;
> 
> $12 = void
> (gdb) p parts
> $13 = {symbol = 0x0, base = 0x0, index = 0x2a95981380, step = 0x0, offset = 
> 0x0}

a patch; I will submit it once it passes regtesting (only the
tree-ssa-address.c:addr_for_mem_ref part is important, but I have also
changed the tree-ssa-loop-ivopts.c parts that could cause similar
problems).

Zdenek

Index: tree-ssa-address.c
===
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-address.c,v
retrieving revision 2.1
diff -c -3 -p -r2.1 tree-ssa-address.c
*** tree-ssa-address.c  7 Jun 2005 12:01:28 -   2.1
--- tree-ssa-address.c  8 Jun 2005 14:34:35 -
*** addr_for_mem_ref (struct mem_address *ad
*** 198,205 
  
  templates_initialized = true;
  sym = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup ("test_symbol"));
! bse = gen_raw_REG (Pmode, FIRST_PSEUDO_REGISTER);
! idx = gen_raw_REG (Pmode, FIRST_PSEUDO_REGISTER + 1);
  
  for (i = 0; i < 32; i++)
gen_addr_rtx ((i & 16 ? sym : NULL_RTX),
--- 198,205 
  
  templates_initialized = true;
  sym = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup ("test_symbol"));
! bse = gen_raw_REG (Pmode, LAST_VIRTUAL_REGISTER + 1);
! idx = gen_raw_REG (Pmode, LAST_VIRTUAL_REGISTER + 2);
  
  for (i = 0; i < 32; i++)
gen_addr_rtx ((i & 16 ? sym : NULL_RTX),
Index: tree-ssa-loop-ivopts.c
===
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-ivopts.c,v
retrieving revision 2.78
diff -c -3 -p -r2.78 tree-ssa-loop-ivopts.c
*** tree-ssa-loop-ivopts.c  7 Jun 2005 22:44:56 -   2.78
--- tree-ssa-loop-ivopts.c  8 Jun 2005 14:34:35 -
*** add_cost (enum machine_mode mode)
*** 3149,3156 
  
start_sequence ();
force_operand (gen_rtx_fmt_ee (PLUS, mode,
!gen_raw_REG (mode, FIRST_PSEUDO_REGISTER),
!gen_raw_REG (mode, FIRST_PSEUDO_REGISTER + 1)),
 NULL_RTX);
seq = get_insns ();
end_sequence ();
--- 3149,3156 
  
start_sequence ();
force_operand (gen_rtx_fmt_ee (PLUS, mode,
!gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1),
!gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 2)),
 NULL_RTX);
seq = get_insns ();
end_sequence ();
*** multiply_by_cost (HOST_WIDE_INT cst, enu
*** 3221,3228 
(*cached)->cst = cst;
  
start_sequence ();
!   expand_mult (mode, gen_raw_REG (mode, FIRST_PSEUDO_REGISTER), GEN_INT (cst),
!  NULL_RTX, 0);
seq = get_insns ();
end_sequence ();

--- 3221,3228 
(*cached)->cst = cst;
  
start_sequence ();
!   expand_mult (mode, gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1),
!  gen_int_mode (cst, mode), NULL_RTX, 0);
seq = get_insns ();
end_sequence ();

*** multiplier_allowed_in_address_p (HOST_WI
*** 3247,3253 

if (!valid_mult)
  {
!   rtx reg1 = gen_raw_REG (Pmode, FIRST_PSEUDO_REGISTER);
rtx addr;
HOST_WIDE_INT i;
  
--- 3247,3253 

if (!valid_mult)
  {
!   rtx reg1 = gen_raw_REG (Pmode, LAST_VIRTUAL_REGISTER + 1);
rtx addr;
HOST_WIDE_INT i;
  
*** get_address_cost (bool symbol_present, b
*** 3305,3316 
HOST_WIDE_INT i;
initialized = true;
  
!   reg1 = gen_raw_REG (Pmode, FIRST_PSEUDO_REGISTER);
  
addr = gen_rtx_fmt_ee (PLUS, Pmode, reg1, NULL_RTX);
for (i = 1; i <= 1 << 20; i <<= 1)
{
! XEXP (addr, 1) = GEN_INT (i);
  if (!memory_address_p (Pmode, addr))
break;
}
--- 3305,3316 
HOST_WIDE_INT i;
initialized = true;
  
!   reg1 = gen_raw_REG (Pmode, LAST_VIRTUAL_REGISTER + 1);
  
addr = gen_rtx_fmt_ee (PLUS, Pmode, reg1, NULL_RTX);
for (i = 1; i <= 1 << 20; i

Problem with recompute_tree_invarant_for_addr_expr

2005-06-13 Thread Zdenek Dvorak
Hello,

I have run into the following problem:

recompute_tree_invarant_for_addr_expr says that the address of a decl
is invariant if it is a decl in the current function:

if (decl_function_context (node) == current_function_decl)
  ... /* set TREE_INVARIANT  */

This has a small flaw -- in case NODE has variable size, it gets
allocated on the stack dynamically; it may be allocated and deallocated
several times, and its address is no longer an invariant.

So I tried to fix the code as follows:

if (decl_function_context (node) == current_function_decl
    && TREE_CONSTANT (DECL_SIZE (node)))
  ... /* set TREE_INVARIANT  */

The problem is that tree-nested.c builds frame_decl, adds fields to
it one by one, sometimes takes its address, and only in the end lays it
out.  This means that at the point the address is taken, DECL_SIZE is
not yet known and it is NULL.  In fact, in the moment the address is
taken, there seems to be no way how to tell whether the size of
frame_decl will be variable or not.

So I tried a conservative variant:

if (decl_function_context (node) == current_function_decl
    && DECL_SIZE (node)
    && TREE_CONSTANT (DECL_SIZE (node)))
  ... /* set TREE_INVARIANT  */

This works up to moment verify_stmts is called, which ends with
"invariant not recomputed when ADDR_EXPR changed" (because now
frame_decl is laid out, and we know its size is constant, and that the
address is invariant).

Other possibility would be to be "optimistic":

if (decl_function_context (node) == current_function_decl
    && (!DECL_SIZE (node)
        || TREE_CONSTANT (DECL_SIZE (node))))
  ... /* set TREE_INVARIANT  */

which would probably work for frame_decl, as it should be allocated just
once (I hope), but it is not very safe and would still ICE with
"invariant not recomputed when ADDR_EXPR changed", this time in case
frame_decl were variable-sized.

It might be possible to special-case frame_decl in
recompute_tree_invarant_for_addr_expr, but it would be quite an ugly hack,
and it would not be terribly simple to do anyway.

The other possibility is to somehow remember the list of places where
we take address of something based on frame_decl, and recompute
TREE_INVARIANT for them once frame_decl is laid out.

Any simpler ideas for solving this problem?

Zdenek


Re: Problem with recompute_tree_invarant_for_addr_expr

2005-06-13 Thread Zdenek Dvorak
Hello,

> On Mon, Jun 13, 2005 at 07:39:33PM +0200, Zdenek Dvorak wrote:
> > This has a small flaw -- in case NODE has variable size, it gets
> > allocated on stack dynamically, it may be allocated and deallocated
> > several times, and its address is no longer an invariant.
> > 
> > So I tried to fix the code as follows:
> > 
> > if (decl_function_context (node) == current_function_decl
> > && TREE_CONSTANT (DECL_SIZE (node))
> >   ... /* set TREE_INVARIANT  */
> 
> Such nodes should never be seen having their address taken.  They
> should be completely removed during gimplification.  So I think we
> should stop here and figure out why you care about such nodes.

good point, thanks (I don't precisely remember why I thought this change
was necessary; I did it while working on some changes to gimplification,
so perhaps I broke something there).

Zdenek


Re: Problem with recompute_tree_invarant_for_addr_expr

2005-06-13 Thread Zdenek Dvorak
Hello,

> On Mon, Jun 13, 2005 at 07:39:33PM +0200, Zdenek Dvorak wrote:
> > This has a small flaw -- in case NODE has variable size, it gets
> > allocated on stack dynamically, it may be allocated and deallocated
> > several times, and its address is no longer an invariant.
> > 
> > So I tried to fix the code as follows:
> > 
> > if (decl_function_context (node) == current_function_decl
> > && TREE_CONSTANT (DECL_SIZE (node))
> >   ... /* set TREE_INVARIANT  */
> 
> Such nodes should never be seen having their address taken.  They
> should be completely removed during gimplification.  So I think we
> should stop here and figure out why you care about such nodes.

OK, I remembered.  I put

if (is_gimple_min_invariant (t))

or

if (is_gimple_val (t))
  {
shortcut;
  }

type constructs in some places in gimplification.  This causes
problems, since is_gimple_min_invariant only checks the TREE_INVARIANT
flag for ADDR_EXPRs, and ADDR_EXPRs of variable-sized decls then leak through.
The change to recompute_tree_invarant_for_addr_expr was intended to
prevent this.

Zdenek


Re: Problem with recompute_tree_invarant_for_addr_expr

2005-06-14 Thread Zdenek Dvorak
Hello,

> On Mon, Jun 13, 2005 at 10:54:23PM +0200, Zdenek Dvorak wrote:
> > OK, I remembered.  I put
> > 
> > if (is_gimple_min_invariant (t))
> > 
> > or
> > 
> > if (is_gimple_val (t))
> >   {
> > shortcut;
> >   }
> > 
> > type constructs on some places in gimplification.
> 
> With an aim toward speeding up gimplification, I guess.  Any idea how
> much benefit you get from that?

not really.  The problem anyway is that it does not quite work even when
the decls with variable size are taken care of, as there are other
constructs for which ADDR_EXPR gets marked TREE_INVARIANT but that
are not gimple.

Zdenek

> As for the original question, you could try modifying tree-nested.c to
> use a local routine for creating the ADDR_EXPRs of the frame object,
> and force TREE_INVARIANT on there.


Re: -floop-optimize2

2005-06-22 Thread Zdenek Dvorak
Hello,

> I've noticed that -floop-optimize2 tends to be a pessimization on many
> algorithms.
> 
> I'm hesitant to file this as a "bug", given that -floop-optimize2 is a
> "new" replacement for the older loop optimizer.
> 
> Is -floop-optimize2 still in development, and not ready yet -- or are
> the problems I'm seeing something that should be analyzed and reported
> as a bug?

if you have testcases, I would be definitely interested.

Zdenek


Re: -fprofile-arcs changes the structure of basic blocks

2005-06-23 Thread Zdenek Dvorak
Hello,

> I want to use profiling information. I know there're two relevent
> fields in each basic block, count and frequency. I want to use
> frequency because the compiled program is for another architecture so
> it cannot run on the host.
> 
> I use -fprofile-arcs. 

this does not make much sense to me.  If you cannot run the compiled
program, adding instrumentation with -fprofile-arcs is useless.  In this
case, you may just use the guessed frequencies (which you get by default
without any flags).

Zdenek

> And I can see the frequency value when I debug
> cc1. But I happened to realize that when I add -fprofile-arcs, it changes
> the whole structure of basic blocks. I compared the vcg output
> files with and without the -fprofile-arcs. I found they're totally
> different.
> 
> My question is why it is so? I want to know the profiling info, but if
> profiling info I get is for another different structure of basic
> block, it's useless to me.
> 
> Where do I go wrong here? Which option is the suitable in this case?
> The gcc version is 3.3.3. Thanks.
> 
> 
> Regards,
> Timothy


Re: Question on tree-ssa-loop-ivopts.c:constant_multiple_of

2005-06-30 Thread Zdenek Dvorak
Hello,

> Isn't it the case that *any* conversion can be stripped for the purpose
> of this routine?  I get an ICE compiling the Ada RTS a-strfix.adb because
> of that.  The following seems to fix it, but is it right?

no, it is not.  The uses of constant_multiple_of expect it to
determine whether TOP = CST * BOT, modulo no-op casts.  It is true
that there are some sequences of casts that in the end satisfy this but
are not handled by constant_multiple_of; you would need to be more
careful when changing it.

The bug you describe most probably is PR 21963 (and there is a patch for
this PR submitted).

Zdenek

> *** tree-ssa-loop-ivopts.c26 Jun 2005 21:21:32 -  2.82
> --- tree-ssa-loop-ivopts.c29 Jun 2005 21:38:29 -
> *** constant_multiple_of (tree type, tree to
> *** 2594,2599 
> bool negate;
>   
> !   STRIP_NOPS (top);
> !   STRIP_NOPS (bot);
>   
> if (operand_equal_p (top, bot, 0))
> --- 2594,2603 
> bool negate;
>   
> !   /* For determining the condition above, we can ignore all conversions, not
> !  just those that don't change the mode, so can't use STRIP_NOPS here. */
> !   while (TREE_CODE (top) == NOP_EXPR || TREE_CODE (top) == CONVERT_EXPR)
> ! top = TREE_OPERAND (top, 0);
> !   while (TREE_CODE (bot) == NOP_EXPR || TREE_CODE (bot) == CONVERT_EXPR)
> ! bot = TREE_OPERAND (bot, 0);
>   
> if (operand_equal_p (top, bot, 0))


Re: PR22336 (was: Problem with tree-ssa-loop-ivopts.c:get_computation-cost)

2005-07-12 Thread Zdenek Dvorak
Hello,

> If no one is suggesting an alternative, tested on x86 and x86_64-linux
> where it restores bootstrap (at last :), ok to commit?

I think the patch is fine (although of course I cannot approve it).  The
more proper fix would be to never expand such expressions from ivopts;
I have a patch for this, but it needs more testing and benchmarking.

Zdenek

> We're down to 6 ACATS FAIL on x86_64 and 8 on x86:
> 
> http://gcc.gnu.org/ml/gcc-testresults/2005-07/msg00654.html
> http://gcc.gnu.org/ml/gcc-testresults/2005-07/msg00653.html
> 
> common:
> 18818: cd10002 (run) stream attributes
> 18819: cdd2a02 (run, works at -O0 everywhere) works with -O2 -fno-tree-sra
> 22333: c34007p c34007r c45282b (run) spurious discriminant CONSTRAINT_ERROR   
> 
> x86 only:
> 18659: c32001e c64105b c95086b (ICE x86, ppc, ia64, works on x86_64, pass 
> everywhere at -O0) works with -O2 -fno-tree-sra
> 
> x86_64 only:
> 20548: c52103x (run) segfault at runtime on x86_64 and hppa
> 
> Laurent
> 
> 2005-07-12  Richard Kenner  <[EMAIL PROTECTED]>
>   Laurent GUERBY  <[EMAIL PROTECTED]>
> 
>   PR tree-optimization/22336
>   * function.c (record_block_change): Check for 
>   cfun->ib_boundaries_block.
> 
> 
> Index: function.c
> ===
> RCS file: /cvs/gcc/gcc/gcc/function.c,v
> retrieving revision 1.635
> diff -u -r1.635 function.c
> --- function.c  7 Jul 2005 21:04:31 -   1.635
> +++ function.c  10 Jul 2005 19:06:10 -
> @@ -5502,6 +5502,9 @@
>if (!block)
>  return;
> 
> +  if(!cfun->ib_boundaries_block)
> +return;
> +
>last_block = VARRAY_TOP_TREE (cfun->ib_boundaries_block);
>VARRAY_POP (cfun->ib_boundaries_block);
>n = get_max_uid ();
> 
> 
> On Tue, 2005-07-05 at 10:31 +0200, Laurent GUERBY wrote:
> > This is http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22212
> > and is the problem blocking Ada bootstrap on x86_64-linux,
> > it would be great to move forward on this.
> > 
> > Laurent
> > 
> > On Thu, 2005-06-30 at 18:18 -0400, Richard Kenner wrote:
> > > This function generates RTL from an expression to see how many RTL insns
> > > it is.  But this causes a problem compiling the Ada ACATS test cxa4006.
> > > 
> > > The problem is when part of the expression has a location.  In that
> > > case, record_block_change is called and that relies on
> > > cfun->ib_boundaries_block being set.  But it isn't because we aren't
> > > expanding stuff "for real".  A kludge would be to test that field, but
> > > what's the proper way?
> > > 
> > > Also, seq_cost really should be looking at next_real_insn, not NEXT_INSN,
> > > since any notes shouldn't be counted.
> > > 
> 


Re: PR22336 (was: Problem with tree-ssa-loop-ivopts.c:get_computation-cost)

2005-07-12 Thread Zdenek Dvorak
Hello,

> I think the patch is fine (although of course I cannot approve it).  
> 
> I, as it's author, do not.  But, as I keep saying, I don't know what
> the proper fix is.

preventing record_block_change from working in this situation seems a
quite clean solution to me; of course, not expanding expressions with
TREE_BLOCK set (when determining their cost) at all would be better, but
somewhat more complicated.

> The more proper fix would be to never expand such expressions from ivopts;
> 
> What are "such expressions"?

Expressions with TREE_BLOCK set; i.e. those coming from the compiled program.

Zdenek


Re: Question on updating ssa for virtual operands (PR tree-optimization/22543)

2005-08-13 Thread Zdenek Dvorak
Hello,

> > The other thing we could try to do is put virtual variables in loop-closed-
> > form, at least just before the vectorizer, and at least just for some
> > loops. Does this sound reasonabale? (By the way, why don't we keep virtual
> > variables in loop-closed-form?)
> 
> We used to, nobody could come up with a good reason to keep doing it, so
> we stopped.

there were a couple of good reasons (consistency, faster updating
after loop unrolling and loop header copying).  However, LCSSA for virtual
operands is memory expensive, and this outweighed the advantages (in
particular, slower updating during loop manipulations is compensated
for by everything else being faster).

Zdenek


Re: -fprofile-generate and -fprofile-use

2005-08-31 Thread Zdenek Dvorak
Hello,

> >A more likely source of performance degradation is that loop unrolling
> >is enabled when profiling, and loop unrolling is almost always a bad
> >pessimization on 32 bits x86 targets.
> 
> To clarify, I was compiling with -funroll-loops and -fpeel-loops
> enabled in both cases.
> 
> The FDO slowdown in my case was caused by the presence of some loop
> invariant code that was getting removed from the loop by the loop
> optimizer pass in the non-FDO case.

you may try adding the -fmove-loop-invariants flag, which enables the new
invariant motion pass.

Zdenek


Re: doloop-opt deficiency

2005-08-31 Thread Zdenek Dvorak
Hello,

> Dale Johannesen wrote:
> > I think this is a simple pasto; this code was evidently copied from
> > the previous block:
> > 
> 
> I don't think that this was a simple pasto.  The code looks correct.
> We have the same code in tree-ssa-loop-niter.c around line 436, since
> we inherited this code from the rtl-level.

note the extra negation in loop-iv.c.  I think the fix is OK.

Zdenek

> > Index: loop-iv.c
> > ===
> > RCS file: /cvs/gcc/gcc/gcc/loop-iv.c,v
> > retrieving revision 2.35
> > diff -u -b -c -p -r2.35 loop-iv.c
> > cvs diff: conflicting specifications of output style
> > *** loop-iv.c   21 Jul 2005 07:24:07 -  2.35
> > --- loop-iv.c   29 Aug 2005 23:34:12 -
> > *** iv_number_of_iterations (struct loop *lo
> > *** 2417,2423 
> >   tmp0 = lowpart_subreg (mode, iv0.base, comp_mode);
> >   tmp1 = lowpart_subreg (mode, iv1.base, comp_mode);
> >   
> > ! bound = simplify_gen_binary (MINUS, mode, mode_mmin,
> >lowpart_subreg (mode, step, comp_mode));
> >   if (step_is_pow2)
> > {
> > --- 2417,2423 
> >   tmp0 = lowpart_subreg (mode, iv0.base, comp_mode);
> >   tmp1 = lowpart_subreg (mode, iv1.base, comp_mode);
> >   
> > ! bound = simplify_gen_binary (PLUS, mode, mode_mmin,
> >lowpart_subreg (mode, step, comp_mode));
> >   if (step_is_pow2)
> > {
> 
> 
> I don't think this fix is correct: 'bound' is used to test whether an
> overflow has occured, and in the following code that uses 'bound', it
> is expected to overflow for the example that you have provided:
> 
> /* If s is power of 2, we know that the loop is infinite if
>a % s <= b % s and a - s overflows.  */
> assumption = simplify_gen_relational (reverse_condition (cond),
>   SImode, mode,
>   bound, tmp0);
> 
> > 
> > 
> > The code as it was computed -2147483648-256 which overflows.
> 
> this is the right behavior.
> 
> > We noticed that the simple loop here
> 
> > extern int a[];
> > int foo(int w) {
> >   int n = w;
> >   while (n >= 512)
> > {
> > a[n] = 42;
> > n -= 256;
> > }
> >   }
> 
> > 
> > 
> > was being treated as ineligible for the doloop modification. 
> 
> The problem in this example is that the initial value of the induction
> variable is not known: it is a parameter 'w'.  Consequently we have
> not enough information to determine how many times it wraps, nor for
> estimating the number of iterations, if you're using the -fwrapv
> semantics.  
> 
> I'm not sure whether at rtl level we have this distinction of
> wrapping/non wrapping operations, but from the code that I've read,
> everything wraps (probably Roger Sayle knows more about the wrapping
> behavior that is expected at rtl level).  Having said this, we have to
> adjust the code at the tree level such that it catches this case
> (conditionally on signed iv and !fwrapv).
> 
> sebastian


Re: -fprofile-generate and -fprofile-use

2005-09-02 Thread Zdenek Dvorak
Hello,

> > >you may try adding -fmove-loop-invariants flag, which enables new
> > >invariant motion pass.
> > 
> > That cleaned up both my simplified test case, and the code it
> > originated from.  It also cleaned up a few other cases where I
> > was noticing worse performance with FDO enabled.  Thanks!!
> > 
> > Perhaps this option should be enabled by default when doing FDO
> > to replace the loop invariant motions done by the recently
> > disabled loop optimize pass.
> 
> This sounds like sane idea.  Zdenek, is -fmove-loop-invariants dangerous
> in some way or just disabled because old loop does the same?

-fmove-loop-invariants is enabled by default on killloop-branch, and
passes bootstrap & regtesting on i686 and x86_64; the branch does not
bootstrap for me on ia64 and ppc (but neither does mainline).  I am
not aware of any correctness/ICE problems with -fmove-loop-invariants,
but given that it is disabled by default, this does not say much.

Zdenek


Problem with dom

2005-09-11 Thread Zdenek Dvorak
Hello,

I have run into the following problem with dom.  One of the places
where cfg_altered is set is wrong:

Index: tree-ssa-dom.c
===
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-dom.c,v
retrieving revision 2.124.2.1
diff -c -3 -p -r2.124.2.1 tree-ssa-dom.c
*** tree-ssa-dom.c  2 Sep 2005 16:43:11 -   2.124.2.1
--- tree-ssa-dom.c  10 Sep 2005 16:47:06 -
*** tree_ssa_dominator_optimize (void)
*** 479,485 
if (cfg_altered)
  free_dominance_info (CDI_DOMINATORS);
  
!   cfg_altered = cleanup_tree_cfg ();
  
if (rediscover_loops_after_threading)
{
--- 479,485 
if (cfg_altered)
  free_dominance_info (CDI_DOMINATORS);
  
!   cfg_altered |= cleanup_tree_cfg ();
  
if (rediscover_loops_after_threading)
{

i.e., dom iterates only if cfg cleanup changed something.
In particular, if jump threading is performed but cfg cleanup
does not change anything (which happens sometimes), dom does
not iterate.  For testcases like

if (n > 0)
  loop;
if (n > 0)
  loop;
...
if (n > 0)
  loop;

it causes only some of the jumps to be threaded, and 
we get code that looks like

if (n > 0)
  {
loop;
loop;
loop;
goto next;
  }

if (n > 0)
  {
next:;
loop;   (*)
loop;
loop;
  }

This is very bad for analysis of # of iterations: neither
of the "if (n > 0)" conditions dominates loop (*), thus we are unable to
determine that n must be positive there, and we cannot determine the
# of iterations of the loop precisely.

So I tried the obvious fix (the patch above).  This fixes the problem,
but also causes dom to completely unroll loops sometimes, for
example for

void bar(int);

void foo(void)
{
  int i;

  for (i = 0; i < 1; i++)
{
  if (i & 65536)
abort ();
  bar(i);
}
}

(two exit conditions are necessary; for some reason, this won't
happen if "if (i & 65536) abort ();" is removed).  This of course
is not desirable (and it is very, very slow).  Any idea how to fix
the first problem without introducing the second one?

Zdenek


Re: Problem with dom

2005-09-11 Thread Zdenek Dvorak
Hello,

> > I have run into following problem with dom. One of the places
> > where cfg_altered is set is wrong:
> It's not really wrong, it's set this way on purpose, specifically
> to avoid the compile-time explosion you're seeing.

This works by pure luck only, I think.  In fact, the only thing
that prevents dom from unrolling loops just now is

  if (need_ssa_update_p ())
return false;

condition in tree_can_merge_blocks_p, which prevents block merging
and keeps the cfg unchanged.  While I do not have a testcase for this,
I would not be surprised if dom could be persuaded to completely unroll
a loop as it is.

Nevertheless, could you please add at least a short comment to
tree-ssa-dom.c explaining that this apparent mistake is intentional?

> > So I tried the obvious fix (the patch above).  This fixes the problem,
> > but also causes dom to completely unroll loops sometimes, for
> > example for
> > 
> > void bar(int);
> > 
> > void foo(void)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 1; i++)
> > {
> >   if (i & 65536)
> > abort ();
> >   bar(i);
> > }
> > }
> > 
> > (two exit conditions are necessary; for some reason, this won't
> > happen if "if (i & 65536) abort ();" is removed).  This of course
> > is not desirable (and it is very, very slow).  Any idea how to fix
> > the first problem without introducing the second one?
> I think the way to do this is to somehow limit the iterations when
> one or more backedges are threaded.
> 
> This may also fall out of the work Steven has started.  His changes
> limit DOM's ability to thread jumps when blocks get large, which is
> precisely what happens when DOM unrolls this loop completely. We
> get a single very very large basic block.  You might poke around with
> his change a little.

No, it unfortunately cannot help.  The new blocks are placed before the
loop, and jump threading only occurs inside the loop, so the sizes of
copied blocks remain small.

I guess the only real fix is to remove jump threading from dom and make
a separate, non-iterative pass for it.  Which of course is quite a bit
of work.

Zdenek


Must mode of set and REG_EQUIV note match?

2005-09-13 Thread Zdenek Dvorak
Hello,

I have the following insn:

(insn 522 521 523 87 (set (reg:SI 308)
(reg:SI 0 ax)) 40 {*movsi_1} (nil)
(insn_list:REG_RETVAL 520 (expr_list:REG_EQUAL (parity:DI (reg:DI 248 [ D.1874 ]))
(nil))))

Is this correct?  I have a piece of code that breaks if the mode of the
assigned register is not equal to the mode of the expression in the REG_EQUAL
note, and I wonder whether I should be fixing this piece of code, or the
code that sets the REG_EQUAL note on this insn.

Zdenek


Re: Loop information

2005-09-14 Thread Zdenek Dvorak
Hello,

> Can someone please help me getting the following information?
> 
> 1) I would like to obtain the loop bounds (constant case) of all nested 
> loops
> of a RTL insn. Is there a data structure over which I can iterate to get 
> bounds
> for each nested loop of a RTL insn?
>
> 3) Can I determine if a pseudo register (RTX) is an induction variable 
> (linear) or not?
> Which data structure gives me this information?

see loop-iv.c (or a bit improved version of it on killloop-branch).
You can analyze induction variables and determine # of iterations of a
loop using it, which should give you the information you need.  But the
rtl-level iv analysis is not very powerful, just sufficient for the
current uses.

> 2) Is there a way of determining sequences as mentioned in the paper
> "Beyond Induction Variables: Detecting and Classifying Sequences Using a 
> Demand-Driven SSA From" by Gerlek, Stoltz and Wolfe? 

Not at the RTL level.  At the tree level, see tree-scalar-evolution.c

> 4) At RTL level, array accesses convert to MEM expressions. I was 
> wondering
> if I can obtain the source level array name from the MEM expression.

You might be able to extract this information from MEM_EXPR.

Zdenek


Re: [RFC] propagating loop dependences from trees to RTL (for SMS)

2005-09-22 Thread Zdenek Dvorak
Hello,

> As a follow up to http://gcc.gnu.org/ml/gcc/2005-04/msg00461.html
> 
> I would like to improve SMS by passing data dependencies information
> computed in tree-level to rtl-level SMS. Currently data-dependency graph
> built for use by SMS has an edge for every two data references (i.e. it's
> too conservative). I want to check for every loop, using functions defined
> in tree-data-ref.c, if there are data dependencies in the loop. The problem
> is how to pass this information to SMS (note - we're only trying to convey
> whether there are no dependencies at all in the loop - i.e. one bit of
> information). The alternatives being considered are:
> 
> 1. Introduce a new BB bit flag and set it for the header BB of a loop that
> has no data dependencies. This approach already works, but only if the old
> loop optimizer (pass_loop_optimize) is disabled (otherwise the bit doesn't
> survive). One potential problem is that the loop header BB may change
> between the tree-level and SMS as result of some optimization pass (can
> that really happen?)

yes -- jump threading may change loop structures, although this is
fairly easy to prevent.  Loop optimizations can of course alter the
structure of loops, but they should be able to preserve this type of
information.

> 2. Use a bitmap (as suggested in
> http://gcc.gnu.org/ml/gcc-patches/2005-03/msg01353.html) that is indexed
> using the BB index.  In my case I need to define and use the property within 
> different
> functions. I can define a static function
>  "set_and_check_nodeps(bb_loop_header)" and define a bitmap there.
>  Like the previous solution, The problem that can arise is that some
> intermediate optimizations can change the header of the loop. By the way,
> is it guaranteed that a BB keeps the same index throught the entire
> compilation?

No -- basic blocks are "compacted" from time to time, to fill in holes
created by removal of dead blocks.  Also, basic block numbers might
change due to basic block splitting/merging in some optimizations.

> 3. Use insn_notes - introduce a new note "NOTE_INSN_NO_DEPS_IN_LOOP" to be
> inserted after the "NOTE_INSN_LOOP_BEG" for relevant loops.

NOTE_INSN_LOOP_BEG should most definitely disappear in 4.2, and
introducing insn notes seems like a bad idea to me.

> 4. Other ideas?

Preserving the information about loops throughout the optimizations, and
just keeping this information attached to the loop description would by
far be the cleanest approach -- admittedly also the one that requires the
greatest amount of work.

Zdenek


Re: Possible bug in tree-ssa-loop-ivopts.c:prepare_decl_rtl?

2005-09-28 Thread Zdenek Dvorak
Hello,

> There don't seem to be many comments explaining why we're doing
> what we're doing here, so I'm not sure whether this was the intended
> behaviour or not.  Do we really want to kick out existing DECL_RTLs,
> or is there simply a missing !DECL_RTL_SET_P in the DECL_P condition:

I think that adding the !DECL_RTL_SET_P check should work.  Nevertheless,
prepare_decl_rtl and friends should disappear when killloop-branch is merged
(I have a patch for this in testing, not yet committed to the branch).

Zdenek


[RFC] Not using VAR_DECLs for temporary variables

2005-12-21 Thread Zdenek Dvorak
Hello,

during expansion of expressions, gimplification creates a lot of temporary
variables.  VAR_DECLs are fairly large (88 bytes on i686), and
additionally an annotation (44 bytes on i686) is allocated for each of
them (some of them even get names, for an additional ~50 bytes of memory
each).  This is quite wasteful.

Basic observation is that these temporaries have exactly one definition.
It is therefore possible to generate directly SSA_NAMEs for them,
setting SSA_NAME_VAR to NULL (and be careful to never rewrite them out
of ssa).  Theoretically, it should need only a few changes to gimplification,
into- and out-of-ssa, and expand, and everything should work.  Unfortunately,
a lot of SSA passes look at SSA_NAME_VAR.  This seems wrong to me, but
changing all the places would be very hard.

However, if we instead put a really minimal DECL node (we allocate only
one per type, and consisting just of struct tree_common -- 16 bytes) as
a SSA_NAME_VAR, things are much easier to get working.

The patch below is a first attempt at the idea (completely untested,
I just made it work somehow so that I could measure its usefulness).
We get the following results (for combine.i):

-O0:
 TOTAL :   3.9122668 kB
 TOTAL :   3.8722096 kB (-2.5%)

-O1:
 TOTAL :  20.7734640 kB
 TOTAL :  20.2633688 kB (-2.7%)

-O2:
 TOTAL :  25.9341116 kB
 TOTAL :  25.8040582 kB (-1.2%)

That is, we need about 2% less memory, and also the compilation seems to
be a bit faster (maybe because of the memory savings, or possibly
because less work needs to be done in into- and outof- ssa passes).

However, this idea is a bit hackish, so I would like to hear an opinion
about it before I spend more time on testing it.

Zdenek

Index: tree-into-ssa.c
===
*** tree-into-ssa.c (revision 108807)
--- tree-into-ssa.c (working copy)
*** mark_def_sites (struct dom_walk_data *wa
*** 650,655 
--- 650,668 
SSA_OP_USE | SSA_OP_VUSE | SSA_OP_VMUSTKILL)
  {
tree sym = USE_FROM_PTR (use_p);
+ 
+   if (TREE_CODE (sym) == SSA_NAME)
+   {
+ /* Only gimplifier temporaries are now in ssa form.  We had already
+seen the definition, so we have assigned a version to the ssa
+name, and the definition must dominate it.  */
+ gcc_assert (TMP_NAME_P (sym));
+ gcc_assert (SSA_NAME_VERSION (sym) != 0);
+ gcc_assert (dominated_by_p (CDI_DOMINATORS, bb,
+ bb_for_stmt (SSA_NAME_DEF_STMT (sym))));
+ continue;
+   }
+ 
gcc_assert (DECL_P (sym));
if (!bitmap_bit_p (kills, DECL_UID (sym)))
set_livein_block (sym, bb);
*** mark_def_sites (struct dom_walk_data *wa
*** 674,679 
--- 687,700 
/* Now process the defs and must-defs made by this statement.  */
FOR_EACH_SSA_TREE_OPERAND (def, stmt, iter, SSA_OP_DEF | SSA_OP_VMUSTDEF)
  {
+   if (TREE_CODE (def) == SSA_NAME)
+   {
+ gcc_assert (TMP_NAME_P (def));
+ gcc_assert (SSA_NAME_VERSION (def) == 0);
+ gcc_assert (SSA_NAME_DEF_STMT (def) == stmt);
+ register_tmp_ssa_name (def);
+ continue;
+   }
gcc_assert (DECL_P (def));
set_def_block (def, bb, false);
bitmap_set_bit (kills, DECL_UID (def));
*** rewrite_stmt (struct dom_walk_data *walk
*** 1039,1044 
--- 1060,1067 
  SSA_OP_ALL_USES|SSA_OP_ALL_KILLS)
{
tree var = USE_FROM_PTR (use_p);
+   if (TMP_NAME_P (var))
+ continue;
gcc_assert (DECL_P (var));
SET_USE (use_p, get_reaching_def (var));
}
*** rewrite_stmt (struct dom_walk_data *walk
*** 1048,1053 
--- 1071,1078 
  FOR_EACH_SSA_DEF_OPERAND (def_p, stmt, iter, SSA_OP_ALL_DEFS)
{
tree var = DEF_FROM_PTR (def_p);
+   if (TMP_NAME_P (var))
+ continue;
gcc_assert (DECL_P (var));
SET_DEF (def_p, make_ssa_name (var, stmt));
register_new_def (DEF_FROM_PTR (def_p), &block_defs_stack);
Index: tree-pretty-print.c
===
*** tree-pretty-print.c (revision 108807)
--- tree-pretty-print.c (working copy)
*** dump_generic_node (pretty_printer *buffe
*** 708,713 
--- 708,717 
dump_decl_name (buffer, node, flags);
break;
  
+ case GVAR_DECL:
+   pp_printf (buffer, "", DECL_UID (node));
+   break;
+ 
  case RESULT_DECL:
pp_string (buffer, "<retval>");
break;
Index: tree.c
===
*** tree.c  (revision 108807)
--- tree.c  (working copy)
*** static const
