complete_unrolli / complete_unroll

2009-08-19 Thread Albert Cohen
When debugging graphite, we ran into code bloat issues due to
pass_complete_unrolli being called very early in the non-IPA
optimization sequence. Much later, the full-blown pass_complete_unroll
is scheduled, and that one does no harm.

Strangely, this early unrolling pass (tuned to only unroll inner loops)
is only enabled at -O3, independently of the -funroll-loops flag.

Does anyone remember why it is there, which platforms it benefits,
and what perf regressions we would see if we removed it?

My guess is that it may only do harm: disabling or damaging the
effectiveness of the (loop-level) vectorizer and increasing compilation
time.

Thanks,
Albert

PS: When this question is solved, it will also be interesting to start a
serious discussion on how to improve the flexibility in customizing pass
ordering and in parameterizing passes depending on the target. Grigori
Fursin's work shows the strong benefits and already provides a working
prototype. This question is independent of whether the customization is
done by experts or by machine-learning/statistical techniques.


Re: complete_unrolli / complete_unroll

2009-08-19 Thread Albert Cohen
Richard Guenther wrote:
> 2009/8/19 Albert Cohen :
>> When debugging graphite, we ran into code bloat issues due to
>> pass_complete_unrolli being called very early in the non-IPA
>> optimization sequence. Much later, the full-blown pass_complete_unroll
>> is scheduled, and that one does no harm.
>>
>> Strangely, this early unrolling pass (tuned to only unroll inner loops)
>> is only enabled at -O3, independently of the -funroll-loops flag.
>>
>> Does anyone remember why it is there, which platforms it benefits,
>> and what perf regressions we would see if we removed it?
> 
> The early loop unrolling pass is very important to remove the
> abstraction penalty in C++ programs that choose not to unroll manually,
> relying instead on the inliner and template metaprogramming.
> 
> In tramp3d, for example, you see (very much simplified, an intermediate
> state after some inlining):
> 
>  foo (int i, int j, int k)
> {
>  double a[][][];
>  int index[3];
>  const int dX[3] = { 1, 0, 0 };
> ...
>  for (m=0; m<3; ++m)
>   index[m] = 0;
>  index[0] = i;
>  index[1] = j;
>  index[2] = k;
>   ... a[index[0]][index[1]][index[2]];
>  for (m=0; m<3; ++m)
>   index[m] += dX[m];
> ... a[index[0]][index[1]][index[2]];
> 
> etc. to access a[i][j][k] and a[i+1][j][k].
> 
> There is an absolute need to unroll these simple loops before
> CSE, otherwise loop optimizations have no chance of optimizing
> anything here.
> 
> Another benchmark that degrades considerably without early
> unrolling is 454.calculix (in fact that one was the reason to
> add this pass).
> 
>> My guess is that it may only do harm: disabling or damaging the
>> effectiveness of the (loop-level) vectorizer and increasing compilation
>> time.
> 
> No, it definitely does not.  But it has one small issue in that it
> sometimes also unrolls an outermost loop IIRC; that could be fixed.

Thanks a lot for the quick and detailed response.

It is more difficult than I thought, then :-( We'll think more, and
maybe come up with yet another pass ordering proposal, but definitely
this tramp3d code deserves to be processed by graphite AFTER
unrolling+cse has done its specialization trick.

Albert


Re: complete_unrolli / complete_unroll

2009-08-19 Thread Albert Cohen
Albert Cohen wrote:
> Thanks a lot for the quick and detailed response.
> 
> It is more difficult than I thought, then :-( We'll think more, and
> maybe come up with yet another pass ordering proposal, but definitely
> this tramp3d code deserves to be processed by graphite AFTER
> unrolling+cse has done its specialization trick.

One way out would be to make the unrolli pass a little more careful. As
you suggest, the heuristic is already not quite satisfactory, as it
sometimes unrolls outermost loops.

A better heuristic would be to run through the different cases where
unrolling helps specialization (e.g., the subscripts of subscripts in
the tramp3d example), and check for these patterns explicitly. But this
is not easy to implement (or to make robust and not too
syntax-dependent).

Albert


Re: complete_unrolli / complete_unroll

2009-08-19 Thread Albert Cohen

Richard Guenther wrote:

gfortran.dg/reassoc_4.f, the hottest loop from calculix.


Thanks.

This example is slightly different. Graphite should be able to handle it 
with loop fusion rather than pre-unrolling + cse. But I agree that the 
unrolling + cse approach also makes sense (and does not depend on the 
same legality constraints as loop fusion).


This makes me think of a simple, general criterion to detect cases where 
pre-unrolling of inner loops helps further cse and loop optimizations.
The idea is to unroll only when we can see some evidence of array 
references that are not currently loop-invariant but would become 
(outer-)loop invariant through full unrolling of some inner loop.
This can be implemented by extending the current heuristic (or Honza's 
enhancements to it) with an additional condition, only enabled when 
running the pass with the "i" (inner) flag (which should probably be 
renamed if we do implement this...).


The simplest, weakest condition I can think of would be to traverse all 
array references in the region enclosed by the loop-to-be-unrolled, 
compute the SCEV for each one, instantiate it in the loop's context, and 
check whether it depends only on the loop counter, outer loop counters, 
and parameters.
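
In rough pseudo-code, one reading of this weak condition could look like
the sketch below; the helper names are made up for the sake of
illustration and do not correspond to existing GCC functions:

  /* Return true if fully unrolling INNER may expose array subscripts
     that CSE can then specialize in the enclosing loops.  */
  static bool
  unrolling_exposes_invariants_p (struct loop *inner)
  {
    array_ref_t ref;

    /* Walk every array reference in the region enclosed by INNER
       (for_each_array_ref_in_loop is a hypothetical iterator).  */
    for_each_array_ref_in_loop (inner, ref)
      {
        /* Compute the SCEV of the subscript and instantiate it in the
           context of INNER (a hypothetical wrapper around the scalar
           evolution machinery).  */
        scev_t s = instantiate_subscript_scev (inner, ref);

        /* Weakest reading: every subscript must depend only on INNER's
           counter, outer loop counters, or parameters; a subscript fed
           by a load, like index[m] in tramp3d before unrolling, fails
           this test.  */
        if (!depends_only_on_counters_and_params_p (s, inner))
          return false;
      }
    return true;
  }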


This condition would a priori pass on the tramp3d and reassoc_4 cases. 
Yet it is probably too weak and would still pass on many codes where 
unrolling would probably not help at all... and might even harm.
If this is the case, we should consider multiple loops to be unrolled, 
and require that the combined effect of unrolling ALL of them results in 
complete instantiation of the array subscripts with constants. This is a 
very special case, again satisfied by our two motivating examples. Maybe 
it will be too specific and we'll see performance regressions... It 
remains to be investigated whether we have to go to a stricter condition 
than the first, weak one I proposed.


If this is still not clear, I can flesh out the pseudo-code above :-).

Albert


Re: complete_unrolli / complete_unroll

2009-08-20 Thread Albert Cohen
Richard Guenther wrote:
>> If this is still not clear, I can flesh out the pseudo-code above :-).
> 
> Can't we use graphite to re-roll loops?  That is, compress the
> polyhedron by introducing a new parameter?  But maybe I am
> not good at guessing what your initial bloat issue looks like.
> 
> The reason I'm asking is that there is enough code out in the
> wild that has loops with manually unrolled bodies - I have seen
> up to 16 times here.
> 

I agree that the conditions I propose are not as reliable as unrolling
and checking whether it helps. At some point, this kind of sandboxing of
the IR to explore a tree of optimizations would be interesting... except
for compilation time and memory usage :-( Great for iterative and
machine-learning optimization anyway.

Regarding your rerolling question, there is currently no general answer.
There is indeed a nice parallel between rerolling and the code
generation algorithms in CLooG. But this only covers rerolling right
after unrolling. The real problem is rerolling after a sequence of
optimizations that took advantage of the prior unrolling. In that case,
we face an algorithm equivalence problem, very general and very nasty.
Polyhedral approaches do help, but so far they have not produced much
more than very theoretical papers and toy prototypes (I can give
references if you are interested).
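
To make the "easy" direction concrete, rerolling right after unrolling
amounts to recognizing shapes like the following (my illustration only):

  /* Manually (or previously) unrolled by 4: */
  for (i = 0; i < n; i += 4)
    {
      a[i]     += b[i];
      a[i + 1] += b[i + 1];
      a[i + 2] += b[i + 2];
      a[i + 3] += b[i + 3];
    }

  /* Rerolled form a rerolling pass would have to recover
     (assuming n is a multiple of 4): */
  for (i = 0; i < n; i++)
    a[i] += b[i];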

Clearly, it would be nice to have a rerolling pass; it would also help
intra-block vectorization (there are specific papers on this, and
preliminary support in the vectorizer), but it is not something people
understand well.

We'll wait a little, but more feedback on conditions to tighten the
application of the early unrolling pass will be helpful; then one of the
Graphite developers may get his or her hands dirty on it.

Albert


Re: [graphite] Cleanup of command line parameters

2008-10-10 Thread Albert Cohen

Tobias Grosser wrote:

Hi graphites,

graphite consists of four flags: "-floop-block", "-floop-interchange",
"-floop-stripmine" and "-fgraphite".

If any of these flags is set, we enable the graphite pass and we search
for SCoPs.
For every SCoP we try to apply the transformations specified with
"-floop-block", "-floop-interchange" or "-floop-stripmine". Only if we
could apply a transformation do we replace the SCoP with new code
generated from the GRAPHITE data structures; otherwise we do not touch
the GIMPLE representation.
If we only specify "-fgraphite", we never generate code for any SCoP, as
we never try any transformation. So just using "-fgraphite" is useless.

What is missing is a way to make GRAPHITE always generate code, even if
we did not apply any transformation.

This has two benefits:
- We can check the overhead that the GIMPLE -> GRAPHITE -> GIMPLE
transformation costs.
- Wider test coverage, as we are able to run the code generator for
every SCoP, not only the SCoPs we are able to transform.


Now:
----

-fgraphite: Do nothing.
-floop-block, -floop-interchange, -floop-stripmine: Try these
transformations.

Only "-fgraphite":   Do nothing.
Only "-floop-block": Only loop-blocked SCoPs are changed.
"-fgraphite -floop-block":   Only loop-blocked SCoPs are changed.

One solution: Reuse -fgraphite
------------------------------

-fgraphite: Always enable code generation.
-floop-block, -floop-interchange, -floop-stripmine: Try these
transformations.

Only "-fgraphite":   The identity transformation for all SCoPs.
Only "-floop-block": Only loop-blocked SCoPs are changed.
"-fgraphite -floop-block":   All SCoPs are changed. Some SCoPs are
			 loop blocked, the others get just the identity 
			 transformation.


This allows us to get all the benefits, but it (mis)uses the -fgraphite
flag. At the moment the flag has no other meaning, but I could think of
it enabling an automatic optimizer that applies all available graphite
transformations or selects them automatically.

The other solution is: Use -fgraphite-identity
----------------------------------------------

-fgraphite: Always enable code generation.
-floop-block, -floop-interchange, -floop-stripmine, -fgraphite-identity:
Try these transformations.

Only "-fgraphite":   Do nothing. Free for later use.
Only "-floop-block": Only loop-blocked SCoPs are changed.
Only "-fgraphite-identity":  The identity transformation for all SCoPs.
"-fgraphite-identity
 -floop-block":  All SCoPs are changed. Some SCoPs are
			 loop blocked, the others get just the identity 
			 transformation.


This makes sense, as we only generate code for applied and enabled
transformations. These are loop blocking, interchange, stripmining and
- maybe - the new and very easy to write identity transformation, which
does nothing but enable code generation.

What do you think about these different solutions?


Hi Tobias,

I am in favor of prefixing all transformation passes with -fgraphite-, 
leading to -fgraphite-block, -fgraphite-stripmine, -fgraphite-identity, 
and the new ones to come. These flags should probably disappear at some 
point, as a more unified optimization heuristic should arise that 
combines all transformations, with some parameters to influence the 
profitability heuristic.


I agree that -fgraphite should be reserved for later use, but it should 
not be mandatory for using any of the new -fgraphite-xxx options.


This is just a suggestion, made with only a partial understanding of the 
option naming scheme in GCC!


Albert


Re: GCC & OpenCL ?

2009-02-03 Thread Albert Cohen

Ross Ridge wrote:

Basile STARYNKEVITCH writes:

It seems to me that some specifications
seem to be available. I am not a GPU expert, but
http://developer.amd.com/documentation/guides/Pages/default.aspx
contains an R8xx Family Instruction Set Architecture document at
http://developer.amd.com/gpu_assets/r600isa.pdf and at a very quick
first glance (perhaps wrongly) I feel that it could be enough to design &
write a code generator for it.


Oh, ok, that makes a world of difference.  Even with just AMD GPU
support a GCC-based OpenCL implementation becomes a lot more practical.

Ross Ridge


I am also interested in working on OpenCL support for GCC, for several 
reasons.


1. This would be a way to increase the accessibility of OpenCL to a wider 
set of users, programming environments (and source languages), and 
target platforms, including embedded ones (think of ARM/NVIDIA Tegra).


2. It would make good use of the high-level optimizations building up in 
GCC, including LTO (for specialization purposes, something LLVM is great 
at, but which could be done at a much larger scale in GCC), Graphite, 
automatic vectorization for Larrabee-like SIMD architectures mixing 
vectors and threads, etc.


3. It would certainly be a great step towards automatic parallelization 
for heterogeneous architectures, and towards research in performance 
portability in general.


Those 3 reasons (and there may be others) advocate for front-end and 
middle-end awareness of OpenCL, not only library support, and certainly 
advocate for doing it in GCC rather than anywhere else.


I think the killer argument for GCC support of OpenCL is Larrabee, and 
heterogeneous multicores in general. GCC must see those architectures as 
one single target with multiple sub-targets. This may become a survival 
issue within a couple of years.


A side question is whether GCC should become a 
single-source-multiple-target compiler, where a single compilation unit 
can lead to code generated for multiple ISAs. Note that this is not out 
of reach at all, since a short-cut exists: attributes guarding the code 
generation of some functions or variable declarations, making it 
possible to generate code for only one given target at a time.  Several 
people have tried it, and it does not require reworking any machine 
description, although multiple runs of cc1/... would still be necessary.


Feedback welcome!

Albert Cohen


Re: ANNOUNCEMENT: Generic Data Flow Analyzer for GCC

2009-03-04 Thread Albert Cohen

Emmanuel Fleury wrote:

Andrew Pinski wrote:

On Wed, Mar 4, 2009 at 8:33 AM, Manuel López-Ibáñez
 wrote:

The proxy server received an invalid response from an upstream server.


I can access without problem.

I get the same error message that Tom gets.  I am on AT&T DSL so I
doubt it is my side which is having issues.


I have no problem accessing it (from France network).


It does NOT work for me either.

Albert


Re: Automatic Parallelization & Graphite - future plans

2009-03-18 Thread Albert Cohen

Antoniu Pop wrote:
(...)

The multiple backends compilation is not directly related, so you
should use a separate branch. It makes sense to go in that direction.


Indeed.

There has been some work in the area, using different approaches. I've 
been involved in one attempt, for the Cell, with Cupertino Miranda in CC.


Cupertino: could you post the URL where one can find documentation on 
your experiments, and the (old) patch to GCC and the (old) Cell SDK for 
that purpose?


Ulrich Weigand and others at IBM have also been looking in that 
direction; I'm not sure what is available there. We planned to 
collaborate with Uri on the follow-up of Cupertino's work, but this has 
been postponed until an urgent need pops up.


Albert


Re: Automatic Parallelization & Graphite - future plans

2009-03-18 Thread Albert Cohen

Steven Bosscher wrote:

On Wed, Mar 18, 2009 at 8:17 PM, Albert Cohen  wrote:

Antoniu Pop wrote:
(...)

The multiple backends compilation is not directly related, so you
should use a separate branch. It makes sense to go in that direction.

Indeed.


Work has been going on for years in this direction, but it has never
gone very far.



There has been some work in the area, using different approaches. I've been
involved in one attempt, for the Cell, with Cupertino Miranda in CC.

Cupertino: could you post the URL where one can find documentation on your
experiments, and the (old) patch to GCC and the (old) Cell SDK for that
purpose?


What approach was taken in these experiments?


Cupertino will send you the documentation and a reference to the old 
patch sent on the gcc-patches list.


In brief, there is no hotswapping, just attributes to let the MIDDLE-END 
decide, right before expanding to RTL, which parts of the trees to keep 
and which to drop. Selection is done at the function level and at the 
variable declaration level.
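
To give an idea of the shape (the attribute name below is made up for
illustration; it is not the actual syntax of the patch):

  /* Kept only when cc1 runs for the SPE; dropped, body and all, in
     the PPE run.  */
  __attribute__ ((cell_arch ("spu")))
  void kernel (float *a, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] *= 2.0f;
  }

  /* Kept only in the PPE run; the cross-ISA call is what makes
     whole-program analysis across the boundary interesting.  */
  __attribute__ ((cell_arch ("ppu")))
  void driver (float *a, int n)
  {
    kernel (a, n);
  }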


You still need multiple runs of cc1, but they could be hidden behind a 
single run of the driver. That may not be a good idea, though, since 
different optimization flags may be relevant for the different backends 
(and even for different functions, but that is a distinct issue).


The point, for the Cell, was to perform whole-program analysis across 
ISA boundaries, e.g., looking at IPCP or at specialization and inlining. 
Another example is being able to assess the profitability of a 
transformation on a function that compiles to target X but internally 
depends on (calls) a function on target Y. You would have to assess the 
side-effects and cost, and pass static and dynamic profile data across 
the boundaries again. This was only a direction we were aiming at; we 
did not do anything there.


We struggled a lot with the data types and APIs, which differ widely and 
inconsistently between the Cell PPE and SPE... as if IBM did not think 
until very late that single-source-multiple-backend compilation was 
relevant!


Albert


Re: [graphite] Weekly phone call notes

2009-04-30 Thread Albert Cohen

Sebastian Pop wrote:

On Wed, Apr 29, 2009 at 17:15, Richard Guenther
 wrote:

I don't see how SSA form makes anything more complicated ;)



One of the difficulties was regenerating the phi nodes after code
hoisting: CLooG optimizes

for (i)
  if (invariant of i)
s += A[i];

into

if (invariant of i)
  for (i)
s += A[i];

In the transformed code you have no place to put the phi nodes that
you had after the condition.

Add to this the problem of code duplication that CLooG does sometimes:

if (invariant of i)
  for (i in domain1)
s += A[i];
  for (i in domain2)
s += A[i];
  ...

Maintaining the SSA form for s is difficult after such transforms.  If
you figure out a good way to maintain the SSA form, I'm very
interested to hear about it.


I believe the short-cut proposed by Sebastian makes sense. We never go 
out of SSA; only the hard-to-maintain-in-SSA induction variables are 
converted temporarily into single-element arrays. This is of course only 
a quick fix, but it does handle all cases. And it will not complicate a 
future rewrite of this into a nice in-SSA induction variable 
reconstruction (an unexpected problem, indeed worth investigating; maybe 
a deeper research result is hiding there).
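
Concretely, the trick amounts to the following rewrite (my illustration,
not the actual implementation):

  /* Before: the reduction variable s needs phi nodes that are hard
     to regenerate after CLooG hoists or duplicates code.  */
  s = 0.0;
  for (i = 0; i < n; i++)
    s += A[i];

  /* After: s is demoted to a single-element array, so every access
     is an explicit load or store, no phi nodes are needed, and a
     later scalar cleanup can promote it back to a register.  */
  s_mem[0] = 0.0;
  for (i = 0; i < n; i++)
    s_mem[0] = s_mem[0] + A[i];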


Albert




Interest in integer auto-upcasting pass for normalization and optimization?

2009-05-09 Thread Albert Cohen
Sebastian Pop and I have been discussing the option of designing a new 
pass, based on vrp, to normalize integer types towards a canonical 
supertype, typically a machine word (equivalent to signed long), or to 
truncate to a smaller-size word when it makes sense. This would be a 
very simple pass (on top of the not-so-simple vrp), but arguably a quite 
regression-prone one as well (due to aliasing/escape issues and common 
violations of the C standard).


The pass could be parameterized with three different objectives, 
depending on where it is scheduled in the pass manager.


(1) canonicalize to the supertype aggressively, to facilitate the 
application of further passes like autovect which require a very precise 
understanding of the type conversions;
(2) compress the types to increase the vectorization factor and reduce 
register pressure (assuming the target supports sub-word register 
allocation with register aliases);
(3) optimize the types to minimize the dynamic number of casts that 
result in actual ASM instructions.


Graphite and the vectorizer would clearly benefit from such a pass, at 
least if it implemented objective (1).
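
As a minimal sketch of what objective (1) could do, assuming vrp has
proven the widening safe (the function names are mine):

  /* Before: the narrow induction variable forces sign extensions and
     obscures the evolution of the subscript.  */
  void f (int *a, short n)
  {
    short i;
    for (i = 0; i < n; i++)
      a[i] = 2 * a[i];
  }

  /* After canonicalization to the supertype (here, signed long):  */
  void f_canonical (int *a, short n)
  {
    long i;
    for (i = 0; i < (long) n; i++)
      a[i] = 2 * a[i];
  }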


I wonder if some of this is already implemented somewhere, or if someone 
has played with it in the past, or is interested in contributing.


Nothing is planned yet on our side, and temporary fixes exist in the 
short term (as far as Graphite and the vectorizer are concerned), but 
such a pass would potentially be of great help.


Feedback welcome,
Albert


Re: Interest in integer auto-upcasting pass for normalization and optimization?

2009-05-10 Thread Albert Cohen

Richard Guenther wrote:

On Sat, May 9, 2009 at 10:42 PM, Richard Guenther
 wrote:

On Sat, May 9, 2009 at 10:07 PM, Albert Cohen  wrote:

Sebastian Pop and I have been discussing the option of designing a new pass,
based on vrp, to normalize integer types towards a canonical supertype
typically a machine word, equivalent to signed long, or to truncate to a
smaller-size word when it makes sense. This would be a very simple pass (on
top of not-so-simple vrp), but arguably a quite regression-prone one as well
(due to aliases/escape and common C standard violations).

The pass could be parameterized with three different objectives, depending
on where it is scheduled in the pass manager.

(1) canonicalize to the supertype aggressively, to facilitate the
application of further passes like autovect which require very precise
understanding of the type conversions;
(2) compress the types to increase vectorization factor and reduce register
pressure (assuming the target supports sub-word register allocation with
register aliases);
(3) optimize the types to minimize the dynamic number of casts that result
in actual ASM instructions.

Graphite and the vectorizer would clearly benefit from such a pass, at least
if it implemented objective (1).

I wonder if some of this is already implemented somewhere, or if someone
played with it in the past, or is interested in contributing.

Nothing is planned yet on our side, and temporary fixes exist in the short
term (as far as Graphite and the vectorizer are concerned), but it would
potentially be of great help.

This is certainly one useful transformation based on value-range information.
The choice of a canonical type is of course at least target dependent.


This btw. can at least partly replace the SEE (or the missed ZEE) pass.


We'll start by looking at what is done there, thanks. Not sure when we 
will start, though...



I suppose you want to do this on register variables only?  Did you think about
promoting function arguments and returns as well as part of an IPA pass?

I don't understand how register variable promotion/demotion will help graphite
though - I had the impression graphite can only work on memory.  No?


It will work at a higher level: graphite generates new loops with brand 
new induction variables, whose types are canonical (cf. your earlier 
canonical type suggestion). There is no way those canonical types can 
match the preexisting induction variables in general. This causes a lot 
of casts in some cases, confuses the vectorizer, and may even generate 
nasty SE/ZE instructions eventually.


Thank you,
Albert


Re: Register Pressure in Instruction Level Parallelism

2009-06-28 Thread Albert Cohen

Hi all,

I'm convinced that saturation- and sufficiency-based approaches are the 
future of register pressure management.


[Please keep my colleague Sid Touati in CC of this thread, until he 
registers to the list.]


Unfortunately, the state of the art (more recent than the thesis 
referenced in the original email; see Touati's web page) is limited to 
basic-block and software-pipelining scopes, and limited to scheduling. 
(Roughly, the register saturation of a dependence graph is the maximal 
register pressure over all valid schedules, and the register sufficiency 
the minimal one.)


Compared to the tasks currently managed by reload, it certainly misses a 
whole bunch of instruction selection and immediate operand/address mode 
corner cases (depending on the target). It also misses global 
scheduling, but extended-BB scheduling is not very hard to approximate, 
nor is superblock scheduling.


I'm not at all knowledgeable enough to tell you what to do in the case 
of GCC, but for sure saturation/sufficiency-based approaches alone are 
not sufficient to kill the dragon.


However, I will strongly advise anybody (= Kenny Zadeck) looking at a 
fresh SSA-based backend design to consider such an approach, to avoid 
messing with pressure-aware-* where * is any backend pass that bears a 
risk of blowing up register pressure.


If you are interested in collaborating on the topic, we are about to 
start extending those approaches to general control flow, and 
implementing them in a production compiler (not GCC, but a free one, and 
not LLVM either).


Albert


Dave Korn wrote:

Michael Kruse wrote:


So, now my questions: How much do you think this could improve
compiled code speed? Would the current GCC/YARA benefit from such an
optimization pass at all? What are the chances that this could get into
the main GCC tree if it turns out to be an improvement?


  One of the major problems in gcc is the intertangling of instruction
selection with register allocation and spill generation.  If these could be
separated it would almost certainly generate better code and be welcomed with
open arms!


I'd prefer to implement this for gcc, but my advisor wants me to do
it for the university's own compiler. Therefore I could also use some
arguments for doing it in GCC.


  Because destroying reload(*) would be an incalculable public service and
your name will be remembered in history as the one who slew the dragon? ;-)

cheers,
  DaveK




Re: Register Pressure in Instruction Level Parallelism

2009-06-28 Thread Albert Cohen

Dave Korn wrote:

(...)


I fully agree with your arguments. Managing register pressure early, and 
simplifying downstream passes (to avoid polluting them with pressure 
concerns), is a very tempting design. Yet from dream to reality there is 
a long way.



  :) I'm not up on advanced compiler theory, I'll have to sit back at this
point and leave it to the collective genii to get on with!  (I'll just suggest
that one advantage of implementing things in GCC is that it runs in such a
huge range of environments and targets, it's a good way to expose any new
algorithm to "interesting" problems caused by quirky architectures.)


Exactly. Much register allocation and scheduling work is completely 
useless for retargetable compilers aimed at real architectures, due to 
the lack of realistic constraint modeling. Of course, there are many 
other reasons that make academic research results useless; this is just 
one possible way to miss the point...


Register saturation is a very powerful concept that remains to be 
extended to real conditions and beyond the interplay of allocation and 
scheduling.


Albert



Re: Register Pressure in Instruction Level Parallelism

2009-06-30 Thread Albert Cohen

Vladimir Makarov wrote:
I've just checked the thesis again.  I don't think decreasing register 
pressure through spilling will work well.  First of all, Poletto's linear 
scan RA is worse than the Chaitin-Briggs approach.  Even its major 
improvement, extended linear scan, is worse than Chaitin-Briggs.  My 
experience with an ELS implementation in GCC has shown me this, although 
the original article about ELS states the opposite (the comparison in the 
article was done in GCC, but against the new-ra project, which was an 
unsuccessful implementation of Chaitin-Briggs RA, and it was done only on 
ppc64.  I am sure that on x86/x86_64 ELS would behave even worse).  That 
is about the basic RA spilling in Touati's thesis.


Touati's thesis has a section on spill insertion in the DDG to deal with 
excessive register sufficiency, but he did not have time to implement it.


However, there is a new implementation, in C, to compute saturation and 
sufficiency, and to insert the extra dependences and spills. It is not 
yet as complete as the previous version (which was in C++ and based on 
the proprietary LEDA graph algorithm toolkit), but it is available to 
those of you who are interested. The license will a priori become LGPL 
for the time being, but I guess Sid Touati is willing to assign his 
copyright for specific versions and transfer the code directly to GCC if 
needed. In any case, please talk to Sid directly (in CC).


Albert


Re: C++0x Memory model and gcc

2010-05-08 Thread Albert Cohen

Jean-Marc Bourguet wrote:

-fmemory-model=single
Assume single threaded execution, which also means no signal
handlers.
-fmemory-model=fast
The user is responsible for all synchronization.  Accessing
the same memory words from different threads may break
unpredictably.
-fmemory-model=safe
The compiler will do its best to protect you.


With that description, I'd think that "safe" lets the user code assume 
the sequential consistency model.  I'd use -fmemory-model=conformant or 
something like that for the model where the compiler assumes that the 
user code respects the constraints laid out for it by the standard.  As 
the constraints put on user code depend on the language -- Java has its 
own memory model, which AFAIK is more constraining than C++'s, and I 
think Ada has its own, but my Ada programming days are too far behind me 
to comment on it -- one may prefer some other name.


I agree. Or even, =c++0x or =gnu++0x

On the other hand, I fail to see the difference between =single and 
=fast, and the explanation about "the same memory word" is not really 
relevant, as memory models typically tell you about concurrent accesses 
to "different memory words".


Albert


Re: [graphite] Loop tiling

2008-06-10 Thread Albert Cohen
Hi all,

Yes, Sebastian is right for now, and thanks for pointing out the references.

Yet in the longer term, we are working on CLooG extensions to facilitate
tiling with parametric sizes, multidimensional tiling, increased
scalability, no risk of integer overflows, and better quality of the
generated code... :-).

The magic is mostly in a couple of papers by Rajopadhye and his
students, at PLDI'07 and SC'07. We are working on extending it to
affine-by-statement scheduling and imperfectly nested loops, and also to
make it run within one single run of CLooG (see those papers for details).
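
To recall the code shape under discussion below: tiling a loop with a
parametric tile size B strip-mines it into a tile loop and a point loop,
and the bound B * ii is where the multiplications Tobias asks about come
from (illustration only):

  /* Tiled version of "for (i = 0; i < n; i++) body (i);"  */
  for (ii = 0; ii * B < n; ii++)
    for (i = ii * B; i < n && i < ii * B + B; i++)
      body (i);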

More info at an upcoming phone meeting, or if Cedric wants to comment a
little more.

Albert

Sebastian Pop wrote:
> Hi Tobi,
> 
> Thanks for working on this.
> 
> Solution 2 is the right one.
> 
> On Mon, Jun 9, 2008 at 11:58 AM, Tobias Grosser
> <[EMAIL PROTECTED]> wrote:
>> 1. It is not very efficient, as we have many multiplications in the
>> code. Will later gcc optimization passes convert these multiplications
>> to additions?
>>
> 
> Yes, you should not worry about scalar optimizations at all.
> 
>> 2. I could not find any documentation about scattering functions, that
>> describe this way of using the scattering functions.
>>
> 
> I think that the best place to look at this is:
> page 4 of http://www.lri.fr/~girbal/publi/ics05_facilitating.ps
> 
> also have a look at the other reports from:
> http://www.lri.fr/~girbal/site_wrapit/
> 
> Albert probably has some other pointers for reports that describe how
> to do loop tiling.
> 
>> Another graphite specific problem is, how we connect the real loop
>> variables to the cloog variables. Before loop tiling this was easy.
>> Loops of the same depth in the cloog output correspond to the loops of
>> the same depth in the original gimple loop nest. But know we have to add
>> some data structure, that connects the gimple loops, with the cloog
>> loops.
>>
> 
> This was also the case before: we needed a map between the old
> induction variable and the new ones.
> 
> Sebastian Pop
> --
> AMD - GNU Tools


Re: [graphite] Get graphite backend working again [scalar variable handling]

2008-08-20 Thread Albert Cohen
Tobias Grosser wrote:
> I would like to improve the way how we handle scalar variables and ivs
> during graphite transformation.
> I am not sure, if I got it right and where to put my code in the backend.
> So it would be great, if you could read over my ideas.
> 
> The problem:
> 
> In graphite we represent the complete loop structure using polyhedrons to
> apply our transformations.
> To go back to gimple we have to generate new loop structures and delete
> the old loop structures.
> One step is to remove old ivs and generate new ones. 
> 
> 
> Example:
> 
> 1. original bb:
>   
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[j_21] = D.1919_8;
>   j_9 = j_21 + 1;
>   
> 
> 
> 2. Add new ivs
> 
>   graphiteIV_1 = PHI (0, graphiteIV_2)
>      
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[j_21] = D.1919_8;
>   j_9 = j_21 + 1;
>   
>   graphiteIV_2 = graphiteIV_1 + 1
> 
> 
> 3. Move ivs usages to the ivs.
> 
>   graphiteIV_1 = PHI (0, graphiteIV_2)
>    
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[graphiteIV_1] = D.1919_8;
>   j_9 = j_21 + 1;
>   
>   graphiteIV_2 = graphiteIV_1 + 1
> 
> 
> 4. remove ivs (Here {j_9, j_21})
> 
>   
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[graphiteIV_1] = D.1919_8;
>   

That's my understanding as well.

> The problem seems to become more complicated if there exist two or more
> ivs.

More complicated if you want to make sure all dead variables are removed
(with the associated definition code), but is that critical, given that
PRE/CSE will clean this up afterwards?

> How I would like to solve it:
> =
> 
> During gimple->graphite transformation we ignore all scalar variables
> that we can represent using scev. (These include all the ivs.)
> 
> Here {D.1918_7, D.1919_8} are not representable, as they depend on the
> value of a[i_22]. Therefore we cannot ignore them.
> But {j_9, j_21} are representable, and we ignore them.
> 
> During graphite we have all ivs virtually removed.

This would be very good. I fully agree, yet more motivated by dependence
removal than by clean-output goals (cf. the PRE/CSE argument).

> The next step is during code generation:
> We remove all scalar variables we ignored before, and recreate, for the
> places where they were used, new scalar variables that depend on the
> new ivs.

The tricky part might be that those so-called "new variables" have to be
named like the old ones, although plenty of new variables will have
been synthesized in the meantime.

> Why remove all and not just the ivs?
> 
> 
> There are two points.
> 
> 1. Detecting ivs is much work. So I prefer this general algorithm, as I
> hope it is easier to implement and runs faster.

Agree.

> 2. If we can ignore the scalar variables, we can also ignore their
> dependencies. So we have more freedom to schedule the bbs.

Exactly.

> 
> See you
> Tobi
> 
> P.S.: To start working on my code I would like to get block-0.c running.
> Jan, can I help you with anything?



Re: [graphite] Get graphite backend working again [scalar variable handling]

2008-08-21 Thread Albert Cohen
Tobias Grosser wrote:
> (...)
> Why is there a need to name the new ones like the old ones? The only
> reason I can think of is debugging. And I am not sure there exists
> a strong relation between old and new ones that makes it reasonable to
> call them identical. During all the code motion, we ignored the scalar
> variables completely, and the recreation using scev has only a weak
> relation to the original variables. So I think we may get completely new
> ones, as for me it does not make sense to keep this relation.
> But you seem to have a different opinion. What makes you think the new
> ones have to be called like the old ones? What relation do you see?

You are correct. It is sufficient to substitute the SSA variables
wherever they appear, as is done now.

Albert