complete_unrolli / complete_unroll
When debugging Graphite, we ran into code bloat issues due to pass_complete_unrolli being called very early in the non-IPA optimization sequence. Much later, the full-blown pass_complete_unroll is scheduled, and that one does no harm.

Strangely, this early unrolling pass (tuned to unroll only inner loops) is enabled only at -O3, independently of the -funroll-loops flag.

Does anyone remember why it is there, for which platform it is useful, and what the performance regressions would be if we removed it?

My guess is that it can only do harm: disabling or damaging the effectiveness of the (loop-level) vectorizer, and increasing compilation time.

Thanks,
Albert

PS: Once this question is solved, it would also be interesting to start a serious discussion on how to improve the flexibility in customizing pass ordering and in parameterizing passes depending on the target. Grigori Fursin's work shows the strong benefits and already provides a working prototype. This question is independent of whether the customization is done by experts or by machine-learning/statistical techniques.
Re: complete_unrolli / complete_unroll
Richard Guenther wrote:
> 2009/8/19 Albert Cohen :
>> When debugging Graphite, we ran into code bloat issues due to
>> pass_complete_unrolli being called very early in the non-IPA
>> optimization sequence. Much later, the full-blown pass_complete_unroll
>> is scheduled, and that one does no harm.
>>
>> Strangely, this early unrolling pass (tuned to unroll only inner loops)
>> is enabled only at -O3, independently of the -funroll-loops flag.
>>
>> Does anyone remember why it is there, for which platform it is useful,
>> and what the performance regressions would be if we removed it?
>
> The early loop unrolling pass is very important to remove abstraction
> penalty for C++ programs that chose not to implement manual
> unrolling by relying on the inliner and template metaprogramming.
>
> In tramp3d you for example see (very much simplified, intermediate
> state after some inlining):
>
>   foo (int i, int j, int k)
>   {
>     double a[][][];
>     int index[3];
>     const int dX[3] = { 1, 0, 0 };
>     ...
>     for (m=0; m<3; ++m)
>       index[m] = 0;
>     index[0] = i;
>     index[1] = j;
>     index[2] = k;
>     ... a[index[0]][index[1]][index[2]];
>     for (m=0; m<3; ++m)
>       index[m] += dX[m];
>     ... a[index[0]][index[1]][index[2]];
>
> etc. to access a[i][j][k] and a[i+1][j][k].
>
> There is an absolute need to unroll these simple loops before CSE,
> otherwise loop optimizations have no chance of optimizing anything
> here.
>
> Another benchmark that degrades considerably without early unrolling
> is 454.calculix (in fact, that one was the reason to add this pass).
>
>> My guess is that it can only do harm: disabling or damaging the
>> effectiveness of the (loop-level) vectorizer, and increasing
>> compilation time.
>
> No, it definitely does not. But it has one small issue in that it
> sometimes also unrolls an outermost loop IIRC; that could be fixed.

Thanks a lot for the quick and detailed response.

It is more difficult than I thought, then :-( We'll think more, and maybe come up with yet another pass ordering proposal, but this tramp3d code definitely deserves to be processed by Graphite AFTER unrolling + CSE has done its specialization trick.

Albert
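PS: To make sure I read the example right, here is roughly what early unrolling exposes (a hand-written sketch, not actual compiler output). Once both m-loops are fully unrolled, constant propagation and CSE fold the index[] computations into the subscripts:

  /* First m-loop unrolled and folded: index[] ends up { i, j, k }.  */
  ... a[i][j][k];
  /* Second m-loop unrolled: index[m] += dX[m] adds { 1, 0, 0 }.  */
  ... a[i + 1][j][k];

Without the early unrolling, each index[m] stays an opaque memory access and none of this folding can happen.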
Re: complete_unrolli / complete_unroll
Albert Cohen wrote:
> Thanks a lot for the quick and detailed response.
>
> It is more difficult than I thought, then :-( We'll think more, and
> maybe come up with yet another pass ordering proposal, but this
> tramp3d code definitely deserves to be processed by Graphite AFTER
> unrolling + CSE has done its specialization trick.

One way out would be to make the unrolli pass a little more careful. As you suggest, the heuristic is already not quite satisfactory, as it sometimes unrolls outermost loops. A better heuristic would be to go through the different cases where unrolling helps specialization (e.g., the subscripts-of-subscripts in the tramp3d example) and check for these patterns explicitly. But this is not easy to implement, or to make robust and not too syntax-dependent.

Albert
Re: complete_unrolli / complete_unroll
Richard Guenther wrote:
> gfortran.dg/reassoc_4.f, the hottest loop from calculix.

Thanks. This example is slightly different: Graphite should be able to handle it with loop fusion rather than pre-unrolling + CSE. But I agree that the unrolling + CSE approach also makes sense (and does not depend on the same legality constraints as loop fusion).

This makes me think of a simple, general criterion to detect the cases where pre-unrolling an inner loop helps further CSE and loop optimizations. The idea is to unroll only when we can see some evidence of array references that are not presently loop-invariant but that would be made (outer-)loop-invariant by fully unrolling some inner loop.

This can be implemented by complementing the current heuristic (or its complementary enhancements by Honza) with an additional condition, enabled only when running the pass with the "i" (inner) flag (which should probably be renamed if we do implement this...).

The simplest, weakest condition I can think of would be to traverse all array references in the region enclosed by the loop to be unrolled, compute the SCEV of each subscript, instantiate it in the loop's context, and check whether it depends only on the loop counter, outer loop counters, and parameters. This condition would a priori pass on the tramp3d and reassoc_4 cases. Yet it is probably too weak and will still pass on many codes where unrolling would probably not help at all... and probably harm.

If that is the case, we should consider multiple candidate loops at once, and require that the combined effect of unrolling ALL of them results in complete instantiation of the array subscripts with constants. This is a very special case, again satisfied by our two motivating examples. Maybe it will be too specific and we'll get performance regressions... It remains to be investigated whether we have to go to a stricter condition than the first, weak one I proposed.

If this is not clear, I can write some pseudo-code to clarify :-).

Albert
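PS: Since I offered pseudo-code, here is a first sketch of the weak condition. This is hand-written, GCC-flavored pseudo-code; the helper names (instantiated_scev_of, depends_only_on_ivs_and_params_p, ...) are placeholders for whatever the SCEV and data-reference machinery actually provides, not existing internal functions:

  /* Weak condition: fully unroll LOOP only if every array reference in
     its body has subscripts that, once their scalar evolution is
     instantiated in LOOP's context, depend only on LOOP's counter,
     outer loop counters, and parameters.  Full unrolling then
     instantiates those subscripts with constants of LOOP's counter
     (as with index[m] and dX[m] in tramp3d), exposing the accesses
     to CSE.  */
  static bool
  inner_unrolling_helps_cse_p (struct loop *loop)
  {
    for each array reference REF in the body of LOOP
      for each subscript SUB of REF
        {
          tree scev = instantiated_scev_of (SUB, loop);
          /* Bail out on subscripts SCEV cannot analyze at all.  */
          if (scev_is_undetermined (scev))
            return false;
          if (!depends_only_on_ivs_and_params_p (scev, loop))
            return false;
        }
    return true;
  }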
Re: complete_unrolli / complete_unroll
Richard Guenther wrote:
>> If this is not clear, I can write some pseudo-code to clarify :-).
>
> Can't we use graphite to re-roll loops? That is, compress the
> polyhedron by introducing a new parameter? But maybe I am not good at
> guessing what your initial bloat issue looks like.
>
> The reason I'm asking is that there is enough code out in the wild
> that has loops with manually unrolled bodies - I have seen up to 16
> times here.

I agree that the conditions I propose are not as reliable as actually unrolling and checking whether it helps. At some point, this kind of sandboxing of the IR to explore a tree of optimizations would be interesting... except for compilation time and memory usage :-( Great for iterative and machine-learning optimization anyway.

Regarding your rerolling question, no practical solution is currently known. There is indeed a nice parallel between rerolling and the code generation algorithms in CLooG, but it holds only for rerolling right after unrolling. The real problem is rerolling after a sequence of optimizations that took advantage of the prior unrolling. In that case we face an algorithm-equivalence problem, very general and very nasty. Polyhedral approaches do help, but so far they have produced little more than very theoretical papers and toy prototypes (I can give references if you are interested).

Clearly, it would be nice to have a rerolling pass; it would also help intra-block vectorization (there are specific papers on this, and preliminary support in the vectorizer), but it is not something people understand well.

We'll wait a little, but more feedback on conditions to tighten the applicability of the early unrolling pass would be helpful; then one of the Graphite developers may get his or her hands dirty on it.

Albert
Re: [graphite] Cleanup of command line parameters
Tobias Grosser wrote:
> Hi graphites,
>
> graphite consists of four flags: "-floop-block", "-floop-interchange",
> "-floop-stripmine" and "-fgraphite". If any of these flags is set, we
> enable the graphite pass and we search for SCoPs. For every SCoP we
> try to apply the transformations specified with "-floop-block",
> "-floop-interchange" or "-floop-stripmine". Only if we could apply a
> transformation do we replace the SCoP with new code generated from the
> GRAPHITE data structures; otherwise we do not touch the GIMPLE
> representation. If we only specify "-fgraphite", we never generate
> code for any SCoP, as we never tried any transformation. So just using
> "-fgraphite" is useless.
>
> What is missing is a way to make GRAPHITE always generate code, even
> if we did not apply any transformation. This has two benefits:
> - We can measure the overhead the GIMPLE -> GRAPHITE -> GIMPLE
>   transformation costs.
> - Wider test coverage, as we are able to run the code generator for
>   every SCoP, not only the SCoPs we are able to transform.
>
> Now:
> -fgraphite: Do nothing.
> -floop-block, -floop-interchange, -floop-stripmine: Try these
> transformations.
>
> Only "-fgraphite": Do nothing.
> Only "-floop-block": Only loop-blocked SCoPs are changed.
> "-fgraphite -floop-block": Only loop-blocked SCoPs are changed.
>
> One solution: Reuse -fgraphite
> ------------------------------
> -fgraphite: Always enable code generation.
> -floop-block, -floop-interchange, -floop-stripmine: Try these
> transformations.
>
> Only "-fgraphite": The identity transformation for all SCoPs.
> Only "-floop-block": Only loop-blocked SCoPs are changed.
> "-fgraphite -floop-block": All SCoPs are changed. Some SCoPs are
> loop-blocked, the others get just the identity transformation.
>
> This gets us all the benefits but (mis)uses the -fgraphite flag. At
> the moment it has no other meaning, but I could imagine it enabling an
> automatic optimizer that applies all available graphite
> transformations or selects them automatically.
>
> The other solution: Use -fgraphite-identity
> -------------------------------------------
> -fgraphite: Always enable code generation.
> -floop-block, -floop-interchange, -floop-stripmine,
> -fgraphite-identity: Try these transformations.
>
> Only "-fgraphite": Do nothing. Free for later use.
> Only "-floop-block": Only loop-blocked SCoPs are changed.
> Only "-fgraphite-identity": The identity transformation for all SCoPs.
> "-fgraphite-identity -floop-block": All SCoPs are changed. Some SCoPs
> are loop-blocked, the others get just the identity transformation.
>
> This makes sense, as we only generate code for applied and enabled
> transformations. These are loop blocking, interchange, strip mining
> and - maybe - the new and very easy to write identity transformation,
> which does nothing but enable code generation.
>
> What do you think about these different solutions?

Hi Tobias,

I am in favor of prefixing all transformation passes with -fgraphite-, leading to -fgraphite-block, -fgraphite-stripmine, -fgraphite-identity, and the new ones to come. These flags should probably disappear at some point, as a more unified optimization heuristic should arise that combines all transformations, with some parameters to influence the profitability heuristic.

I agree -fgraphite should be reserved for later use, but it should not be mandatory for using any of the new -fgraphite-xxx options.

This is just a suggestion, with only a partial understanding of the option naming scheme in GCC!

Albert
Re: GCC & OpenCL ?
Ross Ridge wrote:
> Basile STARYNKEVITCH writes:
>> It seems to me that some specifications are available. I am not a GPU
>> expert, but
>> http://developer.amd.com/documentation/guides/Pages/default.aspx
>> contains an R8xx Family Instruction Set Architecture document at
>> http://developer.amd.com/gpu_assets/r600isa.pdf and at a very quick
>> first glance (perhaps wrongly) I feel that it could be enough to
>> design & write a code generator for it.
>
> Oh, ok, that makes a world of difference. Even with just AMD GPU
> support, a GCC-based OpenCL implementation becomes a lot more
> practical.
>
> Ross Ridge

I am also interested in working on OpenCL support for GCC, for several reasons.

1. It would be a way to increase the accessibility of OpenCL to a wider set of users, programming environments (and source languages), and target platforms, including embedded ones (think of ARM/NVidia Tegra).

2. It would make good use of the high-level optimizations building up in GCC, including LTO (for specialization purposes, something LLVM is great at, but which could be done at a much larger scale in GCC), Graphite, and automatic vectorization for Larrabee-like SIMD architectures mixing vectors and threads.

3. It would certainly be a great step towards automatic parallelization for heterogeneous architectures, and towards research in performance portability in general.

These 3 reasons (and there may be others) advocate for front-end and middle-end awareness of OpenCL, not only library support, and certainly advocate for doing it in GCC rather than anywhere else.

I think the killer argument for GCC support of OpenCL is Larrabee, and heterogeneous multicores in general. GCC must see those architectures as one single target with multiple sub-targets. This may become a survival issue within a couple of years.

A side question is whether GCC should become a single-source-multiple-target compiler, where a single compilation unit can lead to code generated for multiple ISAs. Note that this is not out of reach at all, since a short-cut exists: attributes guarding the code generation of some functions or variable declarations, making it possible to generate code for only one target at a time. Several people have tried it, and it does not require reworking any machine description, although multiple runs of cc1/... would still be necessary.

Feedback welcome!
Albert Cohen
Re: ANNOUNCEMENT: Generic Data Flow Analyzer for GCC
Emmanuel Fleury wrote:
> Andrew Pinski wrote:
>> On Wed, Mar 4, 2009 at 8:33 AM, Manuel López-Ibáñez wrote:
>>>> The proxy server received an invalid response from an upstream
>>>> server.
>>> I can access it without problems.
>> I get the same error message that Tom gets. I am on AT&T DSL, so I
>> doubt it is my side that is having issues.
> I have no problem accessing it (from a French network).

It does NOT work for me either.

Albert
Re: Automatic Parallelization & Graphite - future plans
Antoniu Pop wrote:
> (...) The multiple backends compilation is not directly related, so
> you should use a separate branch. It makes sense to go in that
> direction.

Indeed. There has been some work in the area, using different approaches.

I've been involved in one attempt, for the Cell, with Cupertino Miranda (in CC). Cupertino: could you send the URL where to find the documentation on your experiments, and the (old) patch to GCC and the (old) Cell SDK for that purpose?

Ulrich Weigand and others at IBM have also been looking in that direction; I'm not sure what is available there. We planned to collaborate with Ulrich on the follow-up of Cupertino's work, but this has been postponed until an urgent need pops up.

Albert
Re: Automatic Parallelization & Graphite - future plans
Steven Bosscher wrote:
> On Wed, Mar 18, 2009 at 8:17 PM, Albert Cohen wrote:
>> Antoniu Pop wrote:
>>> (...) The multiple backends compilation is not directly related, so
>>> you should use a separate branch. It makes sense to go in that
>>> direction.
>>
>> Indeed.
>
> Work has been going on for years in this direction, but it has never
> gone very far.
>
>> There has been some work in the area, using different approaches.
>> I've been involved in one attempt, for the Cell, with Cupertino
>> Miranda (in CC). Cupertino: could you send the URL where to find the
>> documentation on your experiments, and the (old) patch to GCC and
>> the (old) Cell SDK for that purpose?
>
> What approach was taken in these experiments?

Cupertino will send you the documentation and a reference to the old patch sent to the gcc-patches list.

In brief, there is no hot-swapping, just attributes to let the MIDDLE-END decide, right before expanding to RTL, which parts of the trees to keep and which to drop. Selection is done at the function level and at the variable declaration level. You still need multiple runs of cc1, but they could be hidden behind a single run of the driver. That may not be a good idea, though, since different optimization flags may be relevant for the different backends (and even for the different functions, but this is a distinct issue).

The point, for the Cell, was to perform whole-program analysis across ISA boundaries, e.g., looking at IPCP or at specialization and inlining. Another example is being able to assess the profitability of a transformation on a function that compiles to target X but internally depends on (calls) a function on target Y: you would have to assess the side effects and cost, and pass static and dynamic profile data across the boundaries again. This was only an intended direction; we did not do anything there.

We struggled a lot with the data types and APIs, which differ widely and inconsistently between the Cell PPE and SPE... as if IBM did not think until very late that single-source-multiple-backend compilation was relevant!

Albert
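PS: To make the mechanism concrete, here is a hand-written sketch of the annotation style; the attribute name "target_isa" and its values are illustrative placeholders, not an existing GCC extension:

  /* Hypothetical attribute: each function is kept only by the cc1 run
     whose target matches; the other run drops it right before RTL
     expansion.  */
  __attribute__ ((target_isa ("spu")))
  void compute_kernel (float *a, int n);

  __attribute__ ((target_isa ("ppu")))
  void orchestrate (float *a, int n);

One driver invocation would then hide two cc1 runs, each keeping only the functions and variables matching its ISA, while the trees of both remain visible to whole-program analysis.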
Re: [graphite] Weekly phone call notes
Sebastian Pop wrote:
> On Wed, Apr 29, 2009 at 17:15, Richard Guenther wrote:
>> I don't see how SSA form makes anything more complicated ;)
>
> One of the difficulties was regenerating the phi nodes after code
> hoisting: CLooG optimizes
>
>   for (i)
>     if (invariant of i)
>       s += A[i];
>
> into
>
>   if (invariant of i)
>     for (i)
>       s += A[i];
>
> In the transformed code you have no place to put the phi nodes that
> you had after the condition. Add to this the problem of code
> duplication that CLooG sometimes does:
>
>   if (invariant of i)
>     for (i in domain1)
>       s += A[i];
>   for (i in domain2)
>     s += A[i];
>   ...
>
> Maintaining the SSA form for s is difficult after such transforms. If
> you figure out a good way to maintain the SSA form, I'm very
> interested to hear about it.

I believe the short-cut proposed by Sebastian makes sense. We never go out of SSA; only the hard-to-maintain-in-SSA induction variables are converted temporarily into single-element arrays.

This is of course only a quick fix, but it does handle all cases. It will not complicate a future rewrite of this into a nice in-SSA induction variable reconstruction (an unexpected problem, worth investigating indeed; maybe a deeper research result is hiding there).

Albert
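PS: A hand-written sketch of the trick on Sebastian's example (a fragment; names are invented, and this is not actual Graphite output). The reduction variable is demoted to a one-element array, so the loop body carries no phi node for it and CLooG may hoist or duplicate the loop freely; memory dependences preserve the semantics:

  /* In SSA form, s would need phi nodes at the condition and loop
     joins.  Demoted to memory, it needs none.  */
  double s_mem[1];
  s_mem[0] = 0.0;
  if (pred)                     /* the i-invariant condition */
    for (int i = 0; i < n; i++)
      s_mem[0] += A[i];         /* load-add-store; no phi node for s */
  s = s_mem[0];                 /* scalar restored after codegen */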
Interest in integer auto-upcasting pass for normalization and optimization?
Sebastian Pop and I have been discussing the option of designing a new pass, based on VRP, to normalize integer types towards a canonical supertype, typically a machine word equivalent to signed long, or to truncate to a smaller-size word when it makes sense.

This would be a very simple pass (on top of the not-so-simple VRP), but arguably a quite regression-prone one as well (due to aliasing/escape issues and common violations of the C standard).

The pass could be parameterized with three different objectives, depending on where it is scheduled in the pass manager:

(1) canonicalize to the supertype aggressively, to facilitate the application of further passes like autovect, which require a very precise understanding of the type conversions;

(2) compress the types to increase the vectorization factor and reduce register pressure (assuming the target supports sub-word register allocation with register aliases);

(3) optimize the types to minimize the dynamic number of casts that result in actual ASM instructions.

Graphite and the vectorizer would clearly benefit from such a pass, at least if it implemented objective (1).

I wonder if some of this is already implemented somewhere, or if someone has played with it in the past, or is interested in contributing. Nothing is planned yet on our side, and temporary fixes exist in the short term (as far as Graphite and the vectorizer are concerned), but it would potentially be of great help.

Feedback welcome,
Albert
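PS: A hand-written sketch of what objective (1) would do on a toy loop. This is illustrative source-level C (the pass would of course work on GIMPLE), and it assumes VRP can prove the counter never wraps:

  /* Before: the narrow IV forces a widening conversion on every
     access (explicit at the GIMPLE level), which confuses SCEV,
     dependence analysis, and the vectorizer.  */
  void f (int *a, int *b, short n)
  {
    for (short i = 0; i < n; i++)
      a[i] = b[i] + 1;
  }

  /* After upcasting: a clean machine-word IV, no per-iteration
     extensions.  */
  void f_upcast (int *a, int *b, short n)
  {
    for (long i = 0; i < (long) n; i++)
      a[i] = b[i] + 1;
  }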
Re: Interest in integer auto-upcasting pass for normalization and optimization?
Richard Guenther wrote:
> On Sat, May 9, 2009 at 10:42 PM, Richard Guenther wrote:
>> On Sat, May 9, 2009 at 10:07 PM, Albert Cohen wrote:
>>> [the integer auto-upcasting proposal above, quoted in full]
>>
>> This is certainly one useful transformation based on value-range
>> information. The choice of a canonical type is of course at least
>> target dependent.
>
> This btw. can at least partly replace the SEE (or the missed ZEE)
> pass.

We'll start by looking at what is done there, thanks. Not sure when we will start, though...

> I suppose you want to do this on register variables only? Did you
> think about promoting function arguments and returns as well, as part
> of an IPA pass?
>
> I don't understand how register variable promotion/demotion will help
> graphite though - I had the impression graphite can only work on
> memory. No?

It will work at a higher level: graphite generates new loops with brand-new induction variables, whose types are canonical (cf. your earlier canonical-type suggestion). There is no way those canonical types can match preexisting induction variables in general. This causes a lot of casts in some cases, confuses the vectorizer, and may even generate nasty SE/ZE (sign-/zero-extension) instructions eventually.

Thank you,
Albert
Re: Register Pressure in Instruction Level Parallelism
Hi all,

I'm convinced that saturation- and sufficiency-based approaches are the future of register pressure management. [Please keep my colleague Sid Touati in CC on this thread, until he registers to the list.]

Unfortunately, the state of the art (more recent than the thesis referenced in the original email; see Touati's web page) is limited to basic-block and software-pipelining scopes, and limited to scheduling. Compared to the tasks currently managed by reload, it certainly misses a whole bunch of instruction selection and immediate-operand/address-mode corner cases (depending on the target). It also misses global scheduling, but extended-BB scheduling is not very hard to approximate, and the same goes for superblock scheduling.

I'm not at all knowledgeable enough to tell you what to do in the case of GCC, but for sure saturation/sufficiency-based approaches are not sufficient to kill the dragon. However, I will strongly advise anybody (= Kenny Zadeck) looking at a fresh SSA-based backend design to consider such an approach, to avoid messing with pressure-aware-* where * is any backend pass that bears a risk of blowing up register pressure.

If you are interested in collaboration on the topic, we are about to start extending those approaches to general control flow, and implementing them in a production compiler (not GCC, but a free one, and not LLVM either).

Albert

Dave Korn wrote:
> Michael Kruse wrote:
>> So, now my questions: How much do you think this could improve
>> compiled code speed? Would the current GCC/YARA benefit from such an
>> optimization pass at all? What are the chances that this could get
>> into the main GCC tree if it shows up to be an improvement?
>
> One of the major problems in gcc is the intertangling of instruction
> selection with register allocation and spill generation. If these
> could be separated it would almost certainly generate better code and
> be welcomed with open arms!
>
>> I'd prefer to implement this for gcc, but my advisor wants me to do
>> it for the university's own compiler. Therefore I could also use
>> arguments why to do it for GCC.
>
> Because destroying reload(*) would be an incalculable public service
> and your name will be remembered in history as the one who slew the
> dragon? ;-)
>
> cheers,
> DaveK
Re: Register Pressure in Instruction Level Parallelism
Dave Korn wrote:
> (...)

I fully agree with your arguments. Managing register pressure early, and simplifying downstream passes (to avoid polluting them with pressure concerns), is a very tempting design. Yet from dream to reality there is a long way. :)

> I'm not up on advanced compiler theory, so I'll have to sit back at
> this point and leave it to the collective genii to get on with! (I'll
> just suggest that one advantage of implementing things in GCC is that
> it runs in such a huge range of environments and targets; it's a good
> way to expose any new algorithm to "interesting" problems caused by
> quirky architectures.)

Exactly. Much register allocation and scheduling work is completely useless for retargettable compilers aimed at real architectures, due to the lack of realistic constraint modeling. Of course, there are many other reasons that can make academic research results useless; this is just one possible way to miss the point...

Register saturation is a very powerful concept that remains to be extended to real conditions and beyond the interplay of allocation and scheduling.

Albert
Re: Register Pressure in Instruction Level Parallelism
Vladimir Makarov wrote:
> I've just checked the thesis again. I don't think decreasing register
> pressure through spilling will work well. First of all, Poletto's
> linear scan RA is worse than the Chaitin-Briggs approach. Even its
> major improvement, extended linear scan (ELS), is worse than
> Chaitin-Briggs. My experience with an ELS implementation in GCC has
> shown me this, although the original article about ELS states the
> opposite (the comparison in the article was done in GCC, but against
> the new-ra project, which was an unsuccessful implementation of
> Chaitin-Briggs RA, and it was done only on ppc64; I am sure that on
> x86/x86_64 ELS would behave even worse). That is about the basic RA
> spill in Touati's thesis.

Touati's thesis has a section on inserting spills into the DDG to deal with excessive register sufficiency, but he did not have time to implement it.

However, there is a new implementation, in C, that computes saturation and sufficiency and inserts the extra dependences and spills. It is not yet as complete as the previous version (which was in C++ and based on the proprietary LEDA graph algorithm toolkit), but it is available to those of you who are interested. The license will a priori become LGPL for the time being, but I guess Sid Touati is willing to assign his copyright for specific versions and transfer the code directly to GCC if needed. In any case, please talk to Sid directly (in CC).

Albert
Re: C++0x Memory model and gcc
Jean-Marc Bourguet wrote:
>> -fmemory-model=single
>>   Assume single-threaded execution, which also means no signal
>>   handlers.
>> -fmemory-model=fast
>>   The user is responsible for all synchronization. Accessing the
>>   same memory words from different threads may break unpredictably.
>> -fmemory-model=safe
>>   The compiler will do its best to protect you.
>
> With that description, I'd think that "safe" lets the user code
> assume the sequential consistency model. I'd use
> -fmemory-model=conformant or something like that for the model where
> the compiler assumes that the user code respects the constraints laid
> out for it by the standard. As the constraints put on user code
> depend on the language -- Java has its own memory model, which AFAIK
> is more constraining than C++'s, and I think Ada has its own, but my
> Ada programming days are too far behind me to comment on it -- one
> may prefer some other name.

I agree. Or even =c++0x or =gnu++0x.

On the other hand, I fail to see the difference between =single and =fast, and the explanation about "the same memory words" is not really relevant, as memory models typically tell you about concurrent accesses to "different memory words".

Albert
Re: [graphite] Loop tiling
Hi all,

Yes, Sebastian is right for now, and thanks for pointing out the references.

Yet in the longer term, we are working on CLooG extensions to facilitate tiling with parametric sizes, multidimensional tiling, increased scalability, no risk of integer overflow, and better quality of the generated code... :-). The magic is mostly in a couple of papers by Rajopadhye and his students, at PLDI'07 and SC'07. We are working on extending it to affine-by-statement scheduling and imperfectly nested loops, and also on making it run within one single run of CLooG (see those papers for details).

More info at an upcoming phone meeting, or if Cedric wants to comment a little more.

Albert

Sebastian Pop wrote:
> Hi Tobi,
>
> Thanks for working on this.
>
> Solution 2 is the right one.
>
> On Mon, Jun 9, 2008 at 11:58 AM, Tobias Grosser
> <[EMAIL PROTECTED]> wrote:
>> 1. It is not very efficient, as we have many multiplications in the
>> code. Will later gcc optimization passes convert these
>> multiplications to additions?
>
> Yes, you should not worry about scalar optimizations at all.
>
>> 2. I could not find any documentation about scattering functions
>> that describes this way of using them.
>
> I think that the best place to look at this is page 4 of
> http://www.lri.fr/~girbal/publi/ics05_facilitating.ps
>
> Also have a look at the other reports from:
> http://www.lri.fr/~girbal/site_wrapit/
>
> Albert probably has some other pointers to reports that describe how
> to do loop tiling.
>
>> Another graphite-specific problem is how we connect the real loop
>> variables to the cloog variables. Before loop tiling this was easy:
>> loops of the same depth in the cloog output correspond to the loops
>> of the same depth in the original gimple loop nest. But now we have
>> to add some data structure that connects the gimple loops with the
>> cloog loops.
>
> This was also the case before: we needed a map between the old
> induction variables and the new ones.
>
> Sebastian Pop
> --
> AMD - GNU Tools
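PS: For list readers unfamiliar with the multiplications Tobias mentions, here is a hand-written sketch of what tiling a single loop with tile size B looks like after code generation (illustrative C with a MIN macro as usual, not actual CLooG or Graphite output):

  /* Original loop.  */
  for (i = 0; i < N; i++)
    S (i);

  /* Tiled with tile size B: the scattering introduces products such
     as B * ii, which later scalar passes (IV selection, strength
     reduction) are expected to turn back into additions.  */
  for (ii = 0; ii <= (N - 1) / B; ii++)
    for (i = B * ii; i <= MIN (B * ii + B - 1, N - 1); i++)
      S (i);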
Re: [graphite] Get graphite backend working again [scalar variable handling]
Tobias Grosser wrote:
> I would like to improve the way we handle scalar variables and ivs
> during the graphite transformation. I am not sure if I got it right
> and where to put my code in the backend, so it would be great if you
> could read over my ideas.
>
> The problem:
>
> In graphite we represent the complete loop structure using polyhedra
> to apply our transformations. To go back to gimple we have to
> generate new loop structures and delete the old ones. One step is to
> remove the old ivs and generate new ones.
>
> Example:
>
> 1. Original bb:
>
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[j_21] = D.1919_8;
>   j_9 = j_21 + 1;
>
> 2. Add new ivs:
>
>   graphiteIV_1 = PHI (0, graphiteIV_2)
>
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[j_21] = D.1919_8;
>   j_9 = j_21 + 1;
>
>   graphiteIV_2 = graphiteIV_1 + 1
>
> 3. Move iv usages to the new ivs:
>
>   graphiteIV_1 = PHI (0, graphiteIV_2)
>
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[graphiteIV_1] = D.1919_8;
>   j_9 = j_21 + 1;
>
>   graphiteIV_2 = graphiteIV_1 + 1
>
> 4. Remove the old ivs (here {j_9, j_21}):
>
>   D.1918_7 = a[i_22];
>   D.1919_8 = D.1918_7 + 1;
>   a[graphiteIV_1] = D.1919_8;

That's my understanding as well.

> The problem seems to become more complicated if there exist two or
> more ivs.

More complicated if you want to make sure all dead variables are removed (with the associated definition code), but is that critical, given that a PRE/CSE pass will clean this up afterwards?

> How I would like to solve it:
> =============================
>
> During the gimple->graphite transformation we ignore all scalar
> variables that we can represent using scev. (These include all the
> ivs.)
>
> Here {D.1918_7, D.1919_8} are not representable, as they depend on
> the value of a[i_22]. Therefore we cannot ignore them. But {j_9,
> j_21} are representable and we ignore them.
>
> During graphite we have all ivs virtually removed.

This would be very good. I fully agree, yet more motivated by dependence removal than by clean-output goals (cf. the PRE/CSE argument above).

> The next step is during code generation: we remove all scalar
> variables we ignored before, and recreate, for the places where they
> were used, new scalar variables that depend on the new ivs.

The tricky part might be that those so-called "new variables" have to be named like the old ones, although plenty of new variables will have been synthesized in the meantime.

> Why remove all and not just the ivs?
> ====================================
>
> There are two points:
>
> 1. Detecting ivs is much work. So I prefer this general algorithm, as
> I hope it is easier to implement and runs faster.

Agreed.

> 2. If we can ignore the scalar variables, we can also ignore their
> dependencies. So we have more freedom to schedule the bbs.

Exactly.

> See you
> Tobi
>
> P.S.: To start working on my code I would like to get block-0.c
> running. Jan, can I help you with anything?
Re: [graphite] Get graphite backend working again [scalar variable handling]
Tobias Grosser wrote:
> (...)
> Why is there a need to name the new ones like the old ones? The only
> reason I can think of is debugging. And I am not sure there exists a
> strong relation between the old and new ones that makes it reasonable
> to call them identical. During all the code motion we ignored the
> scalar variables completely, and the recreation using scev has only a
> weak relation to the original variables. So I think we may get
> completely new ones, as for me it does not make sense to keep this
> relation. But you seem to have a different opinion. What makes you
> think the new ones have to be named like the old ones? What relation
> do you see?

You are correct. It is sufficient to substitute the SSA variables wherever they appear, as is done now.

Albert