Register Pressure in Instruction Level Parallelism
Hello GCC developers, I am going to write a Master's Thesis about instruction scheduling based on the technique presented in [1]. This will also include a implementation. The basic idea is to have an additional pass on the data dependency graph before instruction scheduling and register allocation takes place. It estimates the minimal (register sufficiency) and maximal number of registers (register saturation) required by schedules based on that graph. If the register sufficiency is higher than the physical number of registers, spill code is added to the graph. If the register saturation is higher than the physical number of registers, artificial dependencies are added to the graph, such that the instruction scheduler is forced to generate a schedule with less registers. The aim is that the instruction scheduler can focus on the optimal arrangement of instructions to exploit a maximal amount of instruction level parallelism. [1] also suggests heuristics for all these problems. According to the author, these are "nearly optimal" in practice. The heuristics for estimating register sufficiency and saturation are both optimistic, meaning that it might still be necessary to add more spill code to the final code. Hence, this this technique is just an optimization pass and doesn't replace existing register allocation or instruction scheduling passes. [1] also has a part about loop scheduling, but my thesis will be about basic blocks only. See [2] for an presentation of this technique. So, now my questions: How much do you think could this could improve compiled code speed? Would the current GCC/YARA benefit from such an optimization pass at all? What are the chances that this could get into the main GCC tree if it shows up to be an improvement? There has been a short discussion on this mailing list already [3] some years ago (Note if you read this: intLP has been used to compare the heuristic to the optimal result only). To my knowledge, there hasn't been any more on this topic yet (anywhere). I'd prefer to implement this for the gcc, but my advisor wants me to do it for the university's own compiler. Therefore I could also need arguments why to do it for the GCC. Regards, Michael Kruse [1] http://www.prism.uvsq.fr/~touati/thesis.html [2] http://tel.archives-ouvertes.fr/docs/00/04/72/88/ANNEX/tel-7405.ppt [3] http://gcc.gnu.org/ml/gcc/2002-07/msg00565.html -- Tardyzentrismus verboten! smime.p7s Description: S/MIME Cryptographic Signature
Re: Register Pressure in Instruction Level Parallelism
[Posting this again because I noticed that I sent this to Dave Korn only] Hi Dave, Dave Korn wrote: > One of the major problems in gcc is the intertangling of instruction > selection with register allocation and spill generation. If these could be > separated it would almost certainly generate better code and be welcomed with > open arms! > The separation of these is one concern of the thesis. Although, it does not separate them completely. >> I'd prefer to implement this for the gcc, but my advisor wants me to do >> it for the university's own compiler. Therefore I could also need >> arguments why to do it for the GCC. >> > > Because destroying reload(*) would be an incalculable public service and > your name will be remembered in history as the one who slew the dragon? ;-) > Yeah, I already read the reload topic in the wiki ("...equivalent of Satan..." *g*). And it made me think about whether I really want do do that. But the good thing (for me) is that I don't have for change the reload pass for this as it is an additional pass, not a replacement. So I have to disappoint you here. That does not mean that this couldn't help in getting rid of the reload pass. After the modification of the ddg, the reload pass doesn't have to take care for optimization (very much) as this has already been done. Hence, this could greatly simplify the process. And I love your optimism :-) Btw, I guess my advisor doesn't accept your argument. The dragon on my dragon book is a very tough one *g*. And one of my advisor's arguments for not implementing it for the GCC is that their compiler would be less complicated. I can't confirm that since I don't have access to it yet. Regards, Michael Kruse -- Tardyzentrismus verboten! smime.p7s Description: S/MIME Cryptographic Signature
Re: Register Pressure in Instruction Level Parallelism
Hi, Albert Cohen schrieb: Unfortunately, the state of the art (more recent that the thesis referenced in the original email, see Touati's web page) is limited to basic block and software-pipelining scopes, and limited to scheduling. Do have any specific publication in mind? At the moment, my concerns are basic blocks only. As far as I can see from the web page, you are working on an implementation called DDG. I could not find a download link (except for some headers and examples). However, I will strongly advise anybody (= Kenny Zadeck) looking at a fresh SSA-based backend design to consider such an approach to avoid messing up with pressure-aware-* where * is any backend pass that bears a risk of blowing up register pressure. I don't know the deep internals of the gcc yet, but I think it is doable. My main work is to evaluate the effectiveness of this technique. I don't know of any later passes which introduce new register requirements. And even if... the mighty reload pass must be able to handle this. I am interested in a comparison to what there has been before. If you interested in collaboration on the topic, we are about to start extending those approaches to general control-flow, and implementing it in a production compiler (not GCC, but a free one, and not LLVM either). At the moment, I am most interested in writing a thesis. And since this is the topic, this means that I am interested, yes. Especially how you generalized it. It might also be interesting for my advisor and the professor. Otherwise, we might end up in doing redundant work. Btw, if it is a free compiler, why not telling us which one? Regards, Michael Kruse -- Tardyzentrismus verboten! smime.p7s Description: S/MIME Cryptographic Signature
Re: Register Pressure in Instruction Level Parallelism
Vladimir Makarov wrote: Michael Kruse wrote: If the register sufficiency is higher than the physical number of registers, spill code is added to the graph. For me, that is the most interesting part, unfortunately Touti (as I remember) in his thesis say practically nothing about this. In the thesis, a modified Poletto algorithm is presented to add spill code. So, now my questions: How much do you think could this could improve compiled code speed? Would the current GCC/YARA benefit from such an optimization pass at all? I think nobody can answer the questions until it is implemented. My main intention to ask this is that somebody might have said, that it was not worth the effort. Therefore, I could have saved me a lot of work. If you are going to work on this project, some small advice about evaluating register sufficiency. I found that register pressure is already practically minimized before insn-scheduling (I suspect that it is mostly done in TER). By the way, it also means that tree balancing (which is actually much more complicated issue because it is not tree but DAG) is not necessary for the backend as it was done in Preston Briggs project (and mentioned in proposed Ken Zadeck's pass stack). Thank you. I am grateful for any advice. What are the chances that this could get into the main GCC tree if it shows up to be an improvement? I don't see any problem to get the code into main GCC tree if you get even 1-2% improvement. Although there are some technical questions (like code fitting into gcc practice standards) and commitment to maintain the code. But this problems could be definitely overcome. I'd be willing to do this. I'd prefer to implement this for the gcc, but my advisor wants me to do it for the university's own compiler. Therefore I could also need arguments why to do it for the GCC. Personally, I'd love to see this work done in GCC. I believe the research work in compiler field should be done in real industrial compilers because that is a final target of the research. I saw many times that researchers report big improvements of their algorithms because they are using toy compilers but when the same algorithms are implemented in GCC they give practically nothing. For me a toy compiler criteria is defined how good they are on some generally used compiler benchmarks as SPEC (the better existing compiler performance, the harder to get the same percent improvement). So if your university compiler performance is close for example to SPECFP2000, I think you could use it to get a real optimization impact. On the other hand, you definitely will get better numbers (and spending less time) using a toy compiler than GCC and your research probably could look more successful with the academic point of view. I don't think it a toy project. Industry is involved (embedded systems) and it also has multiple back-ends. The problem with it is, that it is (at least partially) proprietary. And I don't know about the other part. However, you can't download it on the web. Regards, Michael Kruse -- Tardyzentrismus verboten! smime.p7s Description: S/MIME Cryptographic Signature
Re: Register Pressure in Instruction Level Parallelism
Vladimir Makarov schrieb: Michael Kruse wrote: In the thesis, a modified Poletto algorithm is presented to add spill code. I've just checked the thesis again. I don't think decreasing register pressure through spilling will work well. First of all Polleto linear scan RA is worse than Chaitin-Briggs approach. Even its major improvement extending linear scan is worse than Chaitin-Briggs approach. My experience with an ELS implementation in GCC has shown me this although in original article about ELS the opposite is stated (the comparison in the article was done in GCC but with the new ra project which was unsuccessful implementation of Chaitin-Briggs RA and it was done only on ppc64. I am sure that on x86/x86_64 ELS would behave even worse). That is about basic RA spill in Touti's thesis. The bigger problem is that decreasing register pressure does not take live range splitting into account what good modern RAs do. With this point of view, an approach for register pressure decrease in Bob Morgan's book is more perspective because it does also live range splitting (by the way, I tried Morgan's approach with the old RA more than 5 year ago and did not look compelling for me that time). That this algorithm is used in the thesis does not mean that I have to use that approach. Part of my thesis is also to evaluate different heuristics and compare them to each other. This one would something I'd try. Could you tell us what compiler is this? Unfortunately not (yet). I have just very few information on my own. They just keep telling me how great it is. But I will get the source this week. Regards, Michael Kruse -- Tardyzentrismus verboten! smime.p7s Description: S/MIME Cryptographic Signature