Register Pressure in Instruction Level Parallelism

2009-06-28 Thread Michael Kruse

Hello GCC developers,

I am going to write a Master's Thesis about instruction scheduling based 
on the technique presented in [1]. This will also include a implementation.


The basic idea is to have an additional pass on the data dependency 
graph before instruction scheduling and register allocation takes place. 
It estimates the minimal (register sufficiency) and maximal number of 
registers (register saturation) required by schedules based on that graph.


If the register sufficiency is higher than the physical number of 
registers, spill code is added to the graph.


If the register saturation is higher than the physical number of 
registers, artificial dependencies are added to the graph, such that the 
instruction scheduler is forced to generate a schedule with less registers.


The aim is that the instruction scheduler can focus on the optimal 
arrangement of instructions to exploit a maximal amount of instruction 
level parallelism. [1] also suggests heuristics for all these problems. 
According to the author, these are "nearly optimal" in practice. The 
heuristics for estimating register sufficiency and saturation are both 
optimistic, meaning that it might still be necessary to add more spill 
code to the final code. Hence, this this technique is just an 
optimization pass and doesn't replace existing register allocation or 
instruction scheduling passes.


[1] also has a part about loop scheduling, but my thesis will be about 
basic blocks only. See [2] for an presentation of this technique.


So, now my questions: How much do you think could this could improve 
compiled code speed? Would the current GCC/YARA benefit from such an 
optimization pass at all? What are the chances that this could get into 
the main GCC tree if it shows up to be an improvement?


There has been a short discussion on this mailing list already [3] some 
years ago (Note if you read this: intLP has been used to compare the 
heuristic to the optimal result only). To my knowledge, there hasn't 
been any more on this topic yet (anywhere).


I'd prefer to implement this for the gcc, but my advisor wants me to do 
it for the university's own compiler. Therefore I could also need 
arguments why to do it for the GCC.


Regards,
Michael Kruse

[1] http://www.prism.uvsq.fr/~touati/thesis.html
[2] http://tel.archives-ouvertes.fr/docs/00/04/72/88/ANNEX/tel-7405.ppt
[3] http://gcc.gnu.org/ml/gcc/2002-07/msg00565.html

--
Tardyzentrismus verboten!



smime.p7s
Description: S/MIME Cryptographic Signature


Re: Register Pressure in Instruction Level Parallelism

2009-06-28 Thread Michael Kruse

[Posting this again because I noticed that I sent this to Dave Korn only]

Hi Dave,

Dave Korn wrote:
>   One of the major problems in gcc is the intertangling of instruction
> selection with register allocation and spill generation.  If these 
could be
> separated it would almost certainly generate better code and be 
welcomed with

> open arms!
>  
The separation of these is one concern of the thesis. Although, it does

not separate them completely.

>> I'd prefer to implement this for the gcc, but my advisor wants me to do
>> it for the university's own compiler. Therefore I could also need
>> arguments why to do it for the GCC.
>>
>
>   Because destroying reload(*) would be an incalculable public 
service and
> your name will be remembered in history as the one who slew the 
dragon? ;-)
>  
Yeah, I already read the reload topic in the wiki ("...equivalent of

Satan..." *g*). And it made me think about whether I really want do do
that. But the good thing (for me) is that I don't have for change the
reload pass for this as it is an additional pass, not a replacement. So
I have to disappoint you here.

That does not mean that this couldn't help in getting rid of the reload
pass. After the modification of the ddg, the reload pass doesn't have to
take care for optimization (very much) as this has already been done.
Hence, this could greatly simplify the process.

And I love your optimism :-)

Btw, I guess my advisor doesn't accept your argument. The dragon on my
dragon book is a very tough one *g*. And one of my advisor's arguments for
not implementing it for the GCC is that their compiler would be less
complicated. I can't confirm that since I don't have access to it yet.

Regards,
Michael Kruse

--
Tardyzentrismus verboten!



smime.p7s
Description: S/MIME Cryptographic Signature


Re: Register Pressure in Instruction Level Parallelism

2009-06-28 Thread Michael Kruse

Hi,

Albert Cohen schrieb:


Unfortunately, the state of the art (more recent that the thesis 
referenced in the original email, see Touati's web page) is limited to 
basic block and software-pipelining scopes, and limited to scheduling.
Do have any specific publication in mind? At the moment, my concerns are 
basic blocks only. As far as I can see from the web page, you are 
working on an implementation called DDG. I could not find a download 
link (except for some headers and examples).


However, I will strongly advise anybody (= Kenny Zadeck) looking at a 
fresh SSA-based backend design to consider such an approach to avoid 
messing up with pressure-aware-* where * is any backend pass that 
bears a risk of blowing up register pressure.
I don't know the deep internals of the gcc yet, but I think it is 
doable. My main work is to evaluate the effectiveness of this technique. 
I don't know of any later passes which introduce new register 
requirements. And even if... the mighty reload pass must be able to 
handle this. I am interested in a comparison to what there has been before.


If you interested in collaboration on the topic, we are about to start 
extending those approaches to general control-flow, and implementing 
it in a production compiler (not GCC, but a free one, and not LLVM 
either).
At the moment, I am most interested in writing a thesis. And since this 
is the topic, this means that I am interested, yes. Especially how you 
generalized it.


It might also be interesting for my advisor and the professor. 
Otherwise, we might end up in doing redundant work.


Btw, if it is a free compiler, why not telling us which one?

Regards,
Michael Kruse

--
Tardyzentrismus verboten!



smime.p7s
Description: S/MIME Cryptographic Signature


Re: Register Pressure in Instruction Level Parallelism

2009-06-29 Thread Michael Kruse



Vladimir Makarov wrote:

Michael Kruse wrote:
If the register sufficiency is higher than the physical number of 
registers, spill code is added to the graph.


For me, that is the most interesting part, unfortunately Touti (as I 
remember) in his thesis say practically nothing about this.

In the thesis, a modified Poletto algorithm is presented to add spill code.



So, now my questions: How much do you think could this could improve 
compiled code speed? Would the current GCC/YARA benefit from such an 
optimization pass at all?

I think nobody can answer the questions until it is implemented.
My main intention to ask this is that somebody might have said, that it 
was not worth the effort. Therefore, I could have saved me a lot of work.


If you are going to work on this project, some small advice about 
evaluating register sufficiency.  I found that register pressure is 
already practically minimized before insn-scheduling (I suspect that 
it is mostly done in TER).  By the way, it also means that tree 
balancing (which is actually much more complicated issue because it is 
not tree but DAG) is not necessary for the backend as it was done in 
Preston Briggs project (and mentioned in proposed Ken Zadeck's pass 
stack).

Thank you. I am grateful for any advice.
What are the chances that this could get into the main GCC tree if it 
shows up to be an improvement?
I don't see any problem to get the code into main GCC tree if you get 
even 1-2% improvement.  Although there are some technical questions 
(like code fitting into gcc practice standards) and  commitment to 
maintain the code.  But this problems could be definitely overcome.

I'd be willing to do this.
I'd prefer to implement this for the gcc, but my advisor wants me to 
do it for the university's own compiler. Therefore I could also need 
arguments why to do it for the GCC.


Personally, I'd love to see this work done in GCC. I believe the 
research work in compiler field should be done in real industrial 
compilers because that is a final target of the research.  I saw many 
times that researchers report big improvements of their algorithms 
because they are using toy compilers but when the same algorithms are 
implemented in GCC they give practically nothing.  For me a toy 
compiler criteria is defined how good they are on some generally used 
compiler benchmarks as SPEC (the better existing compiler performance, 
the harder to get the same percent improvement).  So if your 
university compiler performance is close for example to SPECFP2000,  I 
think you could use it to get a real optimization impact.


On the other hand, you definitely will get better numbers (and 
spending less time) using a toy compiler than GCC and your research 
probably could look more successful with the academic point of view.
I don't think it a toy project. Industry is involved (embedded systems) 
and it also has multiple back-ends. The problem with it is, that it is 
(at least partially) proprietary. And I don't know about the other part. 
However, you can't download it on the web.


Regards,
Michael Kruse

--
Tardyzentrismus verboten!



smime.p7s
Description: S/MIME Cryptographic Signature


Re: Register Pressure in Instruction Level Parallelism

2009-06-29 Thread Michael Kruse



Vladimir Makarov schrieb:

Michael Kruse wrote:
In the thesis, a modified Poletto algorithm is presented to add spill 
code.
I've just checked the thesis again.  I don't think decreasing register 
pressure through spilling will work well.  First of all Polleto linear 
scan RA is worse than Chaitin-Briggs approach.  Even its major 
improvement extending linear scan is worse than Chaitin-Briggs 
approach.  My experience with an ELS implementation in GCC has shown 
me this although in original article about ELS the opposite is stated 
(the comparison in the article was done in GCC but with the new ra 
project which was unsuccessful implementation of Chaitin-Briggs RA and 
it was done only on ppc64.  I am sure that on x86/x86_64 ELS would 
behave even worse).  That is about basic RA spill in Touti's thesis.


The bigger problem is that decreasing register pressure does not take 
live range splitting into account what good modern RAs do.  With this 
point of view, an approach for register pressure decrease in Bob 
Morgan's book is more perspective because it does also live range 
splitting (by the way, I tried Morgan's approach with the old RA more 
than 5 year ago and did not look compelling for me that time).
That this algorithm is used in the thesis does not mean that I have to 
use that approach. Part of my thesis is also to evaluate different 
heuristics and compare them to each other. This one would something I'd try.

Could you tell us what compiler is this?
Unfortunately not (yet). I have just very few information on my own. 
They just keep telling me how great it is. But I will get the source 
this week.


Regards,
Michael Kruse

--
Tardyzentrismus verboten!



smime.p7s
Description: S/MIME Cryptographic Signature