Re: Live range shrinkage in pre-reload scheduling
Ramana Radhakrishnan writes:
> On Wed, May 14, 2014 at 5:38 PM, Richard Sandiford wrote:
>> Hey, I resent that.  You make it sound like I came up with
>> SCHED_PRESSURE_MODEL on a whim without any evidence to back it up.
>> I implemented it because it gave better EEMBC results on ARM, at
>> least at the time that I wrote it, and it didn't affect SPEC2000 for
>> ARM much one way or the other.  It also produced better results for
>> s390x on SPEC2006 at the time it was tested, which is why it was
>> turned on by default there too.
>>
>> For anyone interested in the background and rationale, the original
>> posting was here:
>> https://gcc.gnu.org/ml/gcc-patches/2011-12/msg01684.html
>
> I no longer have those results,

I don't have mine either unfortunately.

> the reason we turned them on was because IIRC there were significant
> improvements on A8 and A9 for this new weighted algorithm and it
> seemed to work well.
>
>> I'm not claiming it's a great heuristic or anything.  There's bound
>> to be room for improvement.  But it was based on "reality" and real
>> results.
>>
>> Of course, if it turns out not to be a win for ARM or s390x any more
>> then it should be disabled.
>
> The current situation that Kyrill is investigating is a case where we
> notice the first scheduler pass being a bit too aggressive with
> creating ILP opportunities with the A15 scheduler that causes
> performance differences with not turning on the first scheduler pass
> vs using the defaults.

Ah, OK.  "model" was a deliberate attempt to be less conservative than
"weighted", but it sounds like it went too far in your A15 case.
Have you got a testcase you can share?

>>> In this relation I am remembering a story told to me by Bob Morgan
>>> about the invention of bin packing RA.  It was just a quick and
>>> simple first RA implementation for a new compiler.  After that the
>>> DEC compiler team tried many times to improve the RA, implementing
>>> more complicated optimizations, but the first bin packing RA was
>>> always better.
>> You make it sound like your original -fsched-pressure is unlikely
>> to be beaten, in the way that you think bin packing wasn't beaten.
>> But both versions of -fsched-pressure are off by default on most
>> targets for a reason.  (AFAIK the only two targets that enable it by
>> default are the two that use SCHED_PRESSURE_MODEL: arm and s390x.)
>> I think this is still an area that could be improved.  I don't mind
>> whether that's through improving one of the two existing heuristics
>> or doing something different, but it seems pessimistic to say that
>> scheduling based on register pressure is always going to be the
>> optional feature that it is now.
>>
>> E.g. tracking pressure classes isn't always the right thing for
>> targets like PowerPC, where only part of the vector register set can
>> be used for floating-point operations.
>
> Is there another post that deals with this particular case?  I tried
> digging through the archives but couldn't find anything easily enough.

Not sure I ever sent it in the end.  I got pulled off Linaro before I
could properly finish this stuff off.  But IIRC the problem was that
both algorithms use IRA pressure classes to calculate the pressure.
If VSX is enabled then the pressure class for both scalar
floating-point and vector operations will be VSX_REGS.  But the scalar
operations can only use half of those registers.  So in float-heavy
code, the heuristics would assume that there are twice as many
registers available as there really are.

This hurt SCHED_PRESSURE_MODEL much more than SCHED_PRESSURE_WEIGHTED
because *_MODEL was supposed to be less conservative.  I.e. *_MODEL
would encourage ILP to the point of using twice the number of actual
registers.  IIRC the biggest reason that this didn't affect *_WEIGHTED
as much was that it deliberately treated a register as being live
indefinitely if the register is used more than once.  This is the
thing I mentioned in the write-up that Vlad said had also improved
x86_64.

Thanks,
Richard
RE: Live Range Splitting in Integrated Register Allocator
On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote:
>
> Hello All:
>
> I am planning to implement live range splitting based on the
> following cases in the Integrated Register Allocator.
>
> Consider a live range that spans from the outer region into the
> inner region of a loop: it is live-in at the entry of the loop
> header and live-out at the exit of the loop, but there are no
> references inside the loop.  Such live ranges lead to suboptimal
> spill and fetch inside the loop, conflicting with the shorter live
> ranges that span only the inside of the loop.
>
> Let's call such a live range L1.  L1 can be split at the loop
> boundary by making a store at the header of the loop and a load at
> the exit of the loop.  This makes the live range conflict less with
> the live ranges that are local to the loop region, reducing the
> spill and fetch inside the loop.
>
> From the code and documentation of the Integrated Register
> Allocator, the following is my understanding.
>
> L1 is live in the outer region but has no references inside the loop
> region.  The allocno for a given variable v is split into two
> allocnos v1 and v2, v1 being the allocno for the outer region and v2
> the allocno for the inner loop region.  This allows accumulating
> information from the inner loop region into the outer region.
>
> Will the current Integrated Register Allocator consider the live
> range L1 as live inside the loop as well as in the outer region?  If
> yes, then it will conflict with the live ranges that are local to
> the loop region, leading to spill and fetch inside the loop.  If the
> allocnos v1 and v2 are created, v1 for the outer region and v2 for
> the inner region, then v2 will conflict with the local live ranges
> inside the loop region and v1 will conflict with the live ranges of
> the outer regions.
> Is this how live range splitting at the loop boundary is handled for
> a live range that spans the loop but is not referenced in it?
>
> If such cases are not being considered in the Integrated Register
> Allocator, then it would be useful to implement them in IRA, which
> would benefit the microblaze target.
>
> Please let me know what you think.

>> Allocno v2 corresponding to the live range inside the loop has a
>> very small cost for spilling, therefore it will be spilled if we
>> still need registers for pseudos local to the loop.  If allocno v1
>> corresponding to the live ranges outside the loop *and* inside the
>> loop gets a hard register, we will have live range splitting as you
>> propose.  So I do not see a necessity for the optimization you
>> propose.

If the allocno v1 doesn't get a hard register and becomes a spill
candidate — being live across the loop but not referenced inside the
loop — it will lead to spills inside the loop, which become a
bottleneck as far as performance is concerned.  Is this case also
handled in IRA?  With my proposal the load and store would be moved
outside the loop instead of being inside the loop in such a case.

>> Moreover my experience shows that making a lot of explicit
>> transformations (e.g. the proposed splitting), even if we have
>> transformations to undo them (e.g. coalescing), results in worse
>> code.  The explicit transformations should be as minimal as
>> possible during RA in a good register allocator.  So I guess the
>> optimization you propose will actually result in worse code.
>> Although I might be wrong, because it is always hard to predict the
>> result of heuristic optimizations.

There is a paper on improved passive splitting which does take care of
splitting at loop boundaries for live ranges that span the loop but
are not referenced in it: "Improved Passive Splitting" (2005) by Keith
D. Cooper and Jason Eckhardt.
This paper uses heuristics and claims to reduce the spill and fetch
for the case I propose.  It also claims to generate better code.

>> What is really missing in RA is good splitting on BB boundaries.  I
>> have plans to try this as part of a more general pass to decrease
>> register pressure, on which I'll start work soon.
>>
>> In any case, thanks for the proposal.  You are free to try it to
>> confirm or reject my prediction.  Unfortunately, that is the only
>> way to be sure about the result.

Thanks & Regards
Ajit
Re: Live range shrinkage in pre-reload scheduling
On May 15, 2014, at 6:46 PM, Ramana Radhakrishnan wrote:
>
>> I'm not claiming it's a great heuristic or anything.  There's bound
>> to be room for improvement.  But it was based on "reality" and real
>> results.
>>
>> Of course, if it turns out not to be a win for ARM or s390x any more
>> then it should be disabled.
>
> The current situation that Kyrill is investigating is a case where we
> notice the first scheduler pass being a bit too aggressive with
> creating ILP opportunities with the A15 scheduler that causes
> performance differences with not turning on the first scheduler pass
> vs using the defaults.

Charles has a work-in-progress patch that fixes a bug in
SCHED_PRESSURE_MODEL that causes the above symptoms.  The bug causes
the 1st scheduler to unnecessarily increase live ranges of pseudo
registers when there are a lot of instructions in the ready list.

Charles, can you finish your patch in the next several days and post
it for review?

Thank you,

--
Maxim Kuvyrkov
www.linaro.org
Obscure iostream failure, a compiler bug or a language design issue?
Hello!

I had a C++ program that crashed, so I added debug output using clog.
Unfortunately the first output to clog also crashed within iostreams.
I was able to make a test case that compiles fine with -Wall -Wextra,
but crashes:

---ugly code follows---
class d
{
public:
    static void f();
    d() { f(); }
};

d dd;

#include <iostream>
using namespace std;

void d::f()
{
    clog << "jdfdf" << endl;
}

int main(int argc, char *argv[])
{
    d::f();
}
---

If you #include <iostream> before the declaration of "dd", the program
works; if compiled as displayed, it crashes:

---
Program received signal SIGSEGV, Segmentation fault.
0x77b6f501 in std::ostream::sentry::sentry(std::ostream&) ()
   from /usr/lib64/libstdc++.so.6
(gdb) bt
#0  0x77b6f501 in std::ostream::sentry::sentry(std::ostream&) ()
   from /usr/lib64/libstdc++.so.6
#1  0x77b6fc19 in std::basic_ostream<char, std::char_traits<char> >&
   std::__ostream_insert<char, std::char_traits<char> >(
   std::basic_ostream<char, std::char_traits<char> >&, char const*, long) ()
   from /usr/lib64/libstdc++.so.6
#2  0x77b7001f in std::basic_ostream<char, std::char_traits<char> >&
   std::operator<< <std::char_traits<char> >(
   std::basic_ostream<char, std::char_traits<char> >&, char const*) ()
   from /usr/lib64/libstdc++.so.6
#3  0x00400942 in d::f () at t.cc:11
#4  0x00400a6a in d::d (this=0x6011a0 <dd>) at t.cc:4
#5  0x004009d9 in __static_initialization_and_destruction_0 (
    __initialize_p=1, __priority=65535) at t.cc:7
#6  0x00400a33 in global constructors keyed to dd() () at t.cc:15
#7  0x00400b46 in __do_global_ctors_aux ()
#8  0x00400783 in _init ()
#9  0x772dcad8 in ?? () from /lib64/libc.so.6
#10 0x00400ad5 in __libc_csu_init (argc=-7936,
    argv=0x601080 <_ZSt4clog@@GLIBCXX_3.4>, envp=0x0) at elf-init.c:120
#11 0x772ebbc2 in __libc_start_main () from /lib64/libc.so.6
#12 0x00400859 in _start () at ../sysdeps/x86_64/elf/start.S:113
---

gcc is "gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE
Linux)" of SLES11 SP3 (x86_64), but newer compilers react the same.

I guess the problem is that constructors are called sequentially
without dependency analysis.  So if d() is called before <iostream> is
included (lexically), the program crashes.
Regards,
Ulrich Windl

P.S. Usually I don't write such ugly code.
Re: problem with fprintf and OpenMP
On 15.05.2014 08:01, Siegmar Gross wrote:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <omp.h>
>
> int main (void)
> {
>   #pragma omp parallel default (none)
>   fprintf (stderr, "Hello!\n");
>   return EXIT_SUCCESS;
> }
>
> I get the following error, when I try to compile the program on
> "Solaris 10 Sparc".
>
> tyr OpenMP 116 \gcc -fopenmp omp_fprintf.c
> In file included from /usr/include/stdio.h:66:0,
>                  from omp_fprintf.c:38:
> omp_fprintf.c: In function 'main':
> omp_fprintf.c:45:12: error: '__iob' not specified in enclosing parallel
>    fprintf (stderr, "Hello!\n");
>             ^
> omp_fprintf.c:44:11: error: enclosing parallel
>  #pragma omp parallel default (none)
>           ^
> tyr OpenMP 117
>
> I can solve the problem if I add "shared(__iob)" to "#pragma", but
> then the program isn't portable any longer.
>
> tyr OpenMP 118 grep stderr /usr/include/iso/stdio_iso.h
> #define stderr (&__iob[2])
> #define stderr (&_iob[2])
> tyr OpenMP 119
>
> I have a similar problem with Linux and Mac OS X, which need a
> different solution, because "stderr" is defined in a different way.

To get around this, you can add

  FILE *fp = stderr;

and use fp instead of stderr, also adding shared(fp).

-leif
Re: Obscure iostream failure, a compiler bug or a language design issue?
On 05/15/2014 09:58 AM, Ulrich Windl wrote:
> I guess the problem is that constructors are called sequentially
> without dependency analysis.  So if d() is called before <iostream>
> is included (lexically), the program crashes.

Yes, that's right.  The C++ mechanism for initialization of global
variables is fairly primitive.  I'm not sure if GCC can do much about
that because it's more or less what programmers expect, and more
reliable mechanisms would have a performance impact (as long as
dynamic linking with independent recompilation is supported).

(By the way, such questions are more appropriate for the gcc-help
mailing list.)

--
Florian Weimer / Red Hat Product Security Team
Re: problem with fprintf and OpenMP
On Thu, May 15, 2014 at 08:01:13AM +0200, Siegmar Gross wrote:
> I'm using gcc-4.9.0 and have a problem with the following program.
> I have reported the problem already on gcc-help some days ago, but
> didn't get any replies.  Perhaps somebody on this list knows if the
> behaviour is intended.

Yes, this is what the OpenMP standard requires.

> #include <stdio.h>
> #include <stdlib.h>
> #include <omp.h>
>
> int main (void)
> {
>   #pragma omp parallel default (none)
>   fprintf (stderr, "Hello!\n");
>   return EXIT_SUCCESS;
> }
>
> I can solve the problem if I add "shared(__iob)" to "#pragma", but
> then the program isn't portable any longer.

Or don't use default(none), or don't use stderr directly in the
default(none) parallel region (use a temporary variable, call a
function that implicitly uses stderr, etc.).

> In my opinion the compiler knows that I use OpenMP and that I use
> "default(none)", so that it can automatically add all necessary
> internal variables to the "shared" clause, or that

Please define what an internal variable is.  __iob is for the compiler
not an internal variable of any kind; stderr is a macro like any
other.  Note that e.g. on Linux stderr is a macro and a variable
(#define stderr stderr); treating stderr as an internal variable in
that case would be really weird.

Jakub
Re: [GSoC] writing test-case
On Thu, May 15, 2014 at 12:30 AM, Prathamesh Kulkarni wrote:
> On Wed, May 14, 2014 at 3:54 PM, Richard Biener wrote:
>> On Tue, May 13, 2014 at 11:06 PM, Prathamesh Kulkarni wrote:
>>> On Tue, May 13, 2014 at 2:36 PM, Richard Biener wrote:
>>>> On Sun, May 11, 2014 at 5:00 PM, Prathamesh Kulkarni wrote:
>>>>> On Sun, May 11, 2014 at 8:10 PM, Andreas Schwab wrote:
>>>>>> Prathamesh Kulkarni writes:
>>>>>>
>>>>>>> a) I am not able to follow why 3 slashes are required here
>>>>>>> in x_.\\\(D\\\) ?  Why does x_.\(D\) not work ?
>>>>>>
>>>>>> Two of the three backslashes are eaten by the tcl parser.  But
>>>>>> actually only two backslashes are needed, since the parens are
>>>>>> not special to tcl (but are special to the regexp engine, so you
>>>>>> want a single backslash surviving the tcl parser).
>>>>>>
>>>>>>> b) The expression after folding would be of the form:
>>>>>>> t2_ = x_(D) - y_(D)
>>>>>>> I have used the operator "." in the pattern to match a digit.
>>>>>>> While that works in the above case, I think a better idea would
>>>>>>> be to match using [0-9].
>>>>>>> I tried the following but it does not work:
>>>>>>> t_[0-9] = x_[0-9]\\\(D\\\) - y_[0-9]\\\(D\\\)
>>>>>>> Neither does \\\[ and \\\] work.
>>>>>>
>>>>>> Brackets are special in tcl (including inside double quotes), so
>>>>>> they need to be quoted.  But you want the brackets to appear
>>>>>> unquoted to the regexp engine, so a single backslash will do the
>>>>>> Right Thing.
>>>>>>
>>>>>> See tcl(n) for the tcl parsing rules.
>>>>>
>>>>> Thanks.  Now I get it, the double backslash \\ is an escape
>>>>> sequence for \, and special characters like (, [ retain their
>>>>> meaning in quotes, so to match the input text (D), the pattern
>>>>> has to be written as: "\\(D\\)".
>>>>> I believe "\(D\)" would only match D in the input ?
>>>>> I have modified the test-case.  Is this version correct ?
>>>>
>>>> I usually verify that by running the testcase in isolation on a
>>>> GCC version that should FAIL it and on one that should PASS it
>>>> (tcl quoting is also try-and-error for me most of the time...).
>>>> Thus I do
>>>>
>>>> gcc/> make check-gcc RUNTESTFLAGS="tree-ssa.exp=match-2.c"
>>>> gcc/> make cc1
>>>> ... compiles cc1 ...
>>>> gcc/> make check-gcc RUNTESTFLAGS="tree-ssa.exp=match-2.c"
>>>>
>>>> A more complete matching for an SSA name would be (allowing for
>>>> SSA name versions > 9)
>>>>
>>>> _\\d\+
>>>>
>>>> with \\(D\\) appended if suitable (that's usually known from the
>>>> testcase).  \\d\+ should match at least one decimal digit.
>>>
>>> I thought that the SSA name version wouldn't exceed 9 for that
>>> test-case, so I decided to match only one digit.  I will change it
>>> to match one or more digits.
>>>
>>> * I have written test-cases for patterns in match.pd (attached
>>> patch), which result in PASS.  Could you review them for me ?
>>> I couldn't write test-cases for the following ones:
>>>
>>> 1] Patterns involving COMPLEX_EXPR, REALPART_EXPR, IMAGPART_EXPR:
>>>
>>> (match_and_simplify
>>>   (complex (realpart @0) (imagpart @0))
>>>   @0)
>>> (match_and_simplify
>>>   (realpart (complex @0 @1))
>>>   @0)
>>> (match_and_simplify
>>>   (imagpart (complex @0 @1))
>>>   @1)
>>>
>>> Sorry to be daft, but I couldn't understand what these patterns
>>> meant (I know complex numbers).
>>> Could you give an example that matches one of these patterns ?
>>> Thanks.
>>
>> The existing match-1.c testcase has some ideas.  For the first
>> pattern I'd do
>>
>> _Complex double foo (_Complex double z)
>> {
>>   double r = __real z;
>>   double i = __imag z;
>>   return r + 1.0iF * i;
>> }
>>
>> where the return expression is folded (yeah ...) to a COMPLEX_EXPR.
>>
>> For the other two patterns sth like
>>
>> double foo (double r)
>> {
>>   _Complex double z = r;
>>   return __real z;
>> }
>>
>> and
>>
>> double foo (double i)
>> {
>>   _Complex double z = 1.0iF * i;
>>   return __imag z;
>> }
>>
>> should work.
>
> Thanks.  Now I understood the meaning of the patterns.
> The first pattern should return z instead of constructing a new
> complex number from r and i.
> However the test-case doesn't appear to work.
It probably needs building with -ffast-math; the building of the
COMPLEX_EXPR with r + 1.0iF * i triggers interesting corner-cases
otherwise.  An alternative is to write

_Complex double foo (_Complex double z)
{
  double r = __real z;
  double i = __imag z;
  _Complex double tem;
  __real tem = r;
  __imag tem = i;
  return tem;
}

> The other two transforms, real (complex x) -> real x and
> imag (complex x) -> imag x, were simplified.
>
>>> 2] Test-case for FMA_EXPR.  I am not sure how to generate FMA_EXPR
>>> from C code.
>>>
>>> (match_and_simplify
>>>   (fma INTEGER_CST_P@0 INTEGER_CST_P@1 @3)
>>>   (plus (mult @0 @1) @3))
>>
>> I believe it's not possible.  FMA is matched by the
>> optimize_widen_mult pass which runs quite late, after the last
>> forwprop pass.  So I don't think it's possible to write a testcase
>> that triggers with the existing c
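[Editorial illustration, not part of the original thread.]  As a
concrete sketch of the tcl-escaping rules Andreas described, a dejagnu
testcase might look like the following: inside the double-quoted scan
string, `\\(` leaves one backslash for the regexp engine (matching a
literal paren), and `\[` gets the bracket past tcl's command
substitution.  The pass name and the exact dump text matched by the
dg-final line are assumptions, not verified against a real dump.

```cpp
/* Hypothetical testcase sketch -- the dump string in dg-final is an
   assumption, not taken from a real forwprop dump.  */
/* { dg-do compile } */
/* { dg-options "-O -fdump-tree-forwprop1" } */

int f (int x, int y)
{
  int t1 = x + y;
  int t2 = t1 - y;   /* (x + y) - y should simplify to x */
  return t2;
}

/* One backslash must survive the tcl parser, so the source spells
   escaped parens as \\( \\) and a digit class as \[0-9\].  */
/* { dg-final { scan-tree-dump "x_\[0-9\]+\\(D\\)" "forwprop1" } } */
```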
RE: Why is this not optimized?
Thanks for the reply.  I will look at the patch.  As far as the cost
is concerned, I think fwprop doesn't really need to understand the
pipeline model.  As long as the rtx cost after optimization is less
than before optimization, I think it is good enough.  Of course, it
won't be better in every case, but it should be better in general.

Cheers,
Bingfeng

-----Original Message-----
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: 15 May 2014 06:59
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Why is this not optimized?

On Wed, May 14, 2014 at 9:14 PM, Bingfeng Mei wrote:
> Hi,
> I am looking at some code of our target, which is not optimized as
> expected.  For the following RTX, I expect the source of insn 17 to
> be propagated into insn 20, and insn 17 eliminated as a result.  On
> our target, it will become a predicated xor instruction instead of
> two instructions.  Initially, I thought the fwprop pass should do
> this.
>
> (insn 17 16 18 3 (set (reg/v:HI 102 [ crc ])
>         (xor:HI (reg/v:HI 108 [ crc ])
>             (const_int 16386 [0x4002]))) coremark.c:1632 725 {xorhi3}
>      (nil))
> (insn 18 17 19 3 (set (reg:BI 113)
>         (ne:BI (reg:QI 101 [ D.4446 ])
>             (const_int 1 [0x1]))) 1397 {cmp_qimode}
>      (nil))
> (jump_insn 19 18 55 3 (set (pc)
>         (if_then_else (ne (reg:BI 113)
>                 (const_int 0 [0]))
>             (label_ref 23)
>             (pc))) 1477 {cbranchbi4}
>      (expr_list:REG_DEAD (reg:BI 113)
>         (expr_list:REG_BR_PROB (const_int 7100 [0x1bbc])
>             (expr_list:REG_PRED_WIDTH (const_int 1 [0x1])
>                 (nil))))
>  -> 23)
> (note 55 19 20 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 20 55 23 4 (set (reg:HI 112 [ crc ])
>         (reg/v:HI 102 [ crc ])) 502 {fp_movhi}
>      (expr_list:REG_DEAD (reg/v:HI 102 [ crc ])
>         (nil)))
> (code_label 23 20 56 5 2 "" [1 uses])
>
> But it can't.  First, propagate_rtx_1 will return false because
> PR_CAN_APPEAR is false and the following code is executed.
>
>   if (x == old_rtx)
>     {
>       *px = new_rtx;
>       return can_appear;
>     }
>
> Even if I force PR_CAN_APPEAR to be set in flags, fwprop still won't
> go ahead in try_fwprop_subst because old_cost is 0 (REG-only rtx),
> and set_src_cost (SET_SRC (set), speed) is bigger than 0.  So the
> change is deemed not profitable, which is not correct IMO.

Pass fwprop is too conservative with respect to propagation
opportunities outside of memory references; it just gives up at many
places.  Also, as in your case, it doesn't seem to take into
consideration that the original insn can be removed after propagation.
Wei Mi once sent a patch re-implementing the fwprop pass at
https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00617.html .

I also did some experiments and worked out a local patch doing similar
work to handle cases exactly like yours.  The problem is that even
though one instruction can be saved (as in your case), it's not always
good, because it tends to generate more complex instructions, and such
insns are somehow more vulnerable to pipeline hazards.  Unfortunately,
it's kind of impossible for fwprop to understand the pipeline risk.

Thanks,
bin

> If fwprop is not the place to do this optimization, where should it
> be done?  I am working on up-to-date GCC 4.8.
>
> Thanks,
> Bingfeng Mei

--
Best Regards.
Re: Live range shrinkage in pre-reload scheduling
On Thu, May 15, 2014 at 8:36 AM, Maxim Kuvyrkov wrote:
> On May 15, 2014, at 6:46 PM, Ramana Radhakrishnan wrote:
>>
>>> I'm not claiming it's a great heuristic or anything.  There's bound
>>> to be room for improvement.  But it was based on "reality" and real
>>> results.
>>>
>>> Of course, if it turns out not to be a win for ARM or s390x any
>>> more then it should be disabled.
>>
>> The current situation that Kyrill is investigating is a case where
>> we notice the first scheduler pass being a bit too aggressive with
>> creating ILP opportunities with the A15 scheduler that causes
>> performance differences with not turning on the first scheduler
>> pass vs using the defaults.
>
> Charles has a work-in-progress patch that fixes a bug in
> SCHED_PRESSURE_MODEL that causes the above symptoms.  The bug causes
> the 1st scheduler to unnecessarily increase live ranges of pseudo
> registers when there are a lot of instructions in the ready list.

Is this something that you've seen show up in general integer code as
well?  Do you or Charles have an example for us to look at?  I'm not
sure what "a lot of instructions in the ready list" really means here.

The specific case Kyrill's been looking into is Dhrystone Proc_8 when
tuned for a Cortex-A15 with neon and float-abi=hard, but I am not sure
if that has "too many instructions" :) .

Kyrill, could you also look into the other cases we have from SPEC2k
where we see this as well and come back with any specific testcases
that Charles / Richard could also take a look into.

> Charles, can you finish your patch in the next several days and post
> it for review?

I think we'll await this but we'll go look into some of the
benchmarks.

Ramana
Re: [GSoC] writing test-case
>> So I came along the need to add another predicate for REAL_CST
>> leaves, which makes me wonder if we should support tree codes as
>> predicates.  Thus instead of writing
>>
>> (match_and_simplify
>>   (plus (plus @0 INTEGER_CST_P@1) INTEGER_CST_P@2)
>>   (plus @0 (plus @1 @2)))
>>
>> write
>>
>> (match_and_simplify
>>   (plus (plus @0 INTEGER_CST@1) INTEGER_CST@2)
>>   (plus @0 (plus @1 @2)))
>>
>> We have conveniently chosen lower-case without _EXPR for expression
>> codes, thus the uppercase original form is available to be used as
>> a pre-defined predicate (the code generated is TREE_CODE (x) == ).
>>
>> Just an idea ...
>
> It appears to be good.  However, would we want predicates that are
> slightly more complex ?  For example, integer_onep @0 ?

In principle predicates are not different from expressions:

(plus @0 INTEGER_CST)

is similar to

(plus @0 (INTEGER_CST))

that is, if INTEGER_CST is treated as a zero-operand "expression".

> I was wondering if it would be a good idea to make the predicate an
> attribute of the operand (expr, capture) rather than keep the
> predicate as a separate operand ?  (since predicates are only used
> for checking the operand they are applied to)
>
> The hierarchy would be something like:
>
> struct predicate_operand : public operand
> {
>   predicate_operand (const char *ident_, enum operand::type ty)
>     : operand (ty), ident (ident_) {}
>   const char *ident;
>   virtual void gen_gimple_match (FILE *f, const char *, const char *);
>   virtual void gen_gimple_transform (FILE *, const char *) = 0;
> };
>
> struct capture : public predicate_operand
> {
>   capture (const char *where_, operand *what_, const char *pred = 0)
>     : predicate_operand (pred, OP_CAPTURE), where (where_), what (what_) {}
>   const char *where;
>   operand *what;
>   virtual void gen_gimple_match (FILE *f, const char *, const char *);
>   virtual void gen_gimple_transform (FILE *f, const char *);
> };
>
> And make capture::gen_gimple_match call
> predicate_operand::gen_gimple_match if ident is non-null.
> similarly for expr.

I don't understand.  predicate is already derived from operand, and so
is capture.  How would you write

(plus @0 integer_zerop)

differently?  Here 'integer_zerop' is an operand of the plus
expression.

To me predicate (and capture without expression or predicate) differs
from expression in that a predicate is clearly a leaf of the
expression tree while we have to recurse into expression operands.
Now, if we want to support applying predicates to the midst of an
expression, like

(plus predicate(minus @0 @1) @2)
(...)

then this would no longer be true.  At the moment you'd write

(plus (minus@3 @0 @1) @2)
if (predicate (@3))
(...)

which makes it clearer IMHO (with the decision tree building you'd
apply the predicates after matching the expression tree anyway I
suppose, so code generation would be equivalent).

One exception may be types of (sub-)expressions, which we can also
check cheaply (and eventually build into the decision tree) (cheaply
== doesn't involve function calls).  So for if-expressions like

(match_and_simplify
  (plus @0 (negate @1))
  if (!TYPE_SATURATING (TREE_TYPE (@0)))
  (minus @0 @1))

we'd ideally want to group all patterns with the same type-related
if-expression to improve the matching process.  That eventually asks
for sth like

if (!TYPE_SATURATING (TREE_TYPE (@0)))
{
  (match_and_simplify
    (plus @0 (negate @1))
    (minus @0 @1))
  (match_and_simplify
    (plus (negate @0) @1)
    (minus @1 @0))
}

thus have multiple patterns with a common "outer" if-expression.  If
you look into fold-const.c we have this kind of grouping used there,
for example to group cases valid only for !FLOAT_TYPE_P or
INTEGRAL_TYPE_P.

IMHO we can think about this when the individual if-expressions become
too many.  I don't like the idea of applying predicates to non-leafs
in the match expression itself.

Richard.
Re: [GSoC] writing test-case
>>> * I have written test-cases for patterns in match.pd (attached
>>> patch), which result in PASS.  Could you review them for me ?
>>
>> Sure.  It looks good to me, though you can look at the changed
>> match-1.c testcase on the branch where I've changed the matching to
>> look for the debug output the forwprop pass dumps with
>> -fdump-tree-forwprop1-details; that makes sure no other pass before
>> did the transform (you can also move the individual dg-final lines
>> after the testcase function to more easily associate them with a
>> function).
>
> Thanks, modified the patch to scan for "gimple_match_and_simplified"
> instead.
>
>> At some point the testcase should be split up as well.
>>
>> How do you manage your sources at the moment?  Just a svn checkout
>> of the branch with local modifications?
>
> Yes.

Ok.  I have checked in the patch with the following changelog entry
(applied to gcc/ChangeLog.mas):

2014-05-15  Prathamesh Kulkarni

	* match.pd: Add more cases from tree-ssa-forwprop.c.

	testsuite/
	* gcc.dg/tree-ssa/match-2.c: New testcase.

Richard.
GCC 4.8.3 Status Report, branch frozen for release (2014-05-15)
Status
======

The 4.8 branch is now frozen as I am preparing a first release
candidate for 4.8.3.  All patches to the branch now require explicit
approval from release managers.

Previous Report
===============

https://gcc.gnu.org/ml/gcc/2014-05/msg00026.html
GCC 4.8.3 Release Candidate available from gcc.gnu.org
GCC 4.8.3 Release Candidate available from gcc.gnu.org

The first release candidate for GCC 4.8.3 is available from

  ftp://gcc.gnu.org/pub/gcc/snapshots/4.8.3-RC-20140515

and shortly from its mirrors.  It has been generated from SVN revision
210453.

I have so far bootstrapped and tested the release candidate on
x86_64-linux.  Please test it and report any issues to bugzilla.

If all goes well, I'd like to release 4.8.3 on Friday the 23rd.
Re: GCC 4.8.3 Status Report, branch frozen for release (2014-05-15)
On Thu, May 15, 2014 at 5:27 AM, Richard Biener wrote:
>
> Status
> ======
>
> The 4.8 branch is now frozen as I am preparing a first release
> candidate for 4.8.3.  All patches to the branch now require explicit
> approval from release managers.

Please hold off on GCC 4.8.3.  powerpc-linux has a potentially serious
ABI incompatibility issue related to the HTM support backported to the
4.8 branch.

Thanks, David
GCC driver to "Compile twice, score the assembly, choose the best"?
Hi, fellow GCC developers!

I was wondering if the "gcc" driver could be made to invoke "cc1"
twice, with different flags, and then just keep the better of the two
.s files that come out?

I'm sure this is not a new idea, but I'm not aware of anything being
done in this area, so I've made this post to gather your views. :)

The kinds of flags I am thinking could be toggled are register
allocation and instruction scheduling ones, since it's very hard to
find one-size-fits-all settings there and we don't really want the
user to have to know the right ones.

Obviously, compilation time will go up, but the run-time benefits
could be huge.

What are your thoughts?  What work in this area have I failed to dig
up in my limited research?

Many thanks,
Ian
Re: GCC driver to "Compile twice, score the assembly, choose the best"?
On Thu, May 15, 2014 at 1:46 PM, Ian Bolton wrote:
> Hi, fellow GCC developers!
>
> I was wondering if the "gcc" driver could be made to invoke "cc1"
> twice, with different flags, and then just keep the better of the
> two .s files that comes out?

I'd be interested in your .s comparison tool that decides which one is
better!

Richard.

> I'm sure this is not a new idea, but I'm not aware of anything being
> done in this area, so I've made this post to gather your views. :)
>
> The kinds of flags I am thinking could be toggled are register
> allocation and instruction scheduling ones, since it's very hard to
> find one-size-fits-all there and we don't really want to have the
> user depend on knowing the right one.
>
> Obviously, compilation time will go up, but the run-time benefits
> could be huge.
>
> What are your thoughts?  What work in this area have I failed to dig
> up in my limited research?
>
> Many thanks,
> Ian
RE: GCC driver to "Compile twice, score the assembly, choose the best"?
Thanks for the quick response. > On Thu, May 15, 2014 at 1:46 PM, Ian Bolton wrote: > > Hi, fellow GCC developers! > > > > I was wondering if the "gcc" driver could be made to invoke > > "cc1" twice, with different flags, and then just keep the > > better of the two .s files that comes out? > > I'd be interested in your .s comparison tool that decides which one > is better! > Well, yes, that's the hard part and it could be done a number of ways. I think it would make a good contest actually, or summer coding project. Or at least a nice subject for brainstorming at the GNU Cauldron. :) Cheers, Ian
Re: GCC driver to "Compile twice, score the assembly, choose the best"?
"Ian Bolton" writes: > I was wondering if the "gcc" driver could be made to invoke > "cc1" twice, with different flags, and then just keep the > better of the two .s files that comes out? How do you define "better"? Andreas. -- Andreas Schwab, SUSE Labs, sch...@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different."
RE: GCC driver to "Compile twice, score the assembly, choose the best"?
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of Ian Bolton > Sent: 15 May 2014 12:47 > To: gcc@gcc.gnu.org > Subject: GCC driver to "Compile twice, score the assembly, choose the > best"? > > Hi, fellow GCC developers! > > I was wondering if the "gcc" driver could be made to invoke "cc1" > twice, with different flags, and then just keep the better of the two > .s files that comes out? > > I'm sure this is not a new idea, but I'm not aware of anything being > done in this area, so I've made this post to gather your views. :) > > The kinds of flags I am thinking could be toggled are register > allocation and instruction scheduling ones, since it's very hard to > find one-size-fits-all there and we don't really want to have the user > depend on knowing the right one. > > Obviously, compilation time will go up, but the run-time benefits > could be huge. > > What are your thoughts? What work in this area have I failed to dig > up in my limited research? This looks like something that should be done outside of cc1. Other people have thought about it and what you suggest is exactly the example found in OpenTuner (http://opentuner.org/) announcement paper: http://dspace.mit.edu/handle/1721.1/81958 The difference is that you don't compare .s files but instead choose the metric based on the execution of the program on a test bench. HTH, Paulo Matos
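For concreteness, here is a minimal shell sketch of the driver idea discussed in this thread. Two caveats, both flagged by the thread itself: in reality the two candidate .s files would come from two gcc/cc1 runs with different scheduling or RA flags, and the scoring metric used here (fewer instruction lines) is a deliberately crude placeholder for a real comparison tool or, as Paulo suggests, execution-time measurement on a test bench. The file names and the `score` helper are invented for the example; the candidate files are generated inline so the script is self-contained.

```shell
#!/bin/sh
# Sketch of "compile twice, score the assembly, keep the best".
# a.s and b.s stand in for the outputs of two compiler invocations
# with different flags; they are generated inline so the script is
# self-contained.
cat > a.s <<'EOF'
    .text
    add w0, w0, w1
    add w0, w0, w2
    ret
EOF
cat > b.s <<'EOF'
    .text
    add w0, w0, w1
    ret
EOF

# Crude metric: count lines that are neither assembler directives
# nor labels, i.e. (roughly) the instruction count.
score () {
    grep -c -v -e '^[[:space:]]*\.' -e ':[[:space:]]*$' "$1"
}

# Keep whichever candidate scores lower (ties go to the first).
if [ "$(score a.s)" -le "$(score b.s)" ]; then
    cp a.s best.s
else
    cp b.s best.s
fi
echo "kept a $(score best.s)-instruction version"
```

In real use the two heredocs would be replaced by something like `gcc -S -O2 -fschedule-insns -o a.s file.c` and `gcc -S -O2 -fno-schedule-insns -o b.s file.c`, and `score` by whatever metric the brainstorm settles on.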
Re: Live range shrinkage in pre-reload scheduling
On 05/15/2014 02:46 AM, Ramana Radhakrishnan wrote: > On Wed, May 14, 2014 at 5:38 PM, Richard Sandiford > wrote: >> Vladimir Makarov writes: >>> On 2014-05-13, 6:27 AM, Kyrill Tkachov wrote: Hi all, In haifa-sched.c (in rank_for_schedule) I notice that live range shrinkage is not performed when SCHED_PRESSURE_MODEL is used, and the comment mentions that it results in much worse code. Could anyone elaborate on this? Was it just empirically noticed on x86_64? >>> It was empirically noticed on SPEC2000. Practice is the single criterion for heuristic optimizations: a new heuristic optimization might look promising, but the reality can be quite different. > Vlad - Was that based on experiments on x86_64 ? > Yes, I benchmarked x86 and x86-64. I believe this pass can help when we don't use the 1st insn scheduler (which is the x86/x86-64 case). If the 1st insn scheduler is profitable, I guess it is better to use register-pressure insn scheduling than live-range shrinkage + insn scheduling (even with register pressure). Even for x86/x86-64 the improvement from live-range shrinkage was quite small; therefore it is not a default optimization. There is a set of optimizations in GCC which improve some cases and worsen others in a practically equal number of cases. They would be good candidates for machine-learning option selection (e.g. the MILEPOST project), because it is hard for a human to predict when they help (if it were easy, we could switch these optimizations on only for such cases). I guess somebody could continue work on improving live-range shrinkage on the scheduler code base. Maybe there are better approaches to mine here.
Re: Live Range Splitting in Integrated Register Allocator
On 05/15/2014 03:28 AM, Ajit Kumar Agarwal wrote: > > On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote: > >> Hello All: >> >> I am planning to implement live range splitting for the following cases in the Integrated Register Allocator. >> >> Consider a live range that spans from an outer region into the inner region of a loop: it is live-in at the entry of the loop header and live-out at the loop exit, but has no references inside the loop. Such live ranges lead to suboptimal spills and fetches inside the loop by conflicting with the shorter live ranges local to the loop. >> >> Call such a live range L1. L1 can be split at the loop boundary by inserting a store at the loop header and a load at the loop exit. This makes L1 conflict less with the live ranges local to the loop region, reducing spills and fetches inside the loop. >> >> From the code and documentation of the Integrated Register Allocator, my understanding is as follows. L1 is live in the outer region but has no references inside the loop region. The pseudo v for L1 is assigned two allocnos, v1 for the outer region and v2 for the inner loop region; this allows information to be accumulated from the inner loop region to the outer region. >> >> Will the current Integrated Register Allocator consider live range L1 as live both inside the loop and in the outer region? If yes, it will conflict with the live ranges local to the loop region, leading to spills and fetches inside the loop. If the two allocnos are created, v1 for the outer region and v2 for the inner region, then v2 will conflict with the local live ranges inside the loop region and v1 with the live ranges of the outer region. Is this how live range splitting at the loop boundary is handled for a live range that spans the loop without being referenced in it? >> >> If such cases are not considered in the Integrated Register Allocator, it would be useful to implement them in IRA; the microblaze target would benefit. >> >> Please let me know what you think. >>> Allocno v2, corresponding to the live range inside the loop, has a very small spill cost, so it will be spilled if we still need registers for pseudos local to the loop. If allocno v1, corresponding to the live ranges outside *and* inside the loop, gets a hard register, we will have the live range splitting you propose. So I do not see a necessity for the optimization you propose. > If allocno v1 doesn't get a hard register and becomes a spill candidate -- being live across the loop but not referenced inside it -- it will lead to spills inside the loop, which become a performance bottleneck. Is this case also handled in IRA? With my proposal, the load and store would be placed outside the loop instead of inside it. > If v1 does not get a hard register (and most probably v2 too), then there will be no additional insns, because v1 and v2 most probably get the same memory. If v1 gets a hard register and v2 does not, then we have a store on the loop entry and loads on the loop exits (the store/loads are not inside the loop). >>> Moreover, my experience shows that making a lot of explicit transformations (e.g. the proposed splitting), even if we have transformations to undo them (e.g. coalescing), results in worse code. Explicit transformations should be as minimal as possible during RA in a good register allocator. So I guess the optimization you propose would actually result in worse code, although I might be wrong, because it is always hard to predict the result of heuristic optimizations. > There is a paper, "Improved Passive Splitting" (2005) by Keith D. Cooper and Jason Eckhardt, which handles splitting at loop boundaries for live ranges that span the loop without being referenced. The paper uses heuristics and claims to reduce spills and fetches for the case I propose, and to generate better code. > I read this article long ago (actually, I worked with Jason for a few years until he left us for graduate school at Rice University). Usually heuristic optimizations (including RA) are very sensitive to the compiler environment and overall design, so the effect of an algorithm can be quite different in a different compiler. Another analogous approach is Peter Bergner's. Both are based on the classical Chaitin-Briggs RA, which is good for regular register-file architectures. IRA is different: it is a regional RA.
RE: Live Range Splitting in Integrated Register Allocator
Thanks Vladimir for the clarification. Thanks & Regards Ajit -Original Message- From: Vladimir Makarov [mailto:vmaka...@redhat.com] Sent: Thursday, May 15, 2014 8:39 PM To: Ajit Kumar Agarwal; gcc@gcc.gnu.org Cc: Michael Eager; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Live Range Splitting in Integrated Register Allocator [quote of the previous message snipped]
Re: [GSoC] writing test-case
Prathamesh Kulkarni writes: > +/* { dg-final { scan-tree-dump "gimple_match_and_simplified to \[^\n\r\]*= > x_\\d\+\\(D\\) - y_\\d\+\\(D\\)" "forwprop1" } } */ No need to quote +, it's not special to tcl. Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
gcc-4.8-20140515 is now available
Snapshot gcc-4.8-20140515 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20140515/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 4.8 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch revision 210479 You'll find: gcc-4.8-20140515.tar.bz2 Complete GCC MD5=5a0c9bb6eda9ba9dcc2253b0654fd7dc SHA1=64b835f56bce6f86c278b81b48f86c6b0ec0c9b1 Diffs from 4.8-20140508 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-4.8 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.