Re: Live range shrinkage in pre-reload scheduling
Ramana Radhakrishnan writes:
> On Wed, May 14, 2014 at 5:38 PM, Richard Sandiford wrote:
>> Hey, I resent that.  You make it sound like I came up with
>> SCHED_PRESSURE_MODEL on a whim without any evidence to back it up.
>> I implemented it because it gave better EEMBC results on ARM, at
>> least at the time that I wrote it, and it didn't affect SPEC2000 for
>> ARM much one way or the other.  It also produced better results for
>> s390x on SPEC2006 at the time it was tested, which is why it was
>> turned on by default there too.
>>
>> For anyone interested in the background and rationale, the original
>> posting was here:
>> https://gcc.gnu.org/ml/gcc-patches/2011-12/msg01684.html
>
> I no longer have those results,

I don't have mine either unfortunately.

> the reason we turned them on was because IIRC there were significant
> improvements on A8 and A9 for this new weighted algorithm and it
> seemed to work well.
>
>> I'm not claiming it's a great heuristic or anything.  There's bound
>> to be room for improvement.  But it was based on "reality" and real
>> results.
>>
>> Of course, if it turns out not to be a win for ARM or s390x any more
>> then it should be disabled.
>
> The current situation that Kyrill is investigating is a case where we
> notice the first scheduler pass being a bit too aggressive with
> creating ILP opportunities with the A15 scheduler that causes
> performance differences with not turning on the first scheduler pass
> vs using the defaults.

Ah, OK.  "model" was a deliberate attempt to be less conservative than
"weighted", but it sounds like it went too far in your A15 case.
Have you got a testcase you can share?

>>> In this relation I am remembering a story told to me by Bob Morgan
>>> about the invention of bin packing RA.  It was just a quick and
>>> simple first RA implementation for a new compiler.  After that the
>>> DEC compiler team tried many times to improve the RA, implementing
>>> more complicated optimizations, but the first bin packing RA was
>>> always better.
>> You make it sound like your original -fsched-pressure is unlikely
>> to be beaten, in the way that you think bin packing wasn't beaten.
>> But both versions of -fsched-pressure are off by default on most
>> targets for a reason.  (AFAIK the only two targets that enable it by
>> default are the two that use SCHED_PRESSURE_MODEL: arm and s390x.)
>> I think this is still an area that could be improved.  I don't mind
>> whether that's through improving one of the two existing heuristics
>> or doing something different, but it seems pessimistic to say that
>> scheduling based on register pressure is always going to be the
>> optional feature that it is now.
>>
>> E.g. tracking pressure classes isn't always the right thing for
>> targets like PowerPC, where only part of the vector register set can
>> be used for floating-point operations.
>
> Is there another post that deals with this particular case?  I tried
> digging through the archives but couldn't find anything easily enough.

Not sure I ever sent it in the end.  I got pulled off Linaro before I
could properly finish this stuff off.  But IIRC the problem was that
both algorithms use IRA pressure classes to calculate the pressure.
If VSX is enabled then the pressure class for both scalar
floating-point and vector operations will be VSX_REGS.  But the scalar
operations can only use half of those registers.  So in float-heavy
code, the heuristics would assume that there are twice as many
registers available as there really are.

This hurt SCHED_PRESSURE_MODEL much more than SCHED_PRESSURE_WEIGHTED
because *_MODEL was supposed to be less conservative.  I.e. *_MODEL
would encourage ILP to the point of using twice the number of actual
registers.  IIRC the biggest reason that this didn't affect *_WEIGHTED
as much was that it deliberately treated a register as being live
indefinitely if the register is used more than once.  This is the
thing I mentioned in the write-up that Vlad said had also improved
x86_64.

Thanks,
Richard
RE: Live Range Splitting in Integrated Register Allocator
On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote:
>
> Hello All:
>
> I am planning to implement live range splitting based on the
> following cases in the Integrated Register Allocator.
>
> Consider a live range that spans from the outer region into the
> inner region of a loop: it is live-in at the entry of the loop
> header and live-out at the exit of the loop, but there are no
> references inside the loop.  Such live ranges lead to suboptimal
> spill and fetch inside the loop, conflicting with the shorter live
> ranges that span only the inside of the loop.
>
> Let's call such a live range L1.  L1 can be split at the loop
> boundary by making a store at the header of the loop and a load at
> the exit of the loop.  This makes the live range conflict less with
> the live ranges that are local to the loop region, reducing the
> spill and fetch inside the loop.
>
> From the code and documentation of the Integrated Register
> Allocator, the following is my understanding.
>
> L1 is live in the outer region but has no references inside the loop
> region.  The allocno for a given variable v is split into two
> allocnos v1 and v2, v1 being the allocno for the outer region and v2
> the allocno for the inner loop region.  This allows accumulating
> information from the inner loop region into the outer region.
>
> Will the current Integrated Register Allocator consider the live
> range L1 as live inside the loop as well as in the outer region?  If
> yes, then it will conflict with the live ranges that are local to
> the loop region, leading to spill and fetch inside the loop.  If the
> allocnos v1 and v2 are created, v1 for the outer region and v2 for
> the inner region, then v2 will conflict with the local live ranges
> inside the loop region and v1 will conflict with the live ranges of
> the outer regions.
> Is this how live range splitting at the loop boundary is handled for
> a live range that spans the loop but is not referenced in it?
>
> If such cases are not being considered in the Integrated Register
> Allocator, then it would be useful to implement them in IRA, which
> would benefit the microblaze target.
>
> Please let me know what you think.

>> Allocno v2 corresponding to the live range inside the loop has a
>> very small cost for spilling, therefore it will be spilled if we
>> still need registers for pseudos local to the loop.  If allocno v1
>> corresponding to the live ranges outside the loop *and* inside the
>> loop gets a hard register, we will have live range splitting as you
>> propose.  So I do not see a necessity for the optimization you
>> propose.

If the allocno v1 doesn't get a hard register and becomes a spill
candidate — being live across the loop but not referenced inside the
loop — it will lead to spills inside the loop, which become a
bottleneck as far as performance is concerned.  Is this case also
handled in IRA?  With my proposal the load and store would be moved
outside the loop instead of being inside the loop in such a case.

>> Moreover my experience shows that making a lot of explicit
>> transformations (e.g. the proposed splitting), even if we have
>> transformations to undo them (e.g. coalescing), results in worse
>> code.  The explicit transformations should be as minimal as
>> possible during RA in a good register allocator.  So I guess the
>> optimization you propose will actually result in worse code.
>> Although I might be wrong, because it is always hard to predict the
>> result of heuristic optimizations.

There is a paper on improved passive splitting which does take care of
splitting at loop boundaries for live ranges that span the loop but
are not referenced in it: "Improved Passive Splitting" (2005) by Keith
D. Cooper and Jason Eckhardt.
This paper uses heuristics and claims to reduce the spill and fetch
for the case I propose.  It also claims to generate better code.

>> What is really missing in RA is good splitting on BB boundaries.  I
>> have plans to try this as part of a more general pass to decrease
>> register pressure, on which I'll start work soon.
>>
>> In any case, thanks for the proposal.  You are free to try it to
>> confirm or reject my prediction.  Unfortunately, that is the only
>> way to be sure about the result.

Thanks & Regards
Ajit
Re: Live range shrinkage in pre-reload scheduling
On May 15, 2014, at 6:46 PM, Ramana Radhakrishnan wrote:
>
>> I'm not claiming it's a great heuristic or anything.  There's bound
>> to be room for improvement.  But it was based on "reality" and real
>> results.
>>
>> Of course, if it turns out not to be a win for ARM or s390x any more
>> then it should be disabled.
>
> The current situation that Kyrill is investigating is a case where we
> notice the first scheduler pass being a bit too aggressive with
> creating ILP opportunities with the A15 scheduler that causes
> performance differences with not turning on the first scheduler pass
> vs using the defaults.

Charles has a work-in-progress patch that fixes a bug in
SCHED_PRESSURE_MODEL that causes the above symptoms.  The bug causes
the 1st scheduler to unnecessarily increase live ranges of pseudo
registers when there are a lot of instructions in the ready list.

Charles, can you finish your patch in the next several days and post
it for review?

Thank you,

--
Maxim Kuvyrkov
www.linaro.org
Obscure iostream failure, a compiler bug or a language design issue?
Hello!

I had a C++ program that crashed, so I added debug output using clog.
Unfortunately the first output to clog also crashed within iostreams.
I was able to make a test case that compiles fine with -Wall -Wextra,
but crashes:

---ugly code follows---
class d
{
public:
    static void f();
    d() { f(); }
};

d dd;

#include <iostream>
using namespace std;

void d::f()
{
    clog << "jdfdf" << endl;
}

int main(int argc, char *argv[])
{
    d::f();
}
---

If you #include <iostream> before the declaration of "dd", the program
works; if compiled as displayed, it crashes:

---
Program received signal SIGSEGV, Segmentation fault.
0x77b6f501 in std::ostream::sentry::sentry(std::ostream&) ()
   from /usr/lib64/libstdc++.so.6
(gdb) bt
#0  0x77b6f501 in std::ostream::sentry::sentry(std::ostream&) ()
   from /usr/lib64/libstdc++.so.6
#1  0x77b6fc19 in std::basic_ostream<char, std::char_traits<char> >&
   std::__ostream_insert<char, std::char_traits<char> >(
   std::basic_ostream<char, std::char_traits<char> >&, char const*, long) ()
   from /usr/lib64/libstdc++.so.6
#2  0x77b7001f in std::basic_ostream<char, std::char_traits<char> >&
   std::operator<< <std::char_traits<char> >(
   std::basic_ostream<char, std::char_traits<char> >&, char const*) ()
   from /usr/lib64/libstdc++.so.6
#3  0x00400942 in d::f () at t.cc:11
#4  0x00400a6a in d::d (this=0x6011a0 <dd>) at t.cc:4
#5  0x004009d9 in __static_initialization_and_destruction_0 (
    __initialize_p=1, __priority=65535) at t.cc:7
#6  0x00400a33 in global constructors keyed to dd() () at t.cc:15
#7  0x00400b46 in __do_global_ctors_aux ()
#8  0x00400783 in _init ()
#9  0x772dcad8 in ?? () from /lib64/libc.so.6
#10 0x00400ad5 in __libc_csu_init (argc=-7936,
    argv=0x601080 <_ZSt4clog@@GLIBCXX_3.4>, envp=0x0) at elf-init.c:120
#11 0x772ebbc2 in __libc_start_main () from /lib64/libc.so.6
#12 0x00400859 in _start () at ../sysdeps/x86_64/elf/start.S:113
---

gcc is "gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE
Linux)" of SLES11 SP3 (x86_64), but newer compilers react the same.

I guess the problem is that constructors are called sequentially
without dependency analysis.  So if d() is called before <iostream> is
included (lexically), the program crashes.
Regards,
Ulrich Windl

P.S. Usually I don't write such ugly code.
Re: problem with fprintf and OpenMP
On 15.05.2014 08:01, Siegmar Gross wrote:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <omp.h>
>
> int main (void)
> {
>   #pragma omp parallel default (none)
>   fprintf (stderr, "Hello!\n");
>   return EXIT_SUCCESS;
> }
>
> I get the following error, when I try to compile the program on
> "Solaris 10 Sparc".
>
> tyr OpenMP 116 \gcc -fopenmp omp_fprintf.c
> In file included from /usr/include/stdio.h:66:0,
>                  from omp_fprintf.c:38:
> omp_fprintf.c: In function 'main':
> omp_fprintf.c:45:12: error: '__iob' not specified in enclosing parallel
>    fprintf (stderr, "Hello!\n");
>             ^
> omp_fprintf.c:44:11: error: enclosing parallel
>  #pragma omp parallel default (none)
>           ^
> tyr OpenMP 117
>
> I can solve the problem if I add "shared(__iob)" to "#pragma", but
> then the program isn't portable any longer.
>
> tyr OpenMP 118 grep stderr /usr/include/iso/stdio_iso.h
> #define stderr (&__iob[2])
> #define stderr (&_iob[2])
> tyr OpenMP 119
>
> I have a similar problem with Linux and Mac OS X, which need a
> different solution, because "stderr" is defined in a different way.

To get around this, you can add

  FILE *fp = stderr;

and use fp instead of stderr, also adding shared(fp).

-leif
Re: Obscure iostream failure, a compiler bug or a language design issue?
On 05/15/2014 09:58 AM, Ulrich Windl wrote:
> I guess the problem is that constructors are called sequentially
> without dependency analysis.  So if d() is called before <iostream>
> is included (lexically), the program crashes.

Yes, that's right.  The C++ mechanism for initialization of global
variables is fairly primitive.  I'm not sure if GCC can do much about
that because it's more or less what programmers expect, and more
reliable mechanisms would have a performance impact (as long as
dynamic linking with independent recompilation is supported).

(By the way, such questions are more appropriate for the gcc-help
mailing list.)

--
Florian Weimer / Red Hat Product Security Team
Re: problem with fprintf and OpenMP
On Thu, May 15, 2014 at 08:01:13AM +0200, Siegmar Gross wrote:
> I'm using gcc-4.9.0 and have a problem with the following program.
> I have reported the problem already on gcc-help some days ago, but
> didn't get any replies.  Perhaps somebody on this list knows if the
> behaviour is intended.

Yes, this is what the OpenMP standard requires.

> #include <stdio.h>
> #include <stdlib.h>
> #include <omp.h>
>
> int main (void)
> {
>   #pragma omp parallel default (none)
>   fprintf (stderr, "Hello!\n");
>   return EXIT_SUCCESS;
> }
>
> I can solve the problem if I add "shared(__iob)" to "#pragma", but
> then the program isn't portable any longer.

Or don't use default(none), or don't use stderr directly in the
default(none) parallel region (use a temporary variable, call a
function that implicitly uses stderr, etc.).

> In my opinion the compiler knows that I use OpenMP and that I use
> "default(none)", so that it can automatically add all necessary
> internal variables to the "shared" clause, or that

Please define what an internal variable is.  __iob is for the compiler
not an internal variable of any kind; stderr is a macro like any
other.  Note that e.g. on Linux stderr is a macro and a variable
(#define stderr stderr); treating stderr as an internal variable in
that case would be really weird.

Jakub
Re: [GSoC] writing test-case
On Thu, May 15, 2014 at 12:30 AM, Prathamesh Kulkarni wrote:
> On Wed, May 14, 2014 at 3:54 PM, Richard Biener wrote:
>> On Tue, May 13, 2014 at 11:06 PM, Prathamesh Kulkarni wrote:
>>> On Tue, May 13, 2014 at 2:36 PM, Richard Biener wrote:
>>>> On Sun, May 11, 2014 at 5:00 PM, Prathamesh Kulkarni wrote:
>>>>> On Sun, May 11, 2014 at 8:10 PM, Andreas Schwab wrote:
>>>>>> Prathamesh Kulkarni writes:
>>>>>>
>>>>>>> a) I am not able to follow why 3 slashes are required here
>>>>>>> in x_.\\\(D\\\) ?  Why does x_.\(D\) not work ?
>>>>>>
>>>>>> Two of the three backslashes are eaten by the tcl parser.  But
>>>>>> actually only two backslashes are needed, since the parens are
>>>>>> not special to tcl (but are special to the regexp engine, so you
>>>>>> want a single backslash surviving the tcl parser).
>>>>>>
>>>>>>> b) The expression after folding would be of the form:
>>>>>>> t2_ = x_(D) - y_(D)
>>>>>>> I have used the operator "." in the pattern to match a digit.
>>>>>>> While that works in the above case, I think a better idea would
>>>>>>> be to match using [0-9].
>>>>>>> I tried the following but it does not work:
>>>>>>> t_[0-9] = x_[0-9]\\\(D\\\) - y_[0-9]\\\(D\\\)
>>>>>>> Neither does \\\[ and \\\] work.
>>>>>>
>>>>>> Brackets are special in tcl (including inside double quotes), so
>>>>>> they need to be quoted.  But you want the brackets to appear
>>>>>> unquoted to the regexp engine, so a single backslash will do the
>>>>>> Right Thing.
>>>>>>
>>>>>> See tcl(n) for the tcl parsing rules.
>>>>>
>>>>> Thanks.  Now I get it, the double backslash \\ is an escape
>>>>> sequence for \, and special characters like (, [ retain their
>>>>> meaning in quotes, so to match the input text (D), the pattern
>>>>> has to be written as: "\\(D\\)".
>>>>> I believe "\(D\)" would only match D in the input ?
>>>>> I have modified the test-case.  Is this version correct ?
>>>>
>>>> I usually verify that by running the testcase in isolation on a
>>>> GCC version that should FAIL it and on one that should PASS it
>>>> (tcl quoting is also try-and-error for me most of the time...).
>>>> Thus I do
>>>>
>>>> gcc/> make check-gcc RUNTESTFLAGS="tree-ssa.exp=match-2.c"
>>>> gcc/> make cc1
>>>> ... compiles cc1 ...
>>>> gcc/> make check-gcc RUNTESTFLAGS="tree-ssa.exp=match-2.c"
>>>>
>>>> A more complete matching for an SSA name would be (allowing for
>>>> SSA name versions > 9)
>>>>
>>>> _\\d\+
>>>>
>>>> with \\(D\\) appended if suitable (that's usually known from the
>>>> testcase).  \\d\+ should match at least one decimal digit.
>>>
>>> I thought that the SSA name version wouldn't exceed 9 for that
>>> test-case, so I decided to match only one digit.  I will change it
>>> to match one or more digits.
>>>
>>> * I have written test-cases for patterns in match.pd (attached
>>> patch), which result in PASS.  Could you review them for me ?
>>> I couldn't write test-cases for the following ones:
>>>
>>> 1] Patterns involving COMPLEX_EXPR, REALPART_EXPR, IMAGPART_EXPR:
>>>
>>> (match_and_simplify
>>>   (complex (realpart @0) (imagpart @0))
>>>   @0)
>>> (match_and_simplify
>>>   (realpart (complex @0 @1))
>>>   @0)
>>> (match_and_simplify
>>>   (imagpart (complex @0 @1))
>>>   @1)
>>>
>>> Sorry to be daft, but I couldn't understand what these patterns
>>> meant (I know complex numbers).
>>> Could you give an example that matches one of these patterns ?
>>> Thanks.
>>
>> The existing match-1.c testcase has some ideas.  For the first
>> pattern I'd do
>>
>> _Complex double foo (_Complex double z)
>> {
>>   double r = __real z;
>>   double i = __imag z;
>>   return r + 1.0iF * i;
>> }
>>
>> where the return expression is folded (yeah ...) to a COMPLEX_EXPR.
>>
>> For the other two patterns sth like
>>
>> double foo (double r)
>> {
>>   _Complex double z = r;
>>   return __real z;
>> }
>>
>> and
>>
>> double foo (double i)
>> {
>>   _Complex double z = 1.0iF * i;
>>   return __imag z;
>> }
>>
>> should work.
>
> Thanks.  Now I understood the meaning of the patterns.
> The first pattern should return z instead of constructing a new
> complex number from r and i.
> However the test-case doesn't appear to work.
It probably needs building with -ffast-math; the building of the
COMPLEX_EXPR with r + 1.0iF * i triggers interesting corner-cases
otherwise.  An alternative is to write

_Complex double foo (_Complex double z)
{
  double r = __real z;
  double i = __imag z;
  _Complex double tem;
  __real tem = r;
  __imag tem = i;
  return tem;
}

> The other two transforms, real (complex x) -> real x and
> imag (complex x) -> imag x, were simplified.
>
>>> 2] Test-case for FMA_EXPR.  I am not sure how to generate FMA_EXPR
>>> from C code.
>>>
>>> (match_and_simplify
>>>   (fma INTEGER_CST_P@0 INTEGER_CST_P@1 @3)
>>>   (plus (mult @0 @1) @3))
>>
>> I believe it's not possible.  FMA is matched by the
>> optimize_widen_mult pass which runs quite late, after the last
>> forwprop pass.  So I don't think it's possible to write a testcase
>> that triggers with the existing c
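[Editorial illustration, not part of the original thread.]  As a
concrete sketch of the tcl-escaping rules Andreas described, a dejagnu
testcase might look like the following: inside the double-quoted scan
string, `\\(` leaves one backslash for the regexp engine (matching a
literal paren), and `\[` gets the bracket past tcl's command
substitution.  The pass name and the exact dump text matched by the
dg-final line are assumptions, not verified against a real dump.

```cpp
/* Hypothetical testcase sketch -- the dump string in dg-final is an
   assumption, not taken from a real forwprop dump.  */
/* { dg-do compile } */
/* { dg-options "-O -fdump-tree-forwprop1" } */

int f (int x, int y)
{
  int t1 = x + y;
  int t2 = t1 - y;   /* (x + y) - y should simplify to x */
  return t2;
}

/* One backslash must survive the tcl parser, so the source spells
   escaped parens as \\( \\) and a digit class as \[0-9\].  */
/* { dg-final { scan-tree-dump "x_\[0-9\]+\\(D\\)" "forwprop1" } } */
```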
RE: Why is this not optimized?
Thanks for the reply.  I will look at the patch.  As far as the cost
is concerned, I think fwprop doesn't really need to understand the
pipeline model.  As long as the rtx cost after optimization is less
than before optimization, I think it is good enough.  Of course, it
won't be better in every case, but it should be better in general.

Cheers,
Bingfeng

-----Original Message-----
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: 15 May 2014 06:59
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Why is this not optimized?

On Wed, May 14, 2014 at 9:14 PM, Bingfeng Mei wrote:
> Hi,
> I am looking at some code of our target, which is not optimized as
> expected.  For the following RTX, I expect the source of insn 17 to
> be propagated into insn 20, and insn 17 eliminated as a result.  On
> our target, it will become a predicated xor instruction instead of
> two instructions.  Initially, I thought the fwprop pass should do
> this.
>
> (insn 17 16 18 3 (set (reg/v:HI 102 [ crc ])
>         (xor:HI (reg/v:HI 108 [ crc ])
>             (const_int 16386 [0x4002]))) coremark.c:1632 725 {xorhi3}
>      (nil))
> (insn 18 17 19 3 (set (reg:BI 113)
>         (ne:BI (reg:QI 101 [ D.4446 ])
>             (const_int 1 [0x1]))) 1397 {cmp_qimode}
>      (nil))
> (jump_insn 19 18 55 3 (set (pc)
>         (if_then_else (ne (reg:BI 113)
>                 (const_int 0 [0]))
>             (label_ref 23)
>             (pc))) 1477 {cbranchbi4}
>      (expr_list:REG_DEAD (reg:BI 113)
>         (expr_list:REG_BR_PROB (const_int 7100 [0x1bbc])
>             (expr_list:REG_PRED_WIDTH (const_int 1 [0x1])
>                 (nil))))
>  -> 23)
> (note 55 19 20 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 20 55 23 4 (set (reg:HI 112 [ crc ])
>         (reg/v:HI 102 [ crc ])) 502 {fp_movhi}
>      (expr_list:REG_DEAD (reg/v:HI 102 [ crc ])
>         (nil)))
> (code_label 23 20 56 5 2 "" [1 uses])
>
> But it can't.  First, propagate_rtx_1 will return false because
> PR_CAN_APPEAR is false and the following code is executed.
>
>   if (x == old_rtx)
>     {
>       *px = new_rtx;
>       return can_appear;
>     }
>
> Even if I force PR_CAN_APPEAR to be set in flags, fwprop still won't
> go ahead in try_fwprop_subst because old_cost is 0 (REG-only rtx),
> and set_src_cost (SET_SRC (set), speed) is bigger than 0.  So the
> change is deemed not profitable, which is not correct IMO.

Pass fwprop is too conservative with respect to propagation
opportunities outside of memory references; it just gives up at many
places.  Also, as in your case, it doesn't seem to take into
consideration that the original insn can be removed after propagation.
Wei Mi once sent a patch re-implementing the fwprop pass at
https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00617.html .

I also did some experiments and worked out a local patch doing similar
work to handle cases exactly like yours.  The problem is that even
though one instruction can be saved (as in your case), it's not always
good, because it tends to generate more complex instructions, and such
insns are somehow more vulnerable to pipeline hazards.  Unfortunately,
it's kind of impossible for fwprop to understand the pipeline risk.

Thanks,
bin

> If fwprop is not the place to do this optimization, where should it
> be done?  I am working on up-to-date GCC 4.8.
>
> Thanks,
> Bingfeng Mei

--
Best Regards.
Re: Live range shrinkage in pre-reload scheduling
On Thu, May 15, 2014 at 8:36 AM, Maxim Kuvyrkov wrote:
> On May 15, 2014, at 6:46 PM, Ramana Radhakrishnan wrote:
>>
>>> I'm not claiming it's a great heuristic or anything.  There's bound
>>> to be room for improvement.  But it was based on "reality" and real
>>> results.
>>>
>>> Of course, if it turns out not to be a win for ARM or s390x any
>>> more then it should be disabled.
>>
>> The current situation that Kyrill is investigating is a case where
>> we notice the first scheduler pass being a bit too aggressive with
>> creating ILP opportunities with the A15 scheduler that causes
>> performance differences with not turning on the first scheduler
>> pass vs using the defaults.
>
> Charles has a work-in-progress patch that fixes a bug in
> SCHED_PRESSURE_MODEL that causes the above symptoms.  The bug causes
> the 1st scheduler to unnecessarily increase live ranges of pseudo
> registers when there are a lot of instructions in the ready list.

Is this something that you've seen show up in general integer code as
well?  Do you or Charles have an example for us to look at?  I'm not
sure what "a lot of instructions in the ready list" really means here.

The specific case Kyrill's been looking into is Dhrystone Proc_8 when
tuned for a Cortex-A15 with neon and float-abi=hard, but I am not sure
if that has "too many instructions" :) .

Kyrill, could you also look into the other cases we have from SPEC2k
where we see this as well and come back with any specific testcases
that Charles / Richard could also take a look into.

> Charles, can you finish your patch in the next several days and post
> it for review?

I think we'll await this but we'll go look into some of the
benchmarks.

Ramana
Re: [GSoC] writing test-case
>> So I came along the need to add another predicate for REAL_CST
>> leaves, which makes me wonder if we should support tree codes as
>> predicates.  Thus instead of writing
>>
>> (match_and_simplify
>>   (plus (plus @0 INTEGER_CST_P@1) INTEGER_CST_P@2)
>>   (plus @0 (plus @1 @2)))
>>
>> write
>>
>> (match_and_simplify
>>   (plus (plus @0 INTEGER_CST@1) INTEGER_CST@2)
>>   (plus @0 (plus @1 @2)))
>>
>> We have conveniently chosen lower-case without _EXPR for expression
>> codes, thus the uppercase original form is available to be used as
>> a pre-defined predicate (the code generated is TREE_CODE (x) == ).
>>
>> Just an idea ...
>
> It appears to be good.  However, would we want predicates that are
> slightly more complex ?  For example, integer_onep @0 ?

In principle predicates are not different from expressions:

(plus @0 INTEGER_CST)

is similar to

(plus @0 (INTEGER_CST))

that is, if INTEGER_CST is treated as a zero-operand "expression".

> I was wondering if it would be a good idea to make the predicate an
> attribute of the operand (expr, capture) rather than keep the
> predicate as a separate operand ?  (since predicates are only used
> for checking the operand they are applied to)
>
> The hierarchy would be something like:
>
> struct predicate_operand : public operand
> {
>   predicate_operand (const char *ident_, enum operand::type ty)
>     : operand (ty), ident (ident_) {}
>   const char *ident;
>   virtual void gen_gimple_match (FILE *f, const char *, const char *);
>   virtual void gen_gimple_transform (FILE *, const char *) = 0;
> };
>
> struct capture : public predicate_operand
> {
>   capture (const char *where_, operand *what_, const char *pred = 0)
>     : predicate_operand (pred, OP_CAPTURE), where (where_), what (what_) {}
>   const char *where;
>   operand *what;
>   virtual void gen_gimple_match (FILE *f, const char *, const char *);
>   virtual void gen_gimple_transform (FILE *f, const char *);
> };
>
> And make capture::gen_gimple_match call
> predicate_operand::gen_gimple_match if ident is non-null.
> similarly for expr.

I don't understand.  predicate is already derived from operand, and so
is capture.  How would you write

(plus @0 integer_zerop)

differently?  Here 'integer_zerop' is an operand of the plus
expression.

To me predicate (and capture without expression or predicate) differs
from expression in that a predicate is clearly a leaf of the
expression tree while we have to recurse into expression operands.
Now, if we want to support applying predicates to the midst of an
expression, like

(plus predicate(minus @0 @1) @2)
(...)

then this would no longer be true.  At the moment you'd write

(plus (minus@3 @0 @1) @2)
if (predicate (@3))
(...)

which makes it clearer IMHO (with the decision tree building you'd
apply the predicates after matching the expression tree anyway I
suppose, so code generation would be equivalent).

One exception may be types of (sub-)expressions, which we can also
check cheaply (and eventually build into the decision tree) (cheaply
== doesn't involve function calls).  So for if-expressions like

(match_and_simplify
  (plus @0 (negate @1))
  if (!TYPE_SATURATING (TREE_TYPE (@0)))
  (minus @0 @1))

we'd ideally want to group all patterns with the same type-related
if-expression to improve the matching process.  That eventually asks
for sth like

if (!TYPE_SATURATING (TREE_TYPE (@0)))
{
  (match_and_simplify
    (plus @0 (negate @1))
    (minus @0 @1))
  (match_and_simplify
    (plus (negate @0) @1)
    (minus @1 @0))
}

thus have multiple patterns with a common "outer" if-expression.  If
you look into fold-const.c we have this kind of grouping used there,
for example to group cases valid only for !FLOAT_TYPE_P or
INTEGRAL_TYPE_P.

IMHO we can think about this when the individual if-expressions become
too many.  I don't like the idea of applying predicates to non-leafs
in the match expression itself.

Richard.
Re: [GSoC] writing test-case
>>> * I have written test-cases for patterns in match.pd (attached
>>> patch), which result in PASS.  Could you review them for me ?
>>
>> Sure.  It looks good to me, though you can look at the changed
>> match-1.c testcase on the branch where I've changed the matching to
>> look for the debug output the forwprop pass dumps with
>> -fdump-tree-forwprop1-details; that makes sure no other pass before
>> did the transform (you can also move the individual dg-final lines
>> after the testcase function to more easily associate them with a
>> function).
>
> Thanks, modified the patch to scan for "gimple_match_and_simplified"
> instead.
>
>> At some point the testcase should be split up as well.
>>
>> How do you manage your sources at the moment?  Just a svn checkout
>> of the branch with local modifications?
>
> Yes.

Ok.  I have checked in the patch with the following changelog entry
(applied to gcc/ChangeLog.mas):

2014-05-15  Prathamesh Kulkarni

	* match.pd: Add more cases from tree-ssa-forwprop.c.

	testsuite/
	* gcc.dg/tree-ssa/match-2.c: New testcase.

Richard.
GCC 4.8.3 Status Report, branch frozen for release (2014-05-15)
Status
======

The 4.8 branch is now frozen as I am preparing a first release
candidate for 4.8.3.  All patches to the branch now require explicit
approval from release managers.

Previous Report
===============

https://gcc.gnu.org/ml/gcc/2014-05/msg00026.html
GCC 4.8.3 Release Candidate available from gcc.gnu.org
GCC 4.8.3 Release Candidate available from gcc.gnu.org

The first release candidate for GCC 4.8.3 is available from

  ftp://gcc.gnu.org/pub/gcc/snapshots/4.8.3-RC-20140515

and shortly from its mirrors.  It has been generated from SVN revision
210453.

I have so far bootstrapped and tested the release candidate on
x86_64-linux.  Please test it and report any issues to bugzilla.

If all goes well, I'd like to release 4.8.3 on Friday the 23rd.
Re: GCC 4.8.3 Status Report, branch frozen for release (2014-05-15)
On Thu, May 15, 2014 at 5:27 AM, Richard Biener wrote:
>
> Status
> ======
>
> The 4.8 branch is now frozen as I am preparing a first release
> candidate for 4.8.3.  All patches to the branch now require explicit
> approval from release managers.

Please hold off on GCC 4.8.3.  powerpc-linux has a potentially serious
ABI incompatibility issue related to the HTM support backported to the
4.8 branch.

Thanks, David
GCC driver to "Compile twice, score the assembly, choose the best"?
Hi, fellow GCC developers!

I was wondering if the "gcc" driver could be made to invoke "cc1"
twice, with different flags, and then just keep the better of the two
.s files that come out?

I'm sure this is not a new idea, but I'm not aware of anything being
done in this area, so I've made this post to gather your views. :)

The kinds of flags I am thinking could be toggled are register
allocation and instruction scheduling ones, since it's very hard to
find one-size-fits-all settings there and we don't really want the
user to have to know the right ones.

Obviously, compilation time will go up, but the run-time benefits
could be huge.

What are your thoughts?  What work in this area have I failed to dig
up in my limited research?

Many thanks,
Ian
Re: GCC driver to "Compile twice, score the assembly, choose the best"?
On Thu, May 15, 2014 at 1:46 PM, Ian Bolton wrote:
> Hi, fellow GCC developers!
>
> I was wondering if the "gcc" driver could be made to invoke "cc1"
> twice, with different flags, and then just keep the better of the
> two .s files that comes out?

I'd be interested in your .s comparison tool that decides which one is
better!

Richard.

> I'm sure this is not a new idea, but I'm not aware of anything being
> done in this area, so I've made this post to gather your views. :)
>
> The kinds of flags I am thinking could be toggled are register
> allocation and instruction scheduling ones, since it's very hard to
> find one-size-fits-all there and we don't really want to have the
> user depend on knowing the right one.
>
> Obviously, compilation time will go up, but the run-time benefits
> could be huge.
>
> What are your thoughts?  What work in this area have I failed to dig
> up in my limited research?
>
> Many thanks,
> Ian
RE: GCC driver to "Compile twice, score the assembly, choose the best"?
Thanks for the quick response. > On Thu, May 15, 2014 at 1:46 PM, Ian Bolton wrote: > > Hi, fellow GCC developers! > > > > I was wondering if the "gcc" driver could be made to invoke > > "cc1" twice, with different flags, and then just keep the > > better of the two .s files that comes out? > > I'd be interested in your .s comparison tool that decides which one > is better! > Well, yes, that's the hard part and it could be done a number of ways. I think it would make a good contest actually, or summer coding project. Or at least a nice subject for brainstorming at the GNU Cauldron. :) Cheers, Ian
Re: GCC driver to "Compile twice, score the assembly, choose the best"?
"Ian Bolton" writes: > I was wondering if the "gcc" driver could be made to invoke > "cc1" twice, with different flags, and then just keep the > better of the two .s files that comes out? How do you define "better"? Andreas. -- Andreas Schwab, SUSE Labs, sch...@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different."
RE: GCC driver to "Compile twice, score the assembly, choose the best"?
> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of Ian Bolton > Sent: 15 May 2014 12:47 > To: gcc@gcc.gnu.org > Subject: GCC driver to "Compile twice, score the assembly, choose the > best"? > > Hi, fellow GCC developers! > > I was wondering if the "gcc" driver could be made to invoke "cc1" > twice, with different flags, and then just keep the better of the two > .s files that comes out? > > I'm sure this is not a new idea, but I'm not aware of anything being > done in this area, so I've made this post to gather your views. :) > > The kinds of flags I am thinking could be toggled are register > allocation and instruction scheduling ones, since it's very hard to > find one-size-fits-all there and we don't really want to have the user > depend on knowing the right one. > > Obviously, compilation time will go up, but the run-time benefits > could be huge. > > What are your thoughts? What work in this area have I failed to dig > up in my limited research? This looks like something that should be done outside of cc1. Other people have thought about it and what you suggest is exactly the example found in OpenTuner (http://opentuner.org/) announcement paper: http://dspace.mit.edu/handle/1721.1/81958 The difference is that you don't compare .s files but instead choose the metric based on the execution of the program on a test bench. HTH, Paulo Matos
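For concreteness, here is a minimal shell sketch of the driver idea discussed in this thread. Two caveats, both flagged by the thread itself: in reality the two candidate .s files would come from two gcc/cc1 runs with different scheduling or RA flags, and the scoring metric used here (fewer instruction lines) is a deliberately crude placeholder for a real comparison tool or, as Paulo suggests, execution-time measurement on a test bench. The file names and the `score` helper are invented for the example; the candidate files are generated inline so the script is self-contained.

```shell
#!/bin/sh
# Sketch of "compile twice, score the assembly, keep the best".
# a.s and b.s stand in for the outputs of two compiler invocations
# with different flags; they are generated inline so the script is
# self-contained.
cat > a.s <<'EOF'
    .text
    add w0, w0, w1
    add w0, w0, w2
    ret
EOF
cat > b.s <<'EOF'
    .text
    add w0, w0, w1
    ret
EOF

# Crude metric: count lines that are neither assembler directives
# nor labels, i.e. (roughly) the instruction count.
score () {
    grep -c -v -e '^[[:space:]]*\.' -e ':[[:space:]]*$' "$1"
}

# Keep whichever candidate scores lower (ties go to the first).
if [ "$(score a.s)" -le "$(score b.s)" ]; then
    cp a.s best.s
else
    cp b.s best.s
fi
echo "kept a $(score best.s)-instruction version"
```

In real use the two heredocs would be replaced by something like `gcc -S -O2 -fschedule-insns -o a.s file.c` and `gcc -S -O2 -fno-schedule-insns -o b.s file.c`, and `score` by whatever metric the brainstorm settles on.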
Re: Live range shrinkage in pre-reload scheduling
On 05/15/2014 02:46 AM, Ramana Radhakrishnan wrote: > On Wed, May 14, 2014 at 5:38 PM, Richard Sandiford > wrote: >> Vladimir Makarov writes: >>> On 2014-05-13, 6:27 AM, Kyrill Tkachov wrote: Hi all, In haifa-sched.c (in rank_for_schedule) I notice that live range shrinkage is not performed when SCHED_PRESSURE_MODEL is used, and the comment mentions that it results in much worse code. Could anyone elaborate on this? Was it just empirically noticed on x86_64? >>> It was empirically noticed on SPEC2000. Practice is the single criterion for heuristic optimizations: a new heuristic optimization might look promising, but the reality can be quite different. > Vlad - Was that based on experiments on x86_64 ? > Yes, I benchmarked x86 and x86-64. I believe this pass can help when we don't use the 1st insn scheduler (which is the x86/x86-64 case). If the 1st insn scheduler is profitable, I guess it is better to use register-pressure insn scheduling than live-range shrinkage + insn scheduling (even with register pressure). Even for x86/x86-64 the improvement from live-range shrinkage was quite small; therefore it is not a default optimization. There is a set of optimizations in GCC which improve some cases and worsen others in a practically equal number of cases. They would be good candidates for machine-learning option selection (e.g. the MILEPOST project), because it is hard for a human to predict when they help (if it were easy, we could switch these optimizations on only for such cases). I guess somebody could continue work on improving live-range shrinkage on the scheduler code base. Maybe there are better approaches to mine here.
Re: Live Range Splitting in Integrated Register Allocator
On 05/15/2014 03:28 AM, Ajit Kumar Agarwal wrote: > > On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote: > >> Hello All: >> >> I am planning to implement live range splitting for the following cases in the Integrated Register Allocator. >> >> Consider a live range that spans from an outer region into the inner region of a loop: it is live-in at the entry of the loop header and live-out at the loop exit, but has no references inside the loop. Such live ranges lead to suboptimal spills and fetches inside the loop by conflicting with the shorter live ranges local to the loop. >> >> Call such a live range L1. L1 can be split at the loop boundary by inserting a store at the loop header and a load at the loop exit. This makes L1 conflict less with the live ranges local to the loop region, reducing spills and fetches inside the loop. >> >> From the code and documentation of the Integrated Register Allocator, my understanding is as follows. L1 is live in the outer region but has no references inside the loop region. The pseudo v for L1 is assigned two allocnos, v1 for the outer region and v2 for the inner loop region; this allows information to be accumulated from the inner loop region to the outer region. >> >> Will the current Integrated Register Allocator consider live range L1 as live both inside the loop and in the outer region? If yes, it will conflict with the live ranges local to the loop region, leading to spills and fetches inside the loop. If the two allocnos are created, v1 for the outer region and v2 for the inner region, then v2 will conflict with the local live ranges inside the loop region and v1 with the live ranges of the outer region. Is this how live range splitting at the loop boundary is handled for a live range that spans the loop without being referenced in it? >> >> If such cases are not considered in the Integrated Register Allocator, it would be useful to implement them in IRA; the microblaze target would benefit. >> >> Please let me know what you think. >>> Allocno v2, corresponding to the live range inside the loop, has a very small spill cost, so it will be spilled if we still need registers for pseudos local to the loop. If allocno v1, corresponding to the live ranges outside *and* inside the loop, gets a hard register, we will have the live range splitting you propose. So I do not see a necessity for the optimization you propose. > If allocno v1 doesn't get a hard register and becomes a spill candidate -- being live across the loop but not referenced inside it -- it will lead to spills inside the loop, which become a performance bottleneck. Is this case also handled in IRA? With my proposal, the load and store would be placed outside the loop instead of inside it. > If v1 does not get a hard register (and most probably v2 too), then there will be no additional insns, because v1 and v2 most probably get the same memory. If v1 gets a hard register and v2 does not, then we have a store on the loop entry and loads on the loop exits (the store/loads are not inside the loop). >>> Moreover, my experience shows that making a lot of explicit transformations (e.g. the proposed splitting), even if we have transformations to undo them (e.g. coalescing), results in worse code. Explicit transformations should be as minimal as possible during RA in a good register allocator. So I guess the optimization you propose would actually result in worse code, although I might be wrong, because it is always hard to predict the result of heuristic optimizations. > There is a paper, "Improved Passive Splitting" (2005) by Keith D. Cooper and Jason Eckhardt, which handles splitting at loop boundaries for live ranges that span the loop without being referenced. The paper uses heuristics and claims to reduce spills and fetches for the case I propose, and to generate better code. > I read this article long ago (actually, I worked with Jason for a few years until he left us for graduate school at Rice University). Usually heuristic optimizations (including RA) are very sensitive to the compiler environment and overall design, so the effect of an algorithm can be quite different in a different compiler. Another analogous approach is Peter Bergner's. Both are based on the classical Chaitin-Briggs RA, which is good for regular register-file architectures. IRA is different: it is a regional RA.
RE: Live Range Splitting in Integrated Register Allocator
Thanks Vladimir for the clarification. Thanks & Regards Ajit -Original Message- From: Vladimir Makarov [mailto:vmaka...@redhat.com] Sent: Thursday, May 15, 2014 8:39 PM To: Ajit Kumar Agarwal; gcc@gcc.gnu.org Cc: Michael Eager; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Live Range Splitting in Integrated Register Allocator [quote of the previous message snipped]
Re: [GSoC] writing test-case
Prathamesh Kulkarni writes: > +/* { dg-final { scan-tree-dump "gimple_match_and_simplified to \[^\n\r\]*= > x_\\d\+\\(D\\) - y_\\d\+\\(D\\)" "forwprop1" } } */ No need to quote +, it's not special to tcl. Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
gcc-4.8-20140515 is now available
Snapshot gcc-4.8-20140515 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20140515/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 4.8 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch revision 210479 You'll find: gcc-4.8-20140515.tar.bz2 Complete GCC MD5=5a0c9bb6eda9ba9dcc2253b0654fd7dc SHA1=64b835f56bce6f86c278b81b48f86c6b0ec0c9b1 Diffs from 4.8-20140508 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-4.8 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.