Re: Scheduling automaton question
Hi, According to me at this moment the scheduler does not support your needs. I was confronted with a similar problem as yours and I solved it by implementing the TARGET_SCHED_DFA_NEW_CYCLE hook. Inside of the function which supports this hook I choose/set the insn reservation that makes possible to fit as many other insn as possible in the same cycle. During this process I also update the ready list with insns that become ready as a result of scheduling the current insn ( like in your example -insns that are anti-dependent on the current insn and which therefore can be scheduled in the current cycle). Thus the best insn reservation makes possible scheduling antidependent insns of cost zero in the same cycle by avoiding resource conflicts. Alex few changes in the gcc mainline sources --- On Fri, 2/11/11, Bernd Schmidt wrote: > From: Bernd Schmidt > Subject: Scheduling automaton question > To: "GCC List" > Cc: "Vladimir N. Makarov" > Date: Friday, February 11, 2011, 2:33 PM > Suppose I have two insns, one > reserving (A|B|C), and the other reserving > A. I'm observing that when the first one is scheduled in an > otherwise > empty state, it reserves the A unit and blocks the second > one from being > scheduled in the same cycle. This is a problem when there's > an > anti-dependence of cost 0 between the two instructions. > > Vlad - two questions. Is this behaviour what you would > expect to happen, > and how much work do you think would be involved to fix it > (i.e. make > the first one transition to a state where we can still > reserve any two > out of the three units)? > > > Bernd >
Re: Question about the difference between two instruction scheduling passes
> Gcc only does this work in the second pass, but what's the > point? Is it wrong or just not necessary in the first sched > pass? Regardless of the target architecture from the correctness point of view sched1 can be disabled. sched1 has as purpose shortening live ranges. Short live ranges allow the register allocation to: 1. generate less spills and 2. also to avoid useless live ranges splitting. The second point impacts the register renaming phase which allows sched2 phase to extract more ilp.
Re: IRA undoing scheduling decisions
> With 4.4, IRA happens to reuse the same register for both pseudos, so > sched2 is hand tied and cannot schedule them back again for us. I can imagine compiling other programs for which preserving the 4.3 allocation will induce performance degradation due to spilling. The register allocator tries to minimize the number of spills without taking into account the ILP implications (i.e., creating extra false dependencies). Perhaps one possible way to solve the problem would be to analyze why the register rename phase (which is responsible for spreading the registers) does not produces 2 registers. --- On Wed, 8/26/09, Peter Bergner wrote: > From: Peter Bergner > Subject: Re: IRA undoing scheduling decisions > To: "Charles J. Tabony" > Cc: gcc@gcc.gnu.org > Date: Wednesday, August 26, 2009, 11:47 PM > On Mon, 2009-08-24 at 23:56 +, > Charles J. Tabony wrote: > > I am seeing a performance regression on the port I > maintain, and I would appreciate some pointers. > > > > When I compile the following code > > > > void f(int *x, int *y){ > > *x = 7; > > *y = 4; > > } > > > > with GCC 4.3.2, I get the desired sequence of > instructions. I'll call it sequence A: > > > > r0 = 7 > > r1 = 4 > > [x] = r0 > > [y] = r1 > > > > When I compile the same code with GCC 4.4.0, I get a > sequence that is lower performance for my target > machine. I'll call it sequence B: > > > > r0 = 7 > > [x] = r0 > > r0 = 4 > > [y] = r0 > > This is caused by update_equiv_regs() which IRA inherited > from local-alloc.c. > Although with gcc 4.3 and earlier, you don't see the > problem, it is still there, > because if you look at the 4.3 dumps, you will see that > update_equiv_regs() > unordered them for us. What is saving us is that > sched2 reschedules them > again for us in the order we want. With 4.4, IRA > happens to reuse the same > register for both pseudos, so sched2 is hand tied and > cannot schedule them > back again for us. > > Looking at update_equiv_regs(), if I disable the > replacement for regs > that are local to one basic block (patch below) like it > existed before > John Wehle's patch way back in Oct 2000: > > http://gcc.gnu.org/ml/gcc-patches/2000-09/msg00782.html > > then we get the ordering we want. Does anyone know > why John removed > that part of the test in his patch? Thoughts anyone? > > > Peter > > > Index: ira.c > === > --- ira.c (revision 15) > +++ ira.c (working copy) > @@ -2510,6 +2510,7 @@ update_equiv_regs (void) > > calls. */ > > if > (REG_N_REFS (regno) == 2 > + > && REG_BASIC_BLOCK (regno) < NUM_FIXED_BLOCKS > > && (rtx_equal_p (x, src) > > || ! equiv_init_varies_p (src)) > > && NONJUMP_INSN_P (insn) > > > >
question about DSE
Dear all, Im writing to you regarding the dead store elimination (dse) which runs after register allocation. Apparently dse removes wrongly the following store (present in bb2): (insn 374 47 52 2 test.c:107 (set (mem/c:SI (plus:PSI (reg/f:PSI 55 ptr15) (const_int 96 [0x60])) [19 fac_iter+0 S4 A32]) (reg/v:SI 16 r16 [orig:161 step109 ] [161])) 48 {si_indexed_store_incl_ra} (nil)) despite being consumed (in bb3) by the following 2 loads: (insn 380 71 64 3 test.c:112 (set (reg:HI 1 r1) (mem:HI (plus:PSI (reg/f:PSI 55 ptr15) (const_int 96 [0x60])) [0 S2 A16])) 12 {load} (nil)) (insn 382 346 65 3 test.c:112 (set (reg:HI 5 r5) (mem:HI (plus:PSI (reg/f:PSI 55 ptr15) (const_int 98 [0x62])) [0 S2 A16])) 12 {load} (nil)) Can anyone point what may be the problem? As you can see the store is SI while the loads are HI. While looking to the comments from dse.c I get to the following remark: " There are three cases where dse falls short: a) Reload sometimes creates the slot for one mode of access, and then inserts loads and/or stores for a smaller mode. " Does it mean that such cases are not treated properly by dse? I observed that if I run with the flag -fno-strict-aliasing the wrongly removed store is no longer removed and the code is runs correctly. Im wondering does the dse after register allocation make use of type based alias analysis? reagards, Alex
Re: question about DSE
Hi Michael, > My assumption would be these two split loads of HImode are > generated by your backend from a given SImode MEM. Indeed your asumption is right. Bellow I have a mulsi3 expand in which I generate insns of mode HI. operands[1] gets spilled: in the produced BB as a single SI store while in the consumer BB as two separte HI loads (see a_hi and a_lo). (define_expand "mulsi3" [(match_operand: SI 0 "general_register_operand" "") (match_operand: SI 1 "general_register_operand" "") (match_operand: SI 2 "general_register_operand" "")] "" "{ rtx buff = gen_reg_rtx(SImode); rtx a_lo = gen_rtx_SUBREG(HImode, operands[1], 0); rtx a_hi = gen_rtx_SUBREG(HImode, operands[1], 2); rtx b_lo = gen_rtx_SUBREG(HImode, operands[2], 0); rtx b_hi = gen_rtx_SUBREG(HImode, operands[2], 2); rtx r_hi = gen_rtx_SUBREG(HImode, buff, 2); emit_insn(gen_umulhisi3(buff, a_lo, b_lo)); emit_insn(gen_machi3(r_hi, a_hi, b_lo, r_hi)); emit_insn(gen_machi3(r_hi, a_lo, b_hi, r_hi)); emit_move_insn(operands[0], buff); DONE; }") > If so, you need > to make sure to copy the MEM_ALIAS_SET, at least for spill slots (better > for everything) into the newly generated HImode mems. For spill slots > it's not enough to set it to zero. I get your point but as the generation SI->HI takes place in the expand it doesnt help to copy the MEM_ALIAS_SET becasue the operands are pseudo regs. However, to get a correct implementation I did the following. Instead of doing the split in the expand (as show above), I made use of the following define_insn_and_split: (define_expand "mulsi3" [(parallel [(set (match_operand:SI 0 "register_operand" "") (mult:SI (match_operand:SI 1 "register_operand" "") (match_operand:SI 2 "nonmemory_operand" ""))) (clobber (match_operand:SI 3 "register_operand" "")) ] ) ] "" "{ operands[3] = gen_reg_rtx(SImode); }") (define_insn_and_split "*mulsi3" [(parallel[(set (match_operand:SI 0 "register_operand" "=d,d") (mult:SI (match_operand:SI 1 "register_operand" "d,d") (match_operand:SI 2 "nonmemory_operand" "d,I"))) (clobber (match_operand:SI 3 "register_operand" "=d,d")) ])] "" "#" "reload_completed" [(clobber (const_int 0))] "{ rtx a_lo = gen_rtx_SUBREG(HImode, operands[1], 0); rtx a_hi = gen_rtx_SUBREG(HImode, operands[1], 2); rtx b_lo = gen_rtx_SUBREG(HImode, operands[2], 0); rtx b_hi = gen_rtx_SUBREG(HImode, operands[2], 2); rtx r_hi = gen_rtx_SUBREG(HImode, operands[3], 2); emit_insn(gen_umulhisi3(operands[3], a_lo, b_lo)); emit_insn(gen_machi3(r_hi, a_hi, b_lo, r_hi)); emit_insn(gen_machi3(r_hi, a_lo, b_hi, r_hi)); emit_move_insn(operands[0], operands[3]); DONE; }") By using this define_insn_and_split with the predicate "reload_completed" I ensure that the register allocation takes place on the operands of the "mulsi3" instruction as defined by the define_expand construct. In this way instead of the two separate HI loads (from my previouse mail) I get only one SI load which aliases whith the SI store. In consequence the SI store is no longer removed. 1.What do you think about this implementation? using define_insn_and_split 2.Is is true that in the define_expand constructs I should avoid inducing subregs? thanks, Alex
bug linear loop transforms
Im writing to you regarding a possible bug in linear loop transfor. The bug can be reproduce by compiling the attached c file with gcc.4.5.0 (20100204, 20100325) on x86 machine. The compiler flags that reproduce the error are: -O2 -fno-inline -fno-tree-ch -ftree-loop-linear If the compiler is run with: -O2 -fno-inline -fno-tree-ch -fno-tree-loop-linear then the produced code is correct. #include int test (int n, int *a) { int i, j; for (i = 0; i < n; i++) { for (j = 0; j < n; j++) { a[j] = i + n; } } if (a[0] != 31 || i + n - 1 != 31) printf("incorrect %d %d \n", a[0], i+n-1); return 0; } int main (void) { int a[16]; test (16, a); return 0; }
generate assembly mnemonic depending the resource allocation
Hi all, Im building a gcc target for a vliw machine that can execute the same instruction on different resources (slots) and depending on which resources are allocate the instruction must have a different mnemonic. Is it possible in gcc to have for the same define_insn constraints (depending on the allocated architecture resources) different assembly instructions? Here is an example: Consider the following addSI RTL pattern: (define_insn "addsi3" [(set (match_operand:SI 0 "general_register_operand" "=d") (plus:SI (match_operand:SI 1 "general_register_operand" "d") (match_operand:SI 2 "general_register_operand" "d")))] "" "add %0,%1,%2%" [(set_attr "type" "alu") (set_attr "mode" "SI") (set_attr "length" "1")]) On my target machine "alu" is a reservation that occupies one of the following 3 slots: "slot1|slot2|slot3" and, I need to generate assembly code with different mnemonic depending on which slot the instruction was scheduled: add-slot1 %0,%1,%2% // if scheduled on slot 1 add-slot2 %0,%1,%2% // if scheduled on slot 2 add-slot3 %0,%1,%2% // if scheduled on slot 3 Alex
RE: generate assembly mnemonic depending the resource allocation
Hi, Thanks for answer but Im not able to make it work; I dont know how to extract the specific slot-usage (out of a number of an alternatives) out of the RTL insn description. Is there some get_attr_ for this? --- On Wed, 12/3/08, Bingfeng Mei <[EMAIL PROTECTED]> wrote: > From: Bingfeng Mei <[EMAIL PROTECTED]> > Subject: RE: generate assembly mnemonic depending the resource allocation > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>, "gcc@gcc.gnu.org" > > Date: Wednesday, December 3, 2008, 3:00 PM > You can use C statements to return a modified template > string such like > (define_insn "addsi3" > [(set (match_operand:SI 0 > "general_register_operand" "=d") > (plus:SI (match_operand:SI 1 > "general_register_operand" "d") > (match_operand:SI 2 > "general_register_operand" "d")))] > "" > { > switch (slot-used){ > case 0: > return "add-slot0, %0, %1, %2"; > case 1: > return "add-slot1, %0, %1, %2"; > case 2: > return "add-slot1, %0, %1, %2"; > } > } > [(set_attr "type" "alu") >(set_attr "mode" "SI") >(set_attr "length" "1")]) > > > -Original Message- > > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On > > Behalf Of Alex Turjan > > Sent: 03 December 2008 10:34 > > To: gcc@gcc.gnu.org > > Subject: generate assembly mnemonic depending the > resource allocation > > > > Hi all, > > Im building a gcc target for a vliw machine that can > execute > > the same instruction on different resources (slots) > and > > depending on which resources are allocate the > instruction > > must have a different mnemonic. Is it possible in gcc > to have > > for the same define_insn constraints (depending on the > > > allocated architecture resources) different assembly > instructions? > > > > Here is an example: > > Consider the following addSI RTL pattern: > > (define_insn "addsi3" > > [(set (match_operand:SI 0 > "general_register_operand" "=d") > > (plus:SI (match_operand:SI 1 > "general_register_operand" "d") > > (match_operand:SI 2 > > "general_register_operand" "d")))] > > "" > > "add %0,%1,%2%" > > [(set_attr "type" "alu") > >(set_attr "mode" "SI") > >(set_attr "length" "1")]) > > > > On my target machine "alu" is a reservation > that occupies one > > of the following 3 slots: > "slot1|slot2|slot3" and, I need to > > generate assembly code with different mnemonic > depending on > > which slot the instruction was scheduled: > > > > add-slot1 %0,%1,%2% // if scheduled on slot 1 > > add-slot2 %0,%1,%2% // if scheduled on slot 2 > > add-slot3 %0,%1,%2% // if scheduled on slot 3 > > > > Alex > > > > > > > > > > > > > > > > > > > > > > > > > > > >
Re: generate assembly mnemonic depending the resource allocation
> Keep > track of the instruction slot, and set some global variable > in that macro. I see what you mean but for my target architecture the slots are not identical in sense that they are claiming totaly different resources. This means that in order to decide at a certain cycle which mnemonic to instantiate I need to guarantee that there are no pipeline conflicts. Im now looking the in gcc mainline to find when an instruction (with alternative resource utilizations) is issued which of the resources are used.
Re: generate assembly mnemonic depending the resource allocation
Dear Ian, I use as atribute of one of the instructions the following define_insn_reservation: (define_insn_reservation "vmove" 1 "vector_type" "vmove") "c_valu_1|c_vlsu_1") As you can see vmove has two alternative reservation : c_valu_1 or c_vlsu_1, where c_valu_1 is defined as follows: (define_reservation "c_valu_1" "nothing, r_valu_rp, nothing, nothing, nothing, nothing") and c_vlsu_1: (define_reservation "c_vlsu_1" "nothing, r_vlsu, nothing, nothing, nothing, nothing") Having an RTX after sched2, how can I decide which of the two alternatives (c_valu_1 or c_vlsu_1) the scheduler has selected? thanks for your help! Alex
query automaton
Hello, Some time ago I asked a question regarding the possibility to schedule an operation on alternative functional units (FUs) AND depending on the chosen FUs to generate a dedicated assembly mnemonic. To give a simple example suppose I have a move operation that can be issued on one of the 2 different FUs: (define_reservation "si_move""(q_salu, r_salu_rp,r_salu + r_salu_r_wp) |(q_smac, nothing, r_smac + r_smac_wp) ,nothing* 3") Depending on the occupied FUs one of the following assembly instructions are generated: move_salu or move_smac. To find out which FUs were occupied I defined 2 query_cpu_unit-s: (define_query_cpu_unit "q_salu" "aut1") (define_query_cpu_unit "q_smac" "aut1") ,which further on (during the sched2 hook TARGET_SCHED_DFA_NEW_CYCLE) are used to find out which of the 2 alternative reservations the automaton has taken. Based on this decision I add some information within the insn rtx which later on is used to dump "move_salu" or "move_smac". This approach was successful as long as the units claimed by "si_move" belong to one and the same automaton ( i.e., q_salu, q_smac, r_salu_rp, r_salu, r_salu_r_wp, r_smac, r_smac_wp belong to "aut1"). If I try to split "aut1" into (lets say 2) automatons I get generated incorrect code (i.e., the choice of choosing the proper unit is not correct anymore.) QUESTION: Is there such constraint that the units part of the alternatives of a reservation must belong to the same automaton?
Re: query automaton
Dear Vladimir, > Not really. There is no requirement for "the units > part of the alternatives of a reservation must belong to the > same automaton". Querying should also work in this > case because function cpu_unit_reservation_p checks all > automata for an unit reservation. Indeed it checks all automata but Im afraid that according to my pipeline description this check is not enough to guarantee a correct scheduling decision, e.g., suppose the following insn reservation: (define_reservation "move" "( (unit1_aut1, unit1_aut2) | (*) (unit2_aut1, unit2_aut2) ) ,where unitN_autM refers to unit N from automata M. In this case there are 2 automata. Now supose a scheduling state S made of the individual states of the two automatons S=. According to what I see happening in insn-automata.c (and target.dfa), from S_aut1 there is a transition for unit1_aut1 and from S_aut2 there is a transition for unit2_aut2. It seems that the automata do not communicate with each other. As a consequence, A scheduling decision which results in the resource reservation (unit1_aut1, unit2_aut2) would not be rejected, while it should. In my opinion, the current implementation sees the reservation defined in (*) As equivalent to the following one (define_reservation "move" "( (unit1_aut1| unit2_aut1) , (unit1_aut2| unit2_aut2) ) Which does not seem true to me. Is there a way for automatons to communicate so that the alternative (unit1_aut1, unit2_aut2) would be rejected? regards, Alex
Re: query automaton
Vladimir thanks for help! With respect to your answer I would like to point to you a possible way to solve the scheduling problem (which I think wont be too difficult to be implemented in genautomata). To explain it, I will make use of the “move insn” example pointed in the previous email. First of all I do not change anything in the 2 automatons (aut1 and aut2); they are correctly built by the genautomata. Next to the original > (define_reservation "move" "( (unit1_aut1, unit1_aut2) | >(unit2_aut1, unit2_aut2) > ) I define two separate fake move instructions (i.e., “move_fake1” and “move_fake2”) of which semantic does play any role; each of them having a unit reservation one of the alternatives of the “move”: (define_reservation "move_fake1" "( (unit1_aut1, unit1_aut2)), and (define_reservation "move_fake2" "( (unit2_aut1, unit2_aut2) ). Those two extra moves are then used during the sched2 target hook TARGET_SCHED_DFA_NEW_CYCLE to decide which of the move alternative unit reservations can be correctly claimed. Suppose the ready instruction is a “move”. First I make 2 copies of the current state: temp_state1 and temp_state2. Then I check which of the two fake moves ( “move_fake1” and “move_fake2”) with different unit occupation can be issued at the current state: cost_move_fake1 = internal_state_transition (insn_code_move_fake1, temp_state1); cost_move_fake2 = internal_state_transition (insn_code_move_fake2, temp_state2); Once I find the correct choice (suppose cost_move_fake1 <0), I set dfa_insn_codes[] of the ready instruction to the one of move_fake1. int uid = INSN_UID (ready); if (uid >= dfa_insn_codes_length) dfa_insn_code_enlarge (uid); dfa_insn_codes[uid] = internal_dfa_insn_code (insn_code_move_fake1); In this way when the scheduler calls internal_state_transition, the ready instruction has been already set its dfa_insn_codes to a unit reservation schedulable in the current state. regards, Alex --- On Tue, 3/3/09, Vladimir Makarov wrote: > From: Vladimir Makarov > Subject: Re: query automaton > To: atur...@yahoo.com > Cc: gcc@gcc.gnu.org > Date: Tuesday, March 3, 2009, 4:35 PM > Alex Turjan wrote: > > Dear Vladimir, > >> Not really. There is no requirement for "the > units > >> part of the alternatives of a reservation must > belong to the > >> same automaton". Querying should also work > in this > >> case because function cpu_unit_reservation_p > checks all > >> automata for an unit reservation. > > Indeed it checks all automata but Im afraid that > according to my pipeline description this check is not > enough to guarantee a correct scheduling decision, e.g., > > suppose the following insn reservation: > > (define_reservation "move" "( > (unit1_aut1, unit1_aut2) | (*) > > (unit2_aut1, > unit2_aut2) ) > > ,where unitN_autM refers to unit N from automata M. In > this case there are 2 automata. Now supose a scheduling > state S made of the individual states of the two automatons > S=. According to what I see happening > in insn-automata.c (and target.dfa), from S_aut1 there is a > transition for unit1_aut1 and from S_aut2 there is a > transition for unit2_aut2. > > It seems that the automata do not communicate with > each other. As a consequence, A scheduling decision which > results in the resource reservation (unit1_aut1, > unit2_aut2) would not be rejected, while it should. In my > opinion, the current implementation sees the reservation > defined in (*) > > As equivalent to the following one > > (define_reservation "move" "( > (unit1_aut1| unit2_aut1) , > > (unit1_aut2| > unit2_aut2) ) Which does not seem true to me. > > Is there a way for automatons to communicate so that > the alternative (unit1_aut1, unit2_aut2) would be rejected? > > > > > Last two days, I've been working on this issue. I > found that you are right, genautomata permits to generate > incorrect automata although I did not find that it is a case > for major current description which I checked. > > I found the issue is complicated. Genuatomata has already > a check for correct automata generations in > check_regexp_units_distribution but that is not enough. I > am working on formulation of general rules for correct > automata generation and implementation of their check. I > think it will take a week or two to do this and check how > does it work for all current automata description. > > Alex, thanks for pointing out this issue to me.
Re: [gSoc] [graphite] general plan for Automatic parallelization in Graphite
Are there any plans to move the partial unrolling phase from RTL to Tree-SSA? The move would benefit from better (or easier to implement) Tree-SSA alias analysis. Alex > > > > > --- On Wed, 4/22/09, Li Feng > wrote: > > > From: Li Feng > > Subject: [gSoc] [graphite] general plan for Automatic > parallelization in Graphite > > To: "GCC" > > Cc: "Tobias Grosser" > , "Sebastian" > , "Razya Ladelsky" > , konrad.trifuno...@gmail.com > > Date: Wednesday, April 22, 2009, 5:10 PM > > Hi, > > > > It's nice that the proposal 'Automatic > > parallelization in Graphite' > > is accepted. Which means I will be working with great > > Graphtie > > developers this summer, and trying to implement the > project > > . > > > > I have set up a blog for this project, which will > mainly > > about this > > project: 1. plans 2. what I have done 3. related > Graphite > > internals > > You can subscribe to it if you like: > > http://summergraphite.blogspot.com/ > > > > Here is a general plan for this project, keep you in > loop, > > and feel free to comment :) > > > > 1. Mark the innermost loop parallel [done] > > > > 2. Try to schedule autopar pass after Graphite, and > enable > > code generation if flag_graphite_force_parallel > is set > > - There should be some discussion with Razya > about > >her plan about the autopar part > > - But before that, I'll try to schedule > > autopar first > > > > 3. I may try to write testcases for the loops that > should > > be > > parallel, from simple to hard, and check > autopar's > > code > > generation part, make sure this works correctly > as we > > expected. > > - The testcases is important. There should be > some > >detailed discussion maybe with Sebastian > and > > Konrad. > >To see what kind of loop we can/decide to > > handle. > > - Check autopar's code generation with > >flag_graphite_force_parallel set with these > > testcases, > >report bugs if it goes wrong. > > > > 4. Try to write code for deciding if a loop can be > > parallel > > with data dependency test under this polyhedral > model. > > - Try to understand the interface of data > > dependency test > > - Write code, if data dependency success, mark > the > > loop parallel > > > > Cheers, > > Li
Re: Fwd: gcc instruction scheduling makes things worse?
> For data dependency cases, I do some jobs in > the adjust_cost target hook. Normally the scheduling takes into account the instruction latencies which you have specified: for load 2, for mul 4 and for alu 1. Why do you need to adjust the sched costs? Did you try simply without adjusting the costs? > I read THE GNU INSTRUCTION SCHEDULER written by > Michael D. Tiemann, This is old work. Meanwhile there is a new scheduler based on finite state machines. You can read in "The finite state automaton based pipeline hazard recognizer and instruction scheduler in GCC" by Vladimir Makarov.
scheduling question
Hi, During scheduling Im confronted with the fact that an instruction is moved from the ready list to queued with the cost 2, while according to my expectations the insn should have been moved to queued with cost 1. Did anybody experience similar problem? In case an insn is ready but can bot be schedled in the current cycle, is it correct (i.e. the generated code is correct) to move the insn to the queue list with cost 1 ?; no matter what it the value >=1 returned by state_transition. It seams to me that moving from the ready to queue list with cost >=1 is an optimization for compilation time.
Re: scheduling question
--- On Thu, 5/7/09, Maxim Kuvyrkov wrote: > From: Maxim Kuvyrkov > Subject: Re: scheduling question > To: atur...@yahoo.com > Cc: "Vladimir Makarov" , gcc@gcc.gnu.org > Date: Thursday, May 7, 2009, 1:01 PM > Alex Turjan wrote: > > Hi, > > During scheduling Im confronted with the fact that an > instruction is moved > > from the ready list to queued with the cost 2, while > according to my > > expectations the insn should have been moved to queued > with cost 1. > > > > Did anybody experience similar problem? > > From what you described it's not clear what the problem > is. When scheduler infers that an instruction cannot be > scheduled in the next N cycles (due to DFA hazard or > insn_cost/dep_cost hook considerations or due to something > else) the scheduler queues the instruction on (N+1) cycle. The point is that in found a case in which an insn is moved to queued with cost 2 while it could have been issued in next cycle (which corresponds to queued with cost 1). I think the problem is not in the automaton state transition but in the vector_min_issue_delay. Do you have some documentation of the way vector_min_issue_delay is constructed? > Correct, scheduler would be working unnecessarily long > otherwise. Maxim thanks a lot for your answer.
Re: Illegal schedule
> Do I have to reorganize the code prior to slot filling? Do > I have to make sure that some problematic instructions do > not appear in slots? Perhaps a easy way to solve the problem would be to claim for branches a memory port a number of stages before and after the IF; to avoid in this way hazards in the delay slots. Alternatively you can adjust the gcc delay slot filler (reorg.c) for your architecture.
reload question
Hi, My port supports hardware loops generated by the following (do_end) pattern: (set (pc) (if_then_else (ne (match_operand:HI 0 "general_register_operand" "d") (const_int 0)) (label_ref (match_operand 1 "" "")) (pc))) (set (match_operand:HI 2 "general_register_operand" "=0") (plus:HI (match_operand:HI 3 "general_register_operand" "0") (const_int -1))) (clobber (match_operand:BI 4 "predicate_register_operand" "=j")) When Im compiling a loop with high register pressure the register allocator does not allocate a register for the loop counter (i.e., operand 0) as it has a long life range. Thus operand 0 has to be reloaded but then I get a failure in the reload: error: unable to generate reloads for: (jump_insn 211 182 188 11 lib5.c:51 (parallel [ (set (pc) (if_then_else (ne (reg:HI 292 [ N ]) (const_int 0 [0x0])) (label_ref 183) (pc))) (set (reg:HI 292 [ N ]) (plus:HI (reg:HI 292 [ N ]) (const_int -1 [0x]))) (clobber (reg:BI 33 p1 [293])) ]) 506 {do_endhi} (expr_list:REG_UNUSED (reg:BI 33 p1 [293]) (expr_list:REG_BR_PROB (const_int 9100 [0x238c]) (nil))) -> 183) lib5.c:105:1: internal compiler error: in find_reloads, at reload.c:3821 Can anybody give me a hint? I am aware of the following msg: http://gcc.gnu.org/ml/gcc/2001-09/msg00942.html but still dont know how to make reload do the work. thanks, Alex
Re: reload question
> insns which branch are not allowed to have output > reloads. You must > support any kind of register as well as memory operands in > your insn for > the loop counter. Thanks for answer but what do you suggest to do, as my architecture done not support HW loops with memory operands? Alex
CFG traversal
Hi, Is there functionality in gcc based on which the CFG can be traversed in such a way that a node gets visited once all of its predecessors have been visited? thanks, Alex
secondary reload via 2 intermediary registers
Hi, I have 3 questions regarding secondary reload: 1.Is it possible to do the secondary reload via 2 intermediary registers? As far as I can see the insn that implements the secondary reload has to have 3 operands. 2. Is it possible that an instruction emitted during the secondary reload to get reloaded as well? and if yes how? thanks in advance, Alex
Re: secondary reload via 2 intermediary registers
Hi Jeff, Thanks for answer. I managed to make use of an architecture trick which allows me to get the secondary reload via only one intermediary register, but still have some comments to what you wrote me. > > 1.Is it possible to do the secondary reload via 2 > intermediary registers? > > As far as I can see the insn that implements the > secondary reload has to have 3 operands. > Make the scratch/intermediary register double-sized so that > you get a pair of registers instead of a single register for > the scratch/intermediary. In my case the intermediary registers were from two different register files holding data of different modes. To do what you suggest I would have had to interleave the two register files and introduce a new mode that occupies a pair of regs - one from each of the register files. > > 2. Is it possible that an instruction emitted during > the secondary reload to get reloaded as well? and if yes > how? > Typically you need to ensure that your reload_xxx expanders > generate RTL which does not need further reloading. > This makes handling secondary reloads rather complex in some > cases. I was asking for this because I was not able to do the secondary reload via 2 intermediary registers. Thus I was trying to provide the two registers - the first one via a first secondary reload which was generating two instructions among which one was needing again a secondary reload where I was planing to provide the second register. However the mechanism was not functioning.
unschedule insn functionality needed
Hi, Can anybody give me a hint on where (perhaps some branch) I can find functionality which allows during scheduling to un-schedule an instruction? I basically need a function that does the oposite of schedule_insn. During scheduling I want o schedule_insn(INSN), then check the ready_list and then (in case) unschedule INSN. thanks, Alex
Re: pipeline description
Alexander is right. Perhaps you can implement the TARGET_SCHED_ADJUST_COST , then catch in the debugger the two instructions that you expect to be scheduled together and see what the default latency is or if needed you may just adjust it to the proper value. Alex --- On Fri, 11/12/10, Alexander Monakov wrote: > From: Alexander Monakov > Subject: Re: pipeline description > To: "Ian Lance Taylor" > Cc: "roy rosen" , gcc@gcc.gnu.org > Date: Friday, November 12, 2010, 1:22 PM > > > On Thu, 11 Nov 2010, Ian Lance Taylor wrote: > > > roy rosen > writes: > > > > > If I have two insns: > > > r2 = r3 > > > r3 = r4 > > > It seems to me that the dependency analysis > creates a dependency > > > between the two and prevent parallelization. > Although there is a > > > dependency (because of r3) I want GCC to > parallelize them together. > > > Since if the insns are processed together the old > value of r3 is used > > > for the first insn before it is written by the > second insn. > > > How do I let GCC know about these things (When > exactly each operand is > > > used and when it is written)? > > > Is it in these hooks? > > > In which port can I see a good example for that? > > > > I was under the impression that an anti-dependence in > the same cycle was > > permitted to execute. But perhaps I am > mistaken. > > No, you are right. That is the norm when compiling > for ia64, for example. I > don't think the backend should specifically care about it: > the anti-dependency > gets zero latency, and the scheduler is able to issue the > second instruction > on the same cycle after issuing the first. > > Alexander >
selective scheduler failure
Hi, I get an assert failure during a run of the selective scheduler. The problem is due to the implementation of maybe_tidy_empty_bb. Inside of it I see the following piece of code: /* If it is possible - merge BB with its predecessor. */ if (can_merge_blocks_p (bb->prev_bb, bb)) sel_merge_blocks (bb->prev_bb, bb); During my compiler run I reach the situation in which bb->prev_bb is not in the scheduling scope BUT can_merge_blocks_p (bb->prev_bb, bb) returns true. As a consequence inside sel_remove_empty_bb I hit on the following assert: gcc_assert (in_current_region_p (merge_bb)); The backtrace from the assert is: #0 move_bb_info #1 sel_remove_empty_bb #2 sel_merge_blocks #3 maybe_tidy_empty_bb Perhaps one way to solve the problem is to replace: if (can_merge_blocks_p (bb->prev_bb, bb)) with if (can_merge_blocks_p (bb->prev_bb, bb) && in_current_region_p (bb->prev_bb)) I added this description to buggzila under Bug 53999. regards, Alex
Re: selective scheduler failure
Hi, I tried the patch but it doesnt solve my problem. The patch addresses the problem on the else branch but my problem is on the if: if (can_merge_blocks_p (bb->prev_bb, bb)) sel_merge_blocks ... Alex --- On Tue, 7/17/12, Alexander Monakov wrote: > From: Alexander Monakov > Subject: Re: selective scheduler failure > To: "Alex Turjan" > Cc: gcc@gcc.gnu.org > Date: Tuesday, July 17, 2012, 7:07 PM > > As a consequence inside > sel_remove_empty_bb I hit on the following assert: > > gcc_assert (in_current_region_p (merge_bb)); > > Sounds like your GCC tree does not have Andrey's fix for PR > 52250 (SVN > revision 184975). > > Alexander >
Re: selective scheduler failure
I found the patch from the following link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52250 But i dont see the removal of the assert in move_bb_info. Perhaps you have another patch related to this bug. thanks Alex --- On Tue, 7/17/12, Alexander Monakov wrote: > From: Alexander Monakov > Subject: Re: selective scheduler failure > To: "Alex Turjan" > Cc: gcc@gcc.gnu.org > Date: Tuesday, July 17, 2012, 7:57 PM > On Tue, Jul 17, 2012 at 8:39 PM, Alex > Turjan > wrote: > > > > Hi, > > I tried the patch but it doesnt solve my problem. > > The patch addresses the problem on the else branch but > my problem is on the if: > > if (can_merge_blocks_p (bb->prev_bb, bb)) > > sel_merge_blocks ... > > I don't understand. You said earlier that the assert in > move_bb_info > failed, but the patch removes that assert. So how does the > failure > look now? Can you provide a reproducible testcase? > > Alexander >
Re: selective scheduler failure
I applied the new patch and it seams fine. Thanks! Alex --- On Wed, 7/18/12, Alexander Monakov wrote: > From: Alexander Monakov > Subject: Re: selective scheduler failure > To: "Alex Turjan" > Cc: "Alexander Monakov" , gcc@gcc.gnu.org > Date: Wednesday, July 18, 2012, 1:42 AM > On Wed, Jul 18, 2012 at 1:05 AM, Alex > Turjan > wrote: > > I found the patch from the following link: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52250 > > See comment 9. The patch pasted in the audit trail is not > what has > been committed. > > Alexander >
SMS issues
Im writing to you with respect to some strange SMS functionality. In the code bellow there are 2 instructions (a builtin store and a builtin load) as they appear in the program flow before SMS: (insn 134 133 135 12 tdscdma_pfu_ccdec.c:289 (set (mem:HI (plus:PSI (reg/v/f:PSI 185 [ ccdecInterim_pi16 ]) (reg:SI 303)) [0 S2 A16]) (unspec:HI [ (reg/v:HI 244 [ outData_u16 ]) ] 1752)) 1377 {INSN_BUILTIN___storebyteofs_16} (expr_list:REG_DEAD (reg:SI 303) (expr_list:REG_DEAD (reg/v:HI 244 [ outData_u16 ]) (nil (insn 136 135 137 12 tdscdma_pfu_ccdec.c:292 (set (reg/v:HI 248 [ mappingAddress_i16 ]) (unspec:HI [ (mem:HI (plus:PSI (reg/v/f:PSI 170 [ curMappingTable_pi16 ]) (reg:SI 305)) [0 S2 A16]) ] 696)) 755 {INSN_BUILTIN___loadbyteofs_16} (expr_list:REG_DEAD (reg:SI 305) (nil))) I observed that in ddg.c there are 3 dependencies build between insn 134 and 136: 1. (T,0) - a true dep of distance 0. 2. (T,1) - a true dep of distance 1. 3. (A,1) - an anti dep of distance 1. Due to (T,1) dependence, further on during the calling of generate_reg_moves: 1. there is created a (modulo expansion) move instruction (insn 207 132 134 13 (set (reg:HI 326) (mem:HI (plus:PSI (reg/v/f:PSI 185 [ ccdecInterim_pi16 ]) (reg:SI 303)) [0 S2 A16])) -1 (nil)) 2. while generate_reg_moves attempts to replace in insn 136 the memory access (mem:HI (plus:PSI (reg/v/f:PSI 170 [ curMappingTable_pi16 ]) (reg:SI 305)) [0 S2 A16]) with the register (reg:HI 326) (the same as the one set by insn 207). But the replacement fails so insn 136 remains unchanged. Remark that later on insn 207 gets removed (during ira) Issues: 1. What is the reason why (T,1) is build up? – to me it seams that (T,0) must be enough 2. Looking inside generate_reg_moves it seams to me that this function is not meant to deal with replacing memory accesses but only with register replacements. (see inside the call to replace_rtx which in my case trys to replace the mem accesses inside 136). 3. The (T,1) dep is assumed to take place as if before the SMS pass, insn 136 was preceding insn 134: (insn 136 135 137 12 tdscdma_pfu_ccdec.c:292 (set (reg/v:HI 248 [ mappingAddress_i16 ]) (unspec:HI [ (mem:HI (plus:PSI (reg/v/f:PSI 170 [ curMappingTable_pi16 ]) (reg:SI 305)) [0 S2 A16]) ] 696)) 755 {INSN_BUILTIN___loadbyteofs_16} (expr_list:REG_DEAD (reg:SI 305) (nil))) (insn 134 133 135 12 tdscdma_pfu_ccdec.c:289 (set (mem:HI (plus:PSI (reg/v/f:PSI 185 [ ccdecInterim_pi16 ]) (reg:SI 303)) [0 S2 A16]) (unspec:HI [ (reg/v:HI 244 [ outData_u16 ]) ] 1752)) 1377 {INSN_BUILTIN___storebyteofs_16} (expr_list:REG_DEAD (reg:SI 303) (expr_list:REG_DEAD (reg/v:HI 244 [ outData_u16 ]) (nil If that would be the case then between 134 and 136 there would be present also an antidependence of distance 0. Becasue in the pipelined schedule, 134 is scheduled before 136 (SCHED_TIME (136) > SCHED_TIME (134)) the modulo variable expansion needs to take place as explained before. SMS decides to produce a modulo variable expansion in a case when is not needed. However, it fails in fulfilling the whole modulo variable expansion procedure, covering in this way the possibly incorrect behavior described above. regards, Alex
Re: SMS issues
Andrey, Thanks for the patch. I applied it and so far it seams ok. I will run further testing and let you know if i see problems. Back to the last part of my email, Im still wondering what happens in case the variable modulo expanded is a memory location? because as I see generate_reg_moves is not able to handle such situation... or perhaps there is something which prevents the modulo scheduler from arriving to this situation? Alex --- On Thu, 7/19/12, Andrey Belevantsev wrote: > From: Andrey Belevantsev > Subject: Re: SMS issues > To: "Alex Turjan" > Cc: gcc@gcc.gnu.org, ayal.z...@gmail.com, revital.e...@linaro.org, "Roman > Zhuikov" > Date: Thursday, July 19, 2012, 11:11 AM > Hello Alex, > > On 18.07.2012 18:40, Alex Turjan wrote: > > > > Im writing to you with respect to some strange SMS > functionality. > > In the code bellow there are 2 instructions (a builtin > store and a builtin load) > > as they appear in the program flow before SMS: > > ... > > > Issues: > > 1. What is the reason why (T,1) is > build up? – to me it seams that (T,0) > > must be enough > > This looks like the issue that Roman's patch from > http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01804.html > should be fixing, > could you try it? > > Ayal, Revital, could you again take a look at the above > patch and all the > SMS improvement patches mentioned in > http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01859.html? > The last comments > from me are at http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00478.html. > At > the Cauldron, I was talking to Ramana about pushing these > forward as > important for arm and Linaro, so it would be good to have > them in 4.8. > > Andrey > > > 2. Looking inside generate_reg_moves > it seams to me that this function > > is not meant to deal with replacing > memory accesses but only with > > register replacements. (see inside > the call to replace_rtx which in > > my case trys to replace the > mem accesses inside 136). > > > > 3. The (T,1) dep is assumed to take place as if before > the SMS pass, > > insn 136 was preceding insn 134: > > > > (insn 136 135 137 12 tdscdma_pfu_ccdec.c:292 > > (set (reg/v:HI 248 [ > mappingAddress_i16 ]) > > (unspec:HI [ > > > (mem:HI (plus:PSI (reg/v/f:PSI 170 [ > curMappingTable_pi16 ]) > > > (reg:SI 305)) [0 S2 > A16]) > > ] 696)) > 755 {INSN_BUILTIN___loadbyteofs_16} (expr_list:REG_DEAD > (reg:SI 305) > > (nil))) > > > > (insn 134 133 135 12 tdscdma_pfu_ccdec.c:289 > > (set (mem:HI (plus:PSI (reg/v/f:PSI > 185 [ ccdecInterim_pi16 ]) > > > (reg:SI 303)) [0 S2 A16]) > > (unspec:HI [ > > > (reg/v:HI 244 [ outData_u16 ]) > > ] > 1752)) 1377 {INSN_BUILTIN___storebyteofs_16} > (expr_list:REG_DEAD (reg:SI 303) > > (expr_list:REG_DEAD > (reg/v:HI 244 [ outData_u16 ]) > > (nil > > > > If that would be the case then between 134 and 136 > there would be present > > also an antidependence of distance 0. Becasue in the > pipelined schedule, > > 134 is scheduled before 136 (SCHED_TIME (136) > > SCHED_TIME (134)) the modulo > > variable expansion needs to take place as explained > before. > > > > SMS decides to produce a modulo variable expansion in a > case when is not > > needed. However, it fails in fulfilling the whole > modulo variable expansion > > procedure, covering in this way the possibly incorrect > behavior described above. > > > > regards, > > Alex > > > >
RTL alias analysis
Hi everybody, Im interested whether there is gcc support (at the RTL level) after unrolling which allows to disambiguate memory accesses present in RTL constructs (specially among memory accesses generated due to unrolling) for machines that do not support indexed addressing modes. I am aware of an article by Sanjiv Gupta and Naveen Sharma on alias analysis at the RTL level, which was published in the GCC 2003 summit. Are there any further gcc developments of this work or is there a patch concerning this work? thanks in advance, Alex Turjan No Cost - Get a month of Blockbuster Total Access now. Sweet deal for Yahoo! users and friends. http://tc.deals.yahoo.com/tc/blockbuster/text1.com
RE: Implement #pragma unroll?
Dear Bingfeng, Some time ago I had to deal with a similar issue as you. Basically I did as follows: I built a backend function which catches the unroll pragma and replaces it with a target assembly intrinsic (which of course has to be described in an .md file). After that in the RTL unroll phase, I analyze loop by loop and connect them to the corresponding unrolling intrinsic (which as a field contains the unrolling factor, you may add here extra information which allows you to recognize the loop) from where I decide the unrolling factor. After that in the RTL unroll phase I remove all the unroll intrinsics. hope this will help, Alex
RE: Implement #pragma unroll?
Bingfeng, to define the pragma you use something like this: #define REGISTER_TARGET_PRAGMAS() { c_register_pragma (0, "unroll_factor", support_unroll_pragma); } such that when you compile a loop like this you get it unrolled 20 times: #pragma unroll_factor=20 for(i = 0; i < 40; i++){ m[i]=3; } then you have to implement the support_unroll_pragma function which in my case I used it to extract the unrolling factor (in this case 20) and to insert in the parse tree an intrinsic. Here is my code: void support_unroll_pragma (cpp_reader * pfile ATTRIBUTE_UNUSED) { tree x; enum cpp_ttype token; int unroll_factor = 0; token = pragma_lex (&x); gcc_assert (token == CPP_EQ); token = pragma_lex (&x); gcc_assert (token == CPP_NUMBER); unroll_factor = TREE_INT_CST_LOW (x); tree type = build_function_type_list (void_type_node, integer_type_node, NULL_TREE); tree t = build_fn_decl ("__unroll_pragma", type); enum built_in_function code = DECL_FUNCTION_CODE (t); DECL_BUILT_IN_CLASS (t) = BUILT_IN_MD; TREE_PUBLIC (t) = 1; DECL_EXTERNAL (t) = 1; DECL_FUNCTION_CODE (t) = builtin_line_no("__unroll_pragma"); tree operand = build_int_cst (NULL_TREE, unroll_factor); t = build_call_expr (t, 1, operand); add_stmt (t); } trimedia_builtin_line_no generates a call described by the following pattern: (define_insn "customop_unroll_pragma" [(unspec_volatile:SI [ (match_operand:SI 0 "immediate_operand" "i") ] UNSPEC_unroll_pragma) ] "" "" ) which in loop-unroll (where you have RTL instructions) is caught with the with something like this: (INSN_P(insn) && recog_memoized(insn) == CODE_FOR_customop_unroll_pragma more or less this is it. Alex --- Bingfeng Mei <[EMAIL PROTECTED]> wrote: > Alex, Thanks for your suggestion. What target hook > do you use for the > backend function? > > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of > Alex Turjan > Sent: 29 May 2008 14:45 > To: gcc@gcc.gnu.org > Subject: RE: Implement #pragma unroll? > > Dear Bingfeng, > Some time ago I had to deal with a similar issue as > you. Basically I did as follows: I built a backend > function which catches the unroll pragma and > replaces > it with a target assembly intrinsic (which of course > has to be described in an .md file). After that in > the > RTL unroll phase, I analyze loop by loop and connect > them to the corresponding unrolling intrinsic (which > as a field contains the unrolling factor, you may > add > here extra information which allows you to recognize > the loop) from where I decide the unrolling factor. > After that in the RTL unroll phase I remove all the > unroll intrinsics. > hope this will help, > Alex > > > > > > >