Re: Question about find modifiable mems
On 04-Jun-15 03:54 AM, Jim Wilson wrote:
> On 06/02/2015 11:39 PM, shmeel gutl wrote:
>> find_modifiable_mems was introduced to gcc 4.8 in September 2012. Is
>> there any documentation as to how it is supposed to help the Haifa
>> scheduler?
> The patch was submitted here
> https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00155.html
> and this message contains a brief explanation of what it is supposed
> to do. The explanation looks like a useful optimization, but perhaps
> it is triggering in cases when it shouldn't.
> Jim

Thanks, this is what I was looking for. From the comments, he didn't intend to do what I saw. Probably the problem is in my port and the very special way that we handle instruction costs. If I see a problem that isn't specific to my port, I will report back.

Shmeel
porting to lra
Are there any guidelines as to what needs to be done in a backend to enable LRA for 5.2? When I turn it on I get two types of errors: 1) an insn is not recognized because the frame pointer hasn't been eliminated yet, and 2) "max number of generated reload insns". Any pointers will be appreciated.

Shmeel
Problem with tree pass pre
When dealing with an array with known values, pre will evaluate the first iteration of a loop over the elements. The code generator will then jump into the loop. This is at best increasing the size of the code. It also creates inferior code when the hardware supports zero overhead loops. The attached code demonstrates the difference between an unknown array and a known array. The loop size has been picked large enough for cunrolli to not fully unroll the loop. The problem did not exist in gcc 4.8.

extern int B[27];
int foo()
{
  int i;
  int t = 0;
  for (i = 0; i < 27; i++)
    t += B[i];
  return t;
}

int boo()
{
  int i;
  int t = 0;
  static int A[] = {1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9};
  for (i = 0; i < 27; i++)
    t += A[i];
  return t;
}
Re: Problem with tree pass pre
On 31-Aug-15 02:19 PM, Richard Biener wrote:
> On Mon, Aug 31, 2015 at 6:51 AM, shmeel gutl wrote:
>> When dealing with an array with known values, pre will evaluate the
>> first iteration of a loop over the elements. The code generator will
>> then jump into the loop. This is at best increasing the size of the
>> code. It also creates inferior code when the hardware supports zero
>> overhead loops. The attached code demonstrates the difference between
>> an unknown array and a known array. The loop size has been picked
>> large enough for cunrolli to not fully unroll the loop. The problem
>> did not exist in gcc 4.8.
> I think you were just lucky with GCC 4.8 - the issue is present since
> forever. Basically it's because we treat a constant as available. So
> PRE might end up rotating the loop, inserting the 2nd iteration on the
> latch edge. Unfortunately this transform sometimes improves code-gen;
> it would be quite simple to disallow this kind of transform generally
> though.
> Richard.

It seems to be a bad optimization for zero overhead loops and/or software pipelining. Can it be disabled when these features are available?

Shmeel
Re: Acceptance criteria for the git conversion
On 01-Sep-15 01:54 PM, Eric S. Raymond wrote:
> What kind of mechanical transformation or hand-editing would add value
> for you?

I am working from a clone of the current git repository. Is there an automated procedure that will enable me to switch to the new repository and still keep all of the commit history of my local branches?
Help porting to lra
I am trying to enable LRA for my VLIW architecture and I am encountering a "max number of generated reload insns" problem. The problem seems elementary but I don't see the correction. Consider

  r1 = r2 + r3
  s1 = r1 + r4
  call func(r3)
  r5 = s1 + r1

where the s registers are pseudo registers which IRA maps to callee-saved registers and the r registers are pseudo registers which IRA maps to caller-saved registers. The inheritance pass sees that r1 is still live across a call so it generates a spill using split_reg. call_save_p is true so it spills to a pseudo register which ends up getting the same hard register assignment as r1. Therefore nothing is solved; the new register is also live across the call. The call to emit_spill_move looks like it is expecting a memory destination but it is in fact receiving a pseudo register. Did I miss some kind of hook that makes the spill go to the stack? Reload gets it right.

Thanks, Shmeel
Re: Question about find modifiable mems
On 03-Jun-15 09:39 AM, shmeel gutl wrote:
> find_modifiable_mems was introduced to gcc 4.8 in September 2012. Is
> there any documentation as to how it is supposed to help the Haifa
> scheduler? In my private port of gcc it makes the following type of
> transformation, from
>   a = *(b+20)
>   b += 30
> to
>   b += 30
>   a = *(b-10)
> Although this is functionally correct, it has changed an ANTI_DEP into
> a TRUE_DEP and thus introduced stalls. If it went the other way, that
> would be good. Any pointers?
> Thanks, Shmeel

It seems that the problem comes from the change from ANTI_DEP to TRUE_DEP. The flow graph needs to be updated to reflect this change. Can someone look into this?

Shmeel
Help with lra
I am trying to enable LRA for a proprietary backend. I ran into one problem that I can't solve. In lra-constraints.c:split_reg, lra_create_new_reg can be called with a hard-coded rclass of NO_REGS. It then queues a move instruction of the form

  (set TYPE:new_reg TYPE:old_reg)

But the NO_REGS rclass stops new_reg from matching a register constraint and forces a reload. The reload will then have the same problem. This recurses until the recursion limit is hit. What is my backend missing that will allow a register assignment to new_reg?

Thanks, Shmeel
Re: Help with lra
On 03-Aug-16 12:10 AM, Vladimir Makarov wrote:
> On 08/02/2016 04:41 PM, shmeel gutl wrote:
>> I am trying to enable LRA for a proprietary backend. I ran into one
>> problem that I can't solve. In lra-constraints.c:split_reg,
>> lra_create_new_reg can be called with a hard-coded rclass of NO_REGS.
>> It then queues a move instruction of the form
>>   (set TYPE:new_reg TYPE:old_reg)
>> But the NO_REGS rclass stops new_reg from matching a register
>> constraint and forces a reload. The reload will then have the same
>> problem. This recurses until the recursion limit is hit. What is my
>> backend missing that will allow a register assignment to new_reg?
> NO_REGS in this case means memory and the generated RTL move insn
> finally should be a target load or store insn. It is hard to say w/o
> looking at the code but, probably, your move insn descriptions do not
> have memory constraints (or these constraints are quite specific).

Currently our memory constraints only match memory operands. I assume that you are suggesting that pseudo registers should match memory constraints. Is this true only for LRA, or would reload also benefit from such a change? Would other passes gain by such a change? Is any extra support needed in patterns or hooks?

Thanks, Shmeel
Re: Help with lra
On 10-Aug-16 08:41 PM, Vladimir N Makarov wrote:
> On 08/09/2016 12:33 AM, shmeel gutl wrote:
>> On 03-Aug-16 12:10 AM, Vladimir Makarov wrote:
>>> [...]
>>> NO_REGS in this case means memory and the generated RTL move insn
>>> finally should be a target load or store insn. It is hard to say w/o
>>> looking at the code but, probably, your move insn descriptions do
>>> not have memory constraints (or these constraints are quite
>>> specific).
>> Currently our memory constraints only match memory operands. I assume
>> that you are suggesting that pseudo registers should match memory
>> constraints. Is this true only for LRA, or would reload also benefit
>> from such a change? Would other passes gain by such a change? Is any
>> extra support needed in patterns or hooks?
> Move insn descriptions are quite specific. When you make a port it is
> better to have only one move insn for a given mode (although there are
> some tricks to avoid this). Therefore move insns have a lot of
> alternatives. That is what I meant. As for the memory constraint, you
> should not return true for a pseudo. Reload/LRA can figure out how to
> match a spilled pseudo with memory (but this constraint should be a
> define_memory_constraint; I saw mistakes when people used different
> forms of constraints for memory and had problems). Again, it is hard
> to say anything definite w/o seeing the code and the actual problem.

You hit it on the head with define_memory_constraint.
Reload didn't seem to need it but LRA does. Thank you, Shmeel
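For reference, a minimal sketch of the distinction discussed above (the constraint letter and predicate are illustrative, not from any real port): the two forms look almost identical in the machine description, but only the second tells LRA that it may satisfy the constraint by spilling a pseudo to a stack slot.

```lisp
;; Opaque to LRA: it cannot assume that reloading a pseudo into a
;; stack slot will make the operand match "Q".
(define_constraint "Q"
  "A memory operand."                          ; illustrative
  (match_test "memory_operand (op, mode)"))

;; LRA knows it may spill a pseudo to memory to satisfy "Q".
(define_memory_constraint "Q"
  "A memory operand."                          ; illustrative
  (match_test "memory_operand (op, mode)"))
```

This is a machine-description fragment, not compilable code; see the gccint constraints documentation for the exact semantics of define_memory_constraint.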
Re: Help with lra
On 10-Aug-16 08:41 PM, Vladimir N Makarov wrote:
> As for the memory constraint, you should not return true for a pseudo.
> Reload/LRA can figure out how to match a spilled pseudo with memory
> (but this constraint should be a define_memory_constraint; I saw
> mistakes when people used different forms of constraints for memory
> and had problems).

For its own reasons, xtensa returns true for a pseudo during reload for a memory-type constraint that shouldn't use the constant pool, but doesn't mark it as a memory constraint. What will that architecture lose because of that, and will it work with LRA?
negative latencies
Are there hooks in gcc to deal with negative latencies? In other words, an architecture that permits an instruction to use a result from an instruction that will be issued later. At first glance it seems that it will break a few things.
1) The definition of dependencies cannot come from the simple ordering of rtl.
2) The scheduling problem starts to look like "get off the train 3 stops before me".
3) The definition of live ranges needs to use actual instruction timing information, not just instruction sequencing.
The hooks in the scheduler seem to be enough to stop damage but not enough to take advantage of this "feature".

Thanks
Re: negative latencies
On 19-May-14 09:39 AM, Andrew Pinski wrote:
> On Sun, May 18, 2014 at 11:13 PM, shmeel gutl wrote:
>> Are there hooks in gcc to deal with negative latencies? In other
>> words, an architecture that permits an instruction to use a result
>> from an instruction that will be issued later.
> Do you mean bypasses? If so there is a bypass feature which you can use:
> https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html#index-data-bypass-3773
> Thanks, Andrew Pinski

Unfortunately, bypasses in the pipeline description are not enough. They only allow you to calculate the latency of true dependencies, and they are also forced to be zero or greater. The real question is how the scheduler and register allocator can deal with negative latencies.

Thanks, Shmeel
Re: negative latencies
On 19-May-14 01:02 PM, Ajit Kumar Agarwal wrote:
> Is it the case of code speculation where the negative latencies are
> used?

No. It is an exposed pipeline where instructions read registers during the required cycle. So if one instruction produces its result in the third pipeline stage and a second instruction reads the register in the sixth pipeline stage, the second instruction can read the result of the first instruction even if it is issued three cycles earlier.
Re: negative latencies
On 20-May-14 06:13 PM, Vladimir Makarov wrote:
> On 05/19/2014 02:13 AM, shmeel gutl wrote:
>> Are there hooks in gcc to deal with negative latencies? In other
>> words, an architecture that permits an instruction to use a result
>> from an instruction that will be issued later.
> Could you explain more on *an example* what you are trying to achieve
> with the negative latency. The scheduler is based on a critical path
> algorithm. Generally speaking, latency time can be negative for this
> algorithm. But I guess that is not what you are asking.

The architecture has an exposed pipeline where instructions read registers during the required cycle. So if one instruction produces its result in the third pipeline stage and a second instruction reads the register in the sixth pipeline stage, the second instruction can read the result of the first instruction even if it is issued three cycles earlier.

The problem that I see is that the haifa scheduler schedules one cycle at a time, in a forward order, by picking from a list of instructions that can be scheduled without delays. So, in the above example, if instruction one is scheduled during cycle 3, it can't schedule instruction two during cycle 0, 1, or 2 because its producer dependency (instruction one) hasn't been scheduled yet. It won't be able to schedule it until cycle 3. So I am asking if there is an existing mechanism to back schedule instruction two once instruction one is issued.

Thanks, Shmeel
Re: negative latencies
On 21-May-14 06:30 PM, Vladimir Makarov wrote:
> I am just curious what happens when you put insn2, insn1, and insn2
> uses a result of insn1 in 6 cycles and insn1 produces the result in 3
> cycles, but there are no ready functional units (e.g. arithmetic
> units) necessary for insn1 for 4 or more cycles. It is quite
> non-trivial to guarantee that everything will be okay in the general
> case if you put insn2 before insn1.

This is not a problem for this architecture. The units are fully pipelined and the only conflicts are in the first stage, during instruction issue. That is, the vliw must be legal. The gcc dfa handles this case fine.
Re: negative latencies
On 22-May-14 07:21 PM, Bernd Schmidt wrote:
> On 05/21/2014 05:30 PM, Vladimir Makarov wrote:
>> On 2014-05-20, 5:18 PM, shmeel gutl wrote:
>>> The problem that I see is that the haifa scheduler schedules one
>>> cycle at a time, in a forward order, by picking from a list of
>>> instructions that can be scheduled without delays. So, in the above
>>> example, if instruction one is scheduled during cycle 3, it can't
>>> schedule instruction two during cycle 0, 1, or 2 because its
>>> producer dependency (instruction one) hasn't been scheduled yet. It
>>> won't be able to schedule it until cycle 3. So I am asking if there
>>> is an existing mechanism to back schedule instruction two once
>>> instruction one is issued.
>> I see, thanks. There is no such mechanism in the current insn
>> scheduler.
> Well, the scheduler has support for an exposed pipeline that is used
> by the C6X port. Insns are split into multiple pieces which are forced
> to be scheduled at a fixed distance in time from each other, each
> piece describing the effects that occur at that point in time. This
> could probably be made to work for this target's requirements, but it
> might run quite slowly.
> Bernd

An exposed pipeline is not my problem; negative latency is my problem. I don't see negative latency for c6x, not in unit reservations and not in adjust cost. Did I miss something?

Shmeel
Re: negative latencies
On 23-May-14 01:59 PM, Bernd Schmidt wrote:
> On 05/23/2014 10:07 AM, shmeel gutl wrote:
>> Exposed pipeline is not my problem. Negative latency is my problem. I
>> don't see negative latency for c6x, not in unit reservations and not
>> in adjust cost. Did I miss something?
> You just need to model it differently. Rather than saying instruction
> A has a negative latency relative to instruction B, you need to
> describe that instruction B reads its inputs later than when it is
> actually issued. The mechanism used in the C6X backend is the
> scheduler's record_delay_slot_pair function. The scheduler would see
>   B is issued (*)
>   A is issued, executes and writes its outputs
>   B reads its inputs (*)
> The two insns marked as (*) would be such a delay pair. The first one
> would generate code; the second one exists only for the purposes of
> building the right scheduling dependencies.
> Bernd

Okay, I think that I have the idea. But I would still need to backtrack if the enabling instruction is not issued on time. I would also need to delay the dependent instruction if I can see in advance that the producer cannot be issued on time. And, as Vladimir pointed out, I need to watch out for various passes inserting unwanted instructions. Sounds like a big project.
Re: negative latencies
On 23-May-14 05:20 PM, Vladimir Makarov wrote:
> On 2014-05-23, 3:49 AM, shmeel gutl wrote:
>> On 21-May-14 06:30 PM, Vladimir Makarov wrote:
>>> [...]
>> This is not a problem for this architecture. The units are fully
>> pipelined and the only conflicts are in the first stage, during
>> instruction issue. That is, the vliw must be legal. The gcc dfa
>> handles this case fine.
> Another problem is that besides the insn scheduler there are a lot of
> optimizations which can insert some insns between the two insns after
> scheduling. In this case the result might not be ready for insn2. So
> you should at least exclude the 1st insn scheduling (before RA) and
> make the 2nd insn scheduling the very last pass. In general, a
> traditional approach is to do such things at the assembler level
> (e.g. as for older MIPS processors without hardware interlocks).

There is also IRA. I would need to tweak the definition of live ranges.
Re: Using associativity for optimization
It works fine for my test case. Thanks

Richard Biener wrote:
> On Tue, Dec 2, 2014 at 12:11 AM, shmeel gutl wrote:
>> While testing my implementation of passing arguments in registers, I
>> noticed that gcc 4.7 creates instruction dependencies when it doesn't
>> have to. Consider:
>>   int foo(int a1, int a2, int a3, int a4)
>>   {
>>     return a1|a2|a3|a4;
>>   }
>> gcc, even with -O2, generated code that was equivalent to
>>   temp1 = a1 | a2;
>>   temp2 = temp1 | a3;
>>   temp3 = temp2 | a4;
>>   return temp3;
>> This code must be executed serially. Could I create patterns, or
>> enable optimizations, that would cause the compiler to generate
>>   temp1 = a1 | a2;
>>   temp2 = a3 | a4;
>>   temp3 = temp1 | temp2;
>> thereby allowing the scheduler to compute temp1 and temp2 in parallel?
> You can tune it with --param tree-reassoc-width=N, not sure if that
> was implemented for 4.7 already.
> Richard.
Re: Using associativity for optimization
On 02-Dec-14 12:23 PM, Richard Biener wrote:
> On Tue, Dec 2, 2014 at 12:11 AM, shmeel gutl wrote:
>> While testing my implementation of passing arguments in registers, I
>> noticed that gcc 4.7 creates instruction dependencies when it doesn't
>> have to. Consider:
>>   int foo(int a1, int a2, int a3, int a4)
>>   {
>>     return a1|a2|a3|a4;
>>   }
>> gcc, even with -O2, generated code that was equivalent to
>>   temp1 = a1 | a2;
>>   temp2 = temp1 | a3;
>>   temp3 = temp2 | a4;
>>   return temp3;
>> This code must be executed serially. Could I create patterns, or
>> enable optimizations, that would cause the compiler to generate
>>   temp1 = a1 | a2;
>>   temp2 = a3 | a4;
>>   temp3 = temp1 | temp2;
>> thereby allowing the scheduler to compute temp1 and temp2 in parallel?
> You can tune it with --param tree-reassoc-width=N, not sure if that
> was implemented for 4.7 already.
> Richard.

Works fine for this test case. Thanks
Re: A Question About LRA/reload
On 09-Dec-14 07:56 PM, Jeff Law wrote:
> On 12/09/14 10:10, Vladimir Makarov wrote:
>> generate the correct code in many cases even for x86. Jeff Law tried
>> IRA coloring reusage too for reload but the whole RA became slower
>> (although he achieved performance improvements on x86).
> Right. After IRA was complete, I'd walk over the unallocated allocnos
> and split their ranges at EBB boundaries. That created new allocnos
> with a smaller conflict set and reduced the conflict set for the
> original unallocated allocnos. After I'd done that splitting for all
> the EBBs, I called back into ira_reassign_pseudos to try to assign the
> original unallocated allocnos as well as the new allocnos. To get good
> results, much of IRA's cost analysis had to be redone from scratch.
> And from a compile-time standpoint, that's a killer.
> The other approach I was looking at was a backwards walk through each
> block. When I found an insn with an unallocated pseudo, I'd trigger
> one of various range splitting techniques to try and free up a hard
> register. Then again I'd call into ira_reassign_pseudos to try the
> allocations again. This got even better results, but was obviously
> even more compile-time expensive.
> I don't think much, if any, of that work is relevant given the current
> structure and effectiveness of LRA.
> jeff

Are any of these versions available? I wouldn't mind a 50% penalty in compile time if it gave me a 5% improvement in the generated code.
rnreg and vliw
It seems that in gcc 4.7, the rnreg pass for renaming registers after reload is not vliw aware. In particular, I saw it reassign a register that is in use in the same vliw. To be more concrete, I saw it change the following pseudo code

  DI:a30 = v0
  SI:a14 = -a14

to

  DI:a30 = v0
  SI:a31 = -a14

since a31 was never referenced again. This won't work inside a vliw since it causes two instructions to set a31. Even though rnreg runs before sched2, it runs after software pipelining, which creates its own vliws. Is there any easy fix for this?

Thanks, Shmeel
Re: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
On 16-May-15 03:49 PM, Ajit Kumar Agarwal wrote:
> if (loop && loop->latch == bb || loop->header == bb)

Please add parentheses to the various occurrences of this code fragment. Better if the precedence is explicit.
Question about find modifiable mems
find_modifiable_mems was introduced to gcc 4.8 in September 2012. Is there any documentation as to how it is supposed to help the Haifa scheduler? In my private port of gcc it makes the following type of transformation, from

  a = *(b+20)
  b += 30

to

  b += 30
  a = *(b-10)

Although this is functionally correct, it has changed an ANTI_DEP into a TRUE_DEP and thus introduced stalls. If it went the other way, that would be good. Any pointers?

Thanks, Shmeel
Memory dependence
In the architecture that I am using, there is a big pipeline penalty for read after write to the same memory location. Is it possible to tell the difference between a possible memory conflict and a definite memory conflict?
Testing timing tables
The latency calculations in my backend are very complicated. Is there any automated way to test them?
conflict between scheduler and register allocator
I am having trouble meeting the constraints of the scheduler and the register allocator for my back end. The relevant features are:
1) VLIW - up to 4 instructions can be issued each cycle.
2) If a vliw bundle has both a set and a use of a register, the use sees the old value.
3) A call instruction pushes r30 and r31 to the stack, making them natural candidates for callee-saved registers.
The problem is that the scheduler might include an instruction that sets r30 in the same vliw as a call. This would result in a stale value being saved to the stack. (Note: the call instruction is not truly dependent on r30, just that r30 can't be set in the vliw that contains the call.) On the other hand, if I declare that the call uses r30, the register allocator will refuse to use r30 since it thinks that the register is live. I know that I can use a hook to fix up the first problem by breaking a single vliw into two bundles, but that has a performance penalty. Is there a way to tell the scheduler to avoid issuing an instruction that sets r30 or r31 in the same bundle that contains a call instruction?

Thank you for any pointers.
Re: conflict between scheduler and register allocator
On 09-Aug-13 07:35 PM, Vladimir Makarov wrote:
> On 13-08-09 7:25 AM, shmeel gutl wrote:
>> [...]
>> Is there a way to tell the scheduler to avoid issuing an instruction
>> that sets r30 or r31 in the same bundle that contains a call
>> instruction?
> You should look at haifa-sched.c::schedule_block. There are a lot of
> hooks called at different stages of the list scheduling algorithm.
> Depending on the algorithm stage at which you want to do this, you can
> use a specific hook. I'd pay attention to targetm.sched.reorder[2].
> You can also look at the hooks implemented for IA64 as it is the most
> widely used VLIW architecture for now, but the implementations of some
> IA64 hooks are pretty big.

Thank you. reorder2 does indeed let me block conflicting insns from being issued during the same cycle.
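As a pseudocode sketch (not compilable as-is) of how a TARGET_SCHED_REORDER2 implementation might enforce the restriction above: the hook sees the ready list after each group of insns is issued in the current cycle, so it can hide the insns that set r30/r31 once a call has gone out in that cycle. The hook signature follows the gccint documentation; every helper below is hypothetical and would have to be written for the port, and the exact ready-list ordering and return-value semantics should be taken from the gccint scheduler-hook docs rather than from this sketch.

```c
/* Pseudocode sketch -- helper functions are hypothetical.
   Called after insns have been issued in the current cycle; keep
   insns that set r30/r31 out of contention when a call has already
   been issued this cycle, so they cannot land in the call's bundle. */
static int
my_sched_reorder2 (FILE *dump, int verbose, rtx *ready,
                   int *n_readyp, int clock)
{
  if (call_issued_this_cycle ())                 /* hypothetical */
    {
      for (int i = *n_readyp - 1; i >= 0; i--)
        if (insn_sets_r30_or_r31 (ready[i]))     /* hypothetical */
          demote_in_ready_list (ready, n_readyp, i); /* hypothetical:
                move it where it cannot be picked this cycle */
    }
  return my_issue_rate ();                       /* hypothetical */
}
```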
avoiding extra .loc directives
For my VLIW toolchain, I am not allowed to output .loc directives in the middle of a VLIW bundle. Following the lead of the bfin backend, I scan bundles during the machine reorg pass and ensure that all of the insns of a bundle have the same location. This solves most of the problems, but final.c will also output the .loc directive after encountering a NOTE_INSN_PROLOGUE_END. There are two obvious solutions to this: 1) eliminate the note in machine reorg. 2) eliminate the line "force_source_line = true;" from the appropriate line in final.c. Would either of these alternatives cause a problem? Is there a better way to avoid having .loc directives inside bundles?
Re: Dependency confusion in sched-deps
On 05-Dec-13 02:39 AM, Maxim Kuvyrkov wrote:
> Dependency type plays a role for estimating costs and latencies
> between instructions (which affects performance), but using wrong or
> imprecise dependency type does not affect correctness.

On multi-issue architectures it does make a difference. An anti dependence permits the two instructions to be issued during the same cycle, whereas a true dependency or an output dependency would forbid this. Or am I misinterpreting your comment?
Re: Dependency confusion in sched-deps
On 06-Dec-13 01:34 AM, Maxim Kuvyrkov wrote:
> On 6/12/2013, at 8:44 am, shmeel gutl wrote:
>> On 05-Dec-13 02:39 AM, Maxim Kuvyrkov wrote:
>>> Dependency type plays a role for estimating costs and latencies
>>> between instructions (which affects performance), but using wrong or
>>> imprecise dependency type does not affect correctness.
>> On multi-issue architectures it does make a difference. An anti
>> dependence permits the two instructions to be issued during the same
>> cycle, whereas a true dependency or an output dependency would forbid
>> this. Or am I misinterpreting your comment?
> On VLIW-flavoured machines without resource conflict checking -- "yes",
> it is critical not to use an anti dependency where an output or true
> dependency exists. This is the case, though, only because these
> machines do not follow sequential semantics for instruction execution
> (i.e., effects from previous instructions are not necessarily observed
> by subsequent instructions on the same/close cycles). On machines with
> internal resource conflict checking, having a wrong type on the
> dependency should not cause wrong behavior, but "only" suboptimal
> performance.
> Thank you,
> -- Maxim Kuvyrkov
> www.kugelworks.com

Earlier in the thread you wrote that output dependency is the right type (write after write), anti dependency is write after read, and true dependency is read after write. Should the code be changed to accommodate vliw machines? It has been there since the module was originally checked into trunk.
exposed pipeline
For the 4.7 branch I only saw one architecture using exposed pipeline. Is there any documentation on the quality of exposed pipeline support? Does the back-end need to do anything special to deal with jumps and returns from calls? Thanks Shmeel
Re: Request for discussion: Rewrite of inline assembler docs
On 28-Mar-14 01:46 PM, Hannes Frederic Sowa wrote:
> On Fri, Mar 28, 2014 at 09:41:41AM +, Andrew Haley wrote:
>> Ok, I see the problem. Maybe something like this by avoiding the
>> term? "Using this clobber causes the compiler to flush all (modified)
>> registers being used to store values which gcc decided to originally
>> allocate in memory before executing the @code{asm} statement." What
>> is true here is that all registers used to cache variables that are
>> reachable from pointers in the program are flushed. Anything that is
>> statically allocated is reachable, as is anything dynamically
>> allocated by malloc; auto variables are not reachable unless their
>> address is taken.
> One would have to go into detail of various optimizations which could
> remove the address taking, e.g. IMHO the last sentence of the
> paragraph already deals with this.
> Bye, Hannes

By this wording ("gcc decided"), the consequences are unpredictable. You should probably mention the registers which will definitely be flushed and thereby limit the ambiguity.