Re: Question about find modifiable mems
On 04-Jun-15 03:54 AM, Jim Wilson wrote:
> On 06/02/2015 11:39 PM, shmeel gutl wrote:
>> find_modifiable_mems was introduced to gcc 4.8 in September 2012. Is
>> there any documentation as to how it is supposed to help the Haifa
>> scheduler?
> The patch was submitted here
> https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00155.html
> and this message contains a brief explanation of what it is supposed
> to do. The explanation looks like a useful optimization, but perhaps
> it is triggering in cases when it shouldn't.
> Jim

Thanks, this is what I was looking for. From the comments, he didn't intend to do what I saw. Probably the problem is in my port and the very special way that we handle instruction costs. If I see a problem that isn't specific to my port, I will report back.

Shmeel
porting to lra
Are there any guidelines as to what needs to be done in a backend to enable LRA for 5.2? When I turn it on I get two types of errors: 1) an insn is not recognized because the frame pointer hasn't been eliminated yet, and 2) "max number of generated reload insns". Any pointers will be appreciated.

Shmeel
Problem with tree pass pre
When dealing with an array with known values, pre will evaluate the first iteration of a loop over the elements. The code generator will then jump into the loop. This is at best increasing the size of the code. It also creates inferior code when the hardware supports zero overhead loops. The attached code demonstrates the difference between an unknown array and a known array. The loop size has been picked large enough for cunrolli to not fully unroll the loop. The problem did not exist in gcc 4.8.

extern int B[27];
int foo()
{
  int i;
  int t = 0;
  for (i = 0; i < 27; i++)
    t += B[i];
  return t;
}

int boo()
{
  int i;
  int t = 0;
  static int A[] = {1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9};
  for (i = 0; i < 27; i++)
    t += A[i];
  return t;
}
Re: Problem with tree pass pre
On 31-Aug-15 02:19 PM, Richard Biener wrote:
> On Mon, Aug 31, 2015 at 6:51 AM, shmeel gutl wrote:
>> When dealing with an array with known values, pre will evaluate the
>> first iteration of a loop over the elements. The code generator will
>> then jump into the loop. This is at best increasing the size of the
>> code. It also creates inferior code when the hardware supports zero
>> overhead loops. The attached code demonstrates the difference between
>> an unknown array and a known array. The loop size has been picked
>> large enough for cunrolli to not fully unroll the loop. The problem
>> did not exist in gcc 4.8.
> I think you were just lucky with GCC 4.8 - the issue is present since
> forever. Basically it's because we treat a constant as available. So
> PRE might end up rotating the loop, inserting the 2nd iteration on the
> latch edge. Unfortunately this transform sometimes improves code-gen;
> it would be quite simple to disallow this kind of transform generally
> though.
> Richard.

It seems to be a bad optimization for zero overhead loops and/or software pipelining. Can it be disabled when these features are available?

Shmeel
Re: Acceptance criteria for the git conversion
On 01-Sep-15 01:54 PM, Eric S. Raymond wrote:
> What kind of mechanical transformation or hand-editing would add value
> for you?

I am working from a clone of the current git repository. Is there an automated procedure that will enable me to switch to the new repository and still keep all of the commit history of my local branches?
Help porting to lra
I am trying to enable LRA for my VLIW architecture and I am encountering a "max number of generated reload insns" problem. The problem seems elementary but I don't see the correction. Consider

  r1 = r2 + r3
  s1 = r1 + r4
  call func(r3)
  r5 = s1 + r1

where the s registers are pseudo registers which IRA maps to callee-saved registers and the r registers are pseudo registers which IRA maps to caller-saved registers. The inheritance pass sees that r1 is still live across a call so it generates a spill using split_reg. call_save_p is true so it spills to a pseudo register which ends up getting the same hard register assignment as r1. Therefore nothing is solved; the new register is also live across the call. The call to emit_spill_move looks like it is expecting a memory destination but it is in fact receiving a pseudo register. Did I miss some kind of hook that makes the spill go to the stack? Reload gets it right.

Thanks, Shmeel
Re: Question about find modifiable mems
On 03-Jun-15 09:39 AM, shmeel gutl wrote:
> find_modifiable_mems was introduced to gcc 4.8 in September 2012. Is
> there any documentation as to how it is supposed to help the Haifa
> scheduler? In my private port of gcc it makes the following type of
> transformation, from
>   a = *(b+20)
>   b += 30
> to
>   b += 30
>   a = *(b-10)
> Although this is functionally correct, it has changed an ANTI_DEP into
> a TRUE_DEP and thus introduced stalls. If it went the other way, that
> would be good. Any pointers?
> Thanks, Shmeel

It seems that the problem comes from the change from ANTI_DEP to TRUE_DEP. The flow graph needs to be updated to reflect this change. Can someone look into this?

Shmeel
Help with lra
I am trying to enable LRA for a proprietary backend. I ran into one problem that I can't solve. In lra-constraints.c:split_reg, lra_create_new_reg can be called with a hard-coded rclass of NO_REGS. It then queues a move instruction of the form

  (set TYPE:new_reg TYPE:old_reg)

But the NO_REGS rclass stops new_reg from matching a register constraint and forces a reload. The reload will then have the same problem. This recurses until the recursion limit is hit. What is my backend missing that will allow a register assignment to new_reg?

Thanks, Shmeel
Re: Help with lra
On 03-Aug-16 12:10 AM, Vladimir Makarov wrote:
> On 08/02/2016 04:41 PM, shmeel gutl wrote:
>> I am trying to enable LRA for a proprietary backend. I ran into one
>> problem that I can't solve. In lra-constraints.c:split_reg,
>> lra_create_new_reg can be called with a hard-coded rclass of NO_REGS.
>> It then queues a move instruction of the form
>>   (set TYPE:new_reg TYPE:old_reg)
>> But the NO_REGS rclass stops new_reg from matching a register
>> constraint and forces a reload. The reload will then have the same
>> problem. This recurses until the recursion limit is hit. What is my
>> backend missing that will allow a register assignment to new_reg?
> NO_REGS in this case means memory and the generated RTL move insn
> finally should be a target load or store insn. It is hard to say w/o
> looking at the code but, probably, your move insn descriptions do not
> have memory constraints (or these constraints are quite specific).

Currently our memory constraints only match memory operands. I assume that you are suggesting that pseudo registers should match memory constraints. Is this true only for LRA, or would reload also benefit from such a change? Would other passes gain by such a change? Is any extra support needed in patterns or hooks?

Thanks, Shmeel
Re: Help with lra
On 10-Aug-16 08:41 PM, Vladimir N Makarov wrote:
> On 08/09/2016 12:33 AM, shmeel gutl wrote:
>> On 03-Aug-16 12:10 AM, Vladimir Makarov wrote:
>>> [...]
>>> NO_REGS in this case means memory and the generated RTL move insn
>>> finally should be a target load or store insn. It is hard to say w/o
>>> looking at the code but, probably, your move insn descriptions do
>>> not have memory constraints (or these constraints are quite
>>> specific).
>> Currently our memory constraints only match memory operands. I assume
>> that you are suggesting that pseudo registers should match memory
>> constraints. Is this true only for LRA, or would reload also benefit
>> from such a change? Would other passes gain by such a change? Is any
>> extra support needed in patterns or hooks?
> Move insn descriptions are quite specific. When you make a port it is
> better to have only one move insn for a given mode (although there are
> some tricks to avoid this). Therefore move insns have a lot of
> alternatives. That is what I meant. As for the memory constraint, you
> should not return true for a pseudo. Reload/LRA can figure out how to
> match a spilled pseudo with memory (but this constraint should be a
> define_memory_constraint; I saw mistakes when people used different
> forms of constraints for memory and had problems). Again, it is hard
> to say anything definite w/o seeing the code and the actual problem.

You hit it on the head with define_memory_constraint.
Reload didn't seem to need it but LRA does. Thank you, Shmeel
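For reference, a minimal sketch of the distinction discussed above (the constraint letter and predicate are illustrative, not from any real port): the two forms look almost identical in the machine description, but only the second tells LRA that it may satisfy the constraint by spilling a pseudo to a stack slot.

```lisp
;; Opaque to LRA: it cannot assume that reloading a pseudo into a
;; stack slot will make the operand match "Q".
(define_constraint "Q"
  "A memory operand."                          ; illustrative
  (match_test "memory_operand (op, mode)"))

;; LRA knows it may spill a pseudo to memory to satisfy "Q".
(define_memory_constraint "Q"
  "A memory operand."                          ; illustrative
  (match_test "memory_operand (op, mode)"))
```

This is a machine-description fragment, not compilable code; see the gccint constraints documentation for the exact semantics of define_memory_constraint.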
Re: Help with lra
On 10-Aug-16 08:41 PM, Vladimir N Makarov wrote:
> As for the memory constraint, you should not return true for a pseudo.
> Reload/LRA can figure out how to match a spilled pseudo with memory
> (but this constraint should be a define_memory_constraint; I saw
> mistakes when people used different forms of constraints for memory
> and had problems).

For its own reasons, xtensa returns true for a pseudo during reload for a memory-type constraint that shouldn't use the constant pool, but doesn't mark it as a memory constraint. What will that architecture lose because of that, and will it work with LRA?
negative latencies
Are there hooks in gcc to deal with negative latencies? In other words, an architecture that permits an instruction to use a result from an instruction that will be issued later. At first glance it seems that it will break a few things.
1) The definition of dependencies cannot come from the simple ordering of rtl.
2) The scheduling problem starts to look like "get off the train 3 stops before me".
3) The definition of live ranges needs to use actual instruction timing information, not just instruction sequencing.
The hooks in the scheduler seem to be enough to stop damage but not enough to take advantage of this "feature".

Thanks
Re: negative latencies
On 19-May-14 09:39 AM, Andrew Pinski wrote:
> On Sun, May 18, 2014 at 11:13 PM, shmeel gutl wrote:
>> Are there hooks in gcc to deal with negative latencies? In other
>> words, an architecture that permits an instruction to use a result
>> from an instruction that will be issued later.
> Do you mean bypasses? If so there is a bypass feature which you can use:
> https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html#index-data-bypass-3773
> Thanks, Andrew Pinski

Unfortunately, bypasses in the pipeline description are not enough. They only allow you to calculate the latency of true dependencies, and they are also forced to be zero or greater. The real question is how the scheduler and register allocator can deal with negative latencies.

Thanks, Shmeel
Re: negative latencies
On 19-May-14 01:02 PM, Ajit Kumar Agarwal wrote:
> Is it the case of code speculation where the negative latencies are
> used?

No. It is an exposed pipeline where instructions read registers during the required cycle. So if one instruction produces its result in the third pipeline stage and a second instruction reads the register in the sixth pipeline stage, the second instruction can read the result of the first instruction even if it is issued three cycles earlier.
Re: negative latencies
On 20-May-14 06:13 PM, Vladimir Makarov wrote:
> On 05/19/2014 02:13 AM, shmeel gutl wrote:
>> Are there hooks in gcc to deal with negative latencies? In other
>> words, an architecture that permits an instruction to use a result
>> from an instruction that will be issued later.
> Could you explain more on *an example* what you are trying to achieve
> with the negative latency. The scheduler is based on a critical path
> algorithm. Generally speaking, latency time can be negative for this
> algorithm. But I guess that is not what you are asking.

The architecture has an exposed pipeline where instructions read registers during the required cycle. So if one instruction produces its result in the third pipeline stage and a second instruction reads the register in the sixth pipeline stage, the second instruction can read the result of the first instruction even if it is issued three cycles earlier.

The problem that I see is that the haifa scheduler schedules one cycle at a time, in a forward order, by picking from a list of instructions that can be scheduled without delays. So, in the above example, if instruction one is scheduled during cycle 3, it can't schedule instruction two during cycle 0, 1, or 2 because its producer dependency (instruction one) hasn't been scheduled yet. It won't be able to schedule it until cycle 3. So I am asking if there is an existing mechanism to back schedule instruction two once instruction one is issued.

Thanks, Shmeel
Re: negative latencies
On 21-May-14 06:30 PM, Vladimir Makarov wrote:
> I am just curious what happens when you put insn2, insn1, and insn2
> uses a result of insn1 in 6 cycles and insn1 produces the result in 3
> cycles, but there are no ready functional units (e.g. arithmetic
> units) necessary for insn1 for 4 or more cycles. It is quite
> non-trivial to guarantee that everything will be okay in the general
> case if you put insn2 before insn1.

This is not a problem for this architecture. The units are fully pipelined and the only conflicts are in the first stage, during instruction issue. That is, the vliw must be legal. The gcc dfa handles this case fine.
Re: negative latencies
On 22-May-14 07:21 PM, Bernd Schmidt wrote:
> On 05/21/2014 05:30 PM, Vladimir Makarov wrote:
>> On 2014-05-20, 5:18 PM, shmeel gutl wrote:
>>> The problem that I see is that the haifa scheduler schedules one
>>> cycle at a time, in a forward order, by picking from a list of
>>> instructions that can be scheduled without delays. So, in the above
>>> example, if instruction one is scheduled during cycle 3, it can't
>>> schedule instruction two during cycle 0, 1, or 2 because its
>>> producer dependency (instruction one) hasn't been scheduled yet. It
>>> won't be able to schedule it until cycle 3. So I am asking if there
>>> is an existing mechanism to back schedule instruction two once
>>> instruction one is issued.
>> I see, thanks. There is no such mechanism in the current insn
>> scheduler.
> Well, the scheduler has support for an exposed pipeline that is used
> by the C6X port. Insns are split into multiple pieces which are forced
> to be scheduled at a fixed distance in time from each other, each
> piece describing the effects that occur at that point in time. This
> could probably be made to work for this target's requirements, but it
> might run quite slowly.
> Bernd

An exposed pipeline is not my problem; negative latency is my problem. I don't see negative latency for c6x, not in unit reservations and not in adjust cost. Did I miss something?

Shmeel
Re: negative latencies
On 23-May-14 01:59 PM, Bernd Schmidt wrote:
> On 05/23/2014 10:07 AM, shmeel gutl wrote:
>> Exposed pipeline is not my problem. Negative latency is my problem. I
>> don't see negative latency for c6x, not in unit reservations and not
>> in adjust cost. Did I miss something?
> You just need to model it differently. Rather than saying instruction
> A has a negative latency relative to instruction B, you need to
> describe that instruction B reads its inputs later than when it is
> actually issued. The mechanism used in the C6X backend is the
> scheduler's record_delay_slot_pair function. The scheduler would see
>   B is issued (*)
>   A is issued, executes and writes its outputs
>   B reads its inputs (*)
> The two insns marked as (*) would be such a delay pair. The first one
> would generate code; the second one exists only for the purposes of
> building the right scheduling dependencies.
> Bernd

Okay, I think that I have the idea. But I would still need to backtrack if the enabling instruction is not issued on time. I would also need to delay the dependent instruction if I can see in advance that the producer cannot be issued on time. And, as Vladimir pointed out, I need to watch out for various passes inserting unwanted instructions. Sounds like a big project.
Re: negative latencies
On 23-May-14 05:20 PM, Vladimir Makarov wrote:
> On 2014-05-23, 3:49 AM, shmeel gutl wrote:
>> On 21-May-14 06:30 PM, Vladimir Makarov wrote:
>>> [...]
>> This is not a problem for this architecture. The units are fully
>> pipelined and the only conflicts are in the first stage, during
>> instruction issue. That is, the vliw must be legal. The gcc dfa
>> handles this case fine.
> Another problem is that besides the insn scheduler there are a lot of
> optimizations which can insert some insns between the two insns after
> scheduling. In this case the result might not be ready for insn2. So
> you should at least exclude the 1st insn scheduling (before RA) and
> make the 2nd insn scheduling the very last pass. In general, a
> traditional approach is to do such things at the assembler level
> (e.g. as for older MIPS processors without hardware interlocks).

There is also IRA. I would need to tweak the definition of live ranges.
Re: Using associativity for optimization
It works fine for my test case. Thanks

Richard Biener wrote:
> On Tue, Dec 2, 2014 at 12:11 AM, shmeel gutl wrote:
>> While testing my implementation of passing arguments in registers, I
>> noticed that gcc 4.7 creates instruction dependencies when it doesn't
>> have to. Consider:
>>   int foo(int a1, int a2, int a3, int a4)
>>   {
>>     return a1|a2|a3|a4;
>>   }
>> gcc, even with -O2, generated code that was equivalent to
>>   temp1 = a1 | a2;
>>   temp2 = temp1 | a3;
>>   temp3 = temp2 | a4;
>>   return temp3;
>> This code must be executed serially. Could I create patterns, or
>> enable optimizations, that would cause the compiler to generate
>>   temp1 = a1 | a2;
>>   temp2 = a3 | a4;
>>   temp3 = temp1 | temp2;
>> thereby allowing the scheduler to compute temp1 and temp2 in parallel?
> You can tune it with --param tree-reassoc-width=N, not sure if that
> was implemented for 4.7 already.
> Richard.
Re: Using associativity for optimization
On 02-Dec-14 12:23 PM, Richard Biener wrote:
> On Tue, Dec 2, 2014 at 12:11 AM, shmeel gutl wrote:
>> While testing my implementation of passing arguments in registers, I
>> noticed that gcc 4.7 creates instruction dependencies when it doesn't
>> have to. Consider:
>>   int foo(int a1, int a2, int a3, int a4)
>>   {
>>     return a1|a2|a3|a4;
>>   }
>> gcc, even with -O2, generated code that was equivalent to
>>   temp1 = a1 | a2;
>>   temp2 = temp1 | a3;
>>   temp3 = temp2 | a4;
>>   return temp3;
>> This code must be executed serially. Could I create patterns, or
>> enable optimizations, that would cause the compiler to generate
>>   temp1 = a1 | a2;
>>   temp2 = a3 | a4;
>>   temp3 = temp1 | temp2;
>> thereby allowing the scheduler to compute temp1 and temp2 in parallel?
> You can tune it with --param tree-reassoc-width=N, not sure if that
> was implemented for 4.7 already.
> Richard.

Works fine for this test case. Thanks
Re: A Question About LRA/reload
On 09-Dec-14 07:56 PM, Jeff Law wrote:
> On 12/09/14 10:10, Vladimir Makarov wrote:
>> generate the correct code in many cases even for x86. Jeff Law tried
>> IRA coloring reusage too for reload but the whole RA became slower
>> (although he achieved performance improvements on x86).
> Right. After IRA was complete, I'd walk over the unallocated allocnos
> and split their ranges at EBB boundaries. That created new allocnos
> with a smaller conflict set and reduced the conflict set for the
> original unallocated allocnos. After I'd done that splitting for all
> the EBBs, I called back into ira_reassign_pseudos to try to assign the
> original unallocated allocnos as well as the new allocnos. To get good
> results, much of IRA's cost analysis had to be redone from scratch.
> And from a compile-time standpoint, that's a killer.
> The other approach I was looking at was a backwards walk through each
> block. When I found an insn with an unallocated pseudo, I'd trigger
> one of various range splitting techniques to try and free up a hard
> register. Then again I'd call into ira_reassign_pseudos to try the
> allocations again. This got even better results, but was obviously
> even more compile-time expensive.
> I don't think much, if any, of that work is relevant given the current
> structure and effectiveness of LRA.
> jeff

Are any of these versions available? I wouldn't mind a 50% penalty in compile time if it gave me a 5% improvement in the generated code.
rnreg and vliw
It seems that in gcc 4.7, the rnreg pass for renaming registers after reload is not vliw aware. In particular, I saw it reassign a register that is in use in the same vliw. To be more concrete, I saw it change the following pseudo code

  DI:a30 = v0
  SI:a14 = -a14

to

  DI:a30 = v0
  SI:a31 = -a14

since a31 was never referenced again. This won't work inside a vliw since it causes two instructions to set a31. Even though rnreg runs before sched2, it runs after software pipelining, which creates its own vliws. Is there any easy fix for this?

Thanks, Shmeel
Re: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
On 16-May-15 03:49 PM, Ajit Kumar Agarwal wrote:
> if (loop && loop->latch == bb || loop->header == bb)

Please add parentheses to the various occurrences of this code fragment. Better if the precedence is explicit.
Question about find modifiable mems
find_modifiable_mems was introduced to gcc 4.8 in September 2012. Is there any documentation as to how it is supposed to help the Haifa scheduler? In my private port of gcc it makes the following type of transformation, from

  a = *(b+20)
  b += 30

to

  b += 30
  a = *(b-10)

Although this is functionally correct, it has changed an ANTI_DEP into a TRUE_DEP and thus introduced stalls. If it went the other way, that would be good. Any pointers?

Thanks, Shmeel
Memory dependence
In the architecture that I am using, there is a big pipeline penalty for read after write to the same memory location. Is it possible to tell the difference between a possible memory conflict and a definite memory conflict?
Testing timing tables
The latency calculations in my backend are very complicated. Is there any automated way to test them?
conflict between scheduler and register allocator
I am having trouble meeting the constraints of the scheduler and the register allocator for my back end. The relevant features are:
1) VLIW - up to 4 instructions can be issued each cycle.
2) If a vliw bundle has both a set and a use of a register, the use sees the old value.
3) A call instruction pushes r30 and r31 to the stack, making them natural candidates for callee-saved registers.
The problem is that the scheduler might include an instruction that sets r30 in the same vliw as a call. This would result in a stale value being saved to the stack. (Note: the call instruction is not truly dependent on r30, just that r30 can't be set in the vliw that contains the call.) On the other hand, if I declare that the call uses r30, the register allocator will refuse to use r30 since it thinks that the register is live. I know that I can use a hook to fix up the first problem by breaking a single vliw into two bundles, but that has a performance penalty. Is there a way to tell the scheduler to avoid issuing an instruction that sets r30 or r31 in the same bundle that contains a call instruction?

Thank you for any pointers.
Re: conflict between scheduler and register allocator
On 09-Aug-13 07:35 PM, Vladimir Makarov wrote:
> On 13-08-09 7:25 AM, shmeel gutl wrote:
>> [...]
>> Is there a way to tell the scheduler to avoid issuing an instruction
>> that sets r30 or r31 in the same bundle that contains a call
>> instruction?
> You should look at haifa-sched.c::schedule_block. There are a lot of
> hooks called at different stages of the list scheduling algorithm.
> Depending on the algorithm stage at which you want to do this, you can
> use a specific hook. I'd pay attention to targetm.sched.reorder[2].
> You can also look at the hooks implemented for IA64 as it is the most
> widely used VLIW architecture for now, but the implementations of some
> IA64 hooks are pretty big.

Thank you. reorder2 does indeed let me block conflicting insns from being issued during the same cycle.
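As a pseudocode sketch (not compilable as-is) of how a TARGET_SCHED_REORDER2 implementation might enforce the restriction above: the hook sees the ready list after each group of insns is issued in the current cycle, so it can hide the insns that set r30/r31 once a call has gone out in that cycle. The hook signature follows the gccint documentation; every helper below is hypothetical and would have to be written for the port, and the exact ready-list ordering and return-value semantics should be taken from the gccint scheduler-hook docs rather than from this sketch.

```c
/* Pseudocode sketch -- helper functions are hypothetical.
   Called after insns have been issued in the current cycle; keep
   insns that set r30/r31 out of contention when a call has already
   been issued this cycle, so they cannot land in the call's bundle. */
static int
my_sched_reorder2 (FILE *dump, int verbose, rtx *ready,
                   int *n_readyp, int clock)
{
  if (call_issued_this_cycle ())                 /* hypothetical */
    {
      for (int i = *n_readyp - 1; i >= 0; i--)
        if (insn_sets_r30_or_r31 (ready[i]))     /* hypothetical */
          demote_in_ready_list (ready, n_readyp, i); /* hypothetical:
                move it where it cannot be picked this cycle */
    }
  return my_issue_rate ();                       /* hypothetical */
}
```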
avoiding extra .loc directives
For my VLIW toolchain, I am not allowed to output .loc directives in the middle of a VLIW bundle. Following the lead of the bfin backend, I scan bundles during the machine reorg pass and ensure that all of the insns of a bundle have the same location. This solves most of the problems, but final.c will also output the .loc directive after encountering a NOTE_INSN_PROLOGUE_END. There are two obvious solutions to this: 1) eliminate the note in machine reorg. 2) eliminate the line "force_source_line = true;" from the appropriate line in final.c. Would either of these alternatives cause a problem? Is there a better way to avoid having .loc directives inside bundles?
Re: Dependency confusion in sched-deps
On 05-Dec-13 02:39 AM, Maxim Kuvyrkov wrote:
> Dependency type plays a role for estimating costs and latencies
> between instructions (which affects performance), but using wrong or
> imprecise dependency type does not affect correctness.

On multi-issue architectures it does make a difference. An anti dependence permits the two instructions to be issued during the same cycle, whereas a true dependency or an output dependency would forbid this. Or am I misinterpreting your comment?
Re: Dependency confusion in sched-deps
On 06-Dec-13 01:34 AM, Maxim Kuvyrkov wrote:
> On 6/12/2013, at 8:44 am, shmeel gutl wrote:
>> On 05-Dec-13 02:39 AM, Maxim Kuvyrkov wrote:
>>> Dependency type plays a role for estimating costs and latencies
>>> between instructions (which affects performance), but using wrong or
>>> imprecise dependency type does not affect correctness.
>> On multi-issue architectures it does make a difference. An anti
>> dependence permits the two instructions to be issued during the same
>> cycle, whereas a true dependency or an output dependency would forbid
>> this. Or am I misinterpreting your comment?
> On VLIW-flavoured machines without resource conflict checking -- "yes",
> it is critical not to use an anti dependency where an output or true
> dependency exists. This is the case, though, only because these
> machines do not follow sequential semantics for instruction execution
> (i.e., effects from previous instructions are not necessarily observed
> by subsequent instructions on the same/close cycles). On machines with
> internal resource conflict checking, having a wrong type on the
> dependency should not cause wrong behavior, but "only" suboptimal
> performance.
> Thank you,
> -- Maxim Kuvyrkov
> www.kugelworks.com

Earlier in the thread you wrote that output dependency is the right type (write after write), anti dependency is write after read, and true dependency is read after write. Should the code be changed to accommodate vliw machines? It has been there since the module was originally checked into trunk.
exposed pipeline
For the 4.7 branch I only saw one architecture using exposed pipeline. Is there any documentation on the quality of exposed pipeline support? Does the back-end need to do anything special to deal with jumps and returns from calls? Thanks Shmeel
Re: Request for discussion: Rewrite of inline assembler docs
On 28-Mar-14 01:46 PM, Hannes Frederic Sowa wrote:
> On Fri, Mar 28, 2014 at 09:41:41AM +, Andrew Haley wrote:
>> Ok, I see the problem. Maybe something like this by avoiding the
>> term? "Using this clobber causes the compiler to flush all (modified)
>> registers being used to store values which gcc decided to originally
>> allocate in memory before executing the @code{asm} statement." What
>> is true here is that all registers used to cache variables that are
>> reachable from pointers in the program are flushed. Anything that is
>> statically allocated is reachable, as is anything dynamically
>> allocated by malloc; auto variables are not reachable unless their
>> address is taken.
> One would have to go into detail of various optimizations which could
> remove the address taking, e.g. IMHO the last sentence of the
> paragraph already deals with this.
> Bye, Hannes

By this wording ("gcc decided"), the consequences are unpredictable. You should probably mention the registers which will definitely be flushed and thereby limit the ambiguity.