The subreg question

2005-04-17 Thread Ling-hua Tseng
I have a chip that was developed by another lab.
It is a VLIW architecture containing 2 RISCs and 8 DSPs.
All registers are 32 bits wide.
The RISC has a special instruction called `movi' (move immediate).
Its syntax and semantics are:
movil   r1, #   (moves # to the LSB 16 bits, leaving the MSB 16 bits unchanged)
movils  r1, #   (moves # to the LSB 16 bits, and sets the MSB 16 bits to zero)
movim   r1, #   (moves # to the MSB 16 bits, leaving the LSB 16 bits unchanged)
movims  r1, #   (moves # to the MSB 16 bits, and sets the LSB 16 bits to zero)
Clearly, `movil' and `movim' access only one 16-bit half of the
32-bit register.
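
To make the four variants concrete, here is a small C model of what each one does to a 32-bit register. This is only a sketch based on the descriptions above; the function names mirror the mnemonics, and passing the old register value as a parameter is an assumption made for illustration.

```c
#include <stdint.h>

/* Hypothetical C models of the four `movi' variants.
   `reg' is the old 32-bit register value, `imm' the 16-bit immediate. */

/* movil: write bits [15..0], keep bits [31..16]. */
uint32_t movil(uint32_t reg, uint16_t imm)
{
    return (reg & 0xffff0000u) | imm;
}

/* movils: write bits [15..0], clear bits [31..16]. */
uint32_t movils(uint32_t reg, uint16_t imm)
{
    (void)reg;                      /* old value is discarded */
    return imm;
}

/* movim: write bits [31..16], keep bits [15..0]. */
uint32_t movim(uint32_t reg, uint16_t imm)
{
    return (reg & 0x0000ffffu) | ((uint32_t)imm << 16);
}

/* movims: write bits [31..16], clear bits [15..0]. */
uint32_t movims(uint32_t reg, uint16_t imm)
{
    (void)reg;                      /* old value is discarded */
    return (uint32_t)imm << 16;
}
```

Only `movil' and `movim' depend on the old register value, which is why they are the two that are hard to express as a plain SImode set.
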
How can I represent these operations in RTL?
(I need to implement the standard pattern `movsi' in the machine description,
and I tried to write a define_split that generates a 32-bit immediate value.)
This is the define_split in the machine description:
(define_split
 [(set (match_operand:SI 0 "register_operand" "")   
   (match_operand:SI 1 "immediate_operand" ""))]
 "!valid_10bit_immediate(INTVAL(operands[1]))"
 
{
   
})

First, I tried to write the  in the following form:
 [(set (strict_low_part (subreg:HI (match_dup 0) 0)) (match_dup 2))
  (set (subreg:HI (match_dup 0) 2) (match_dup 3))]   <--- (*1)
And I wrote the  in the following form:
   operands[2] = GEN_INT(INTVAL(operands[1]) & 0x);
   operands[3] = GEN_INT((INTVAL(operands[2]) & 0x) >> 16);  
... (*2)
Expression (*1) calls for an RTX like `(strict_high_part x)', but no such RTX
exists.
So I wrote it without the `strict' semantics.
But gcc does not accept that subreg RTX:
emit-rtl.c:validate_subreg() returns false at line 692 (gcc 4.0
20050416).
(BTW, gen_highpart() also fails when I try to generate the MSB
16-bit operand.)
After some study, I learned that a `subreg' RTX can only represent the
LSB n bits of a register.
As a temporary workaround, I wrote the following form instead:
 [(set (match_dup 0)
   (ior:SI (match_dup 0) (match_dup 2)))
  (set (match_dup 0)
   (ior:SI (match_dup 0) (match_dup 3)))]
{
   operands[2] = GEN_INT(INTVAL(operands[1]) & 0x);
   operands[3] = GEN_INT(INTVAL(operands[1]) & 0x);
})
Unfortunately, I will have to face this problem eventually.
The DSP function units provide `packing' and SIMD operations.
We can pack four 8-bit integer values into one 32-bit register
and then perform an arithmetic operation on all of them with one instruction.
So I need to treat one 32-bit register as four 8-bit sub-registers.
But I cannot represent them as
   (subreg:QI (reg:SI xx) 0), (subreg:QI (reg:SI xx) 1), (subreg:QI (reg:SI xx) 2),
   and (subreg:QI (reg:SI xx) 3).
Could anyone tell me how to represent these sub-registers in RTL?
Thanks a lot.
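
For reference, the behavior wanted from the four 8-bit sub-registers can be written in plain C as lane extract/insert on a 32-bit word. This is a sketch only: the helper names are hypothetical, and numbering lane 0 as the least significant byte is an assumption, not something the hardware description above states.

```c
#include <stdint.h>

/* Read one 8-bit lane of a 32-bit register (lane 0 = least significant
   byte, by assumption). */
uint8_t lane_get(uint32_t reg, unsigned lane)
{
    return (uint8_t)(reg >> (8u * (lane & 3u)));
}

/* Replace one 8-bit lane and leave the other three lanes unchanged. */
uint32_t lane_set(uint32_t reg, unsigned lane, uint8_t value)
{
    unsigned shift = 8u * (lane & 3u);
    uint32_t mask = UINT32_C(0xff) << shift;

    return (reg & ~mask) | ((uint32_t)value << shift);
}
```

`lane_set' is the operation that a subreg cannot express for lanes 1 to 3, since every lane other than the lowest needs the surrounding bits preserved.
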


Re: The subreg question

2005-04-19 Thread Ling-hua Tseng
James E Wilson wrote:
> Ling-hua Tseng wrote:
> > It's obvious that `movil' and `movim' are only access the partial
> > 16-bit of the 32-bit register. How can I use RTL expression to
> > represent the operations?
>
> As you noticed, within a register, subreg can only be used for low
> parts.  You can't ask for the high part of a single register.  If you
> have an item that spans multiple registers, e.g. a 64-bit value that is
> contained in a register pair, then you can ask for the SImode highpart
> of a DImode reg and get valid RTL.  This works because the high part is
> an entire register.  This isn't useful to you.
>
> Otherwise, you can access subparts via bitfield insert/extract
> operations, or logicals operations (ior/and), though this is likely to
> be tedious, and may confuse optimizers.
>
> There are high/lo_sum RTL operators that may be useful to you.  You can use
>  (set (reg:SI) (high: ...))
>  (set (reg:SI) (lo_sum (reg:SI) (...)))
> where the first pattern corresponds to movims, and the second one to
> movil.  You could just as well use ior instead of lo_sum for the second
> pattern, this is probably better as movil does not do an add.
>
> You may want to emit normal rtl for an SImode move, and then split it
> into its two 16-bit parts after reload.  This will avoid confusing RTL
> optimizers before reload.
>
> We have vector modes which might be useful to you.  If you say a
> register is holding a V4QI mode value, then there are natural ways to
> get at the individual elements of the vector via vector operations.
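
The two-step sequence suggested above (load the high half first, then merge the low half with ior) can be modeled in C roughly as follows. This is a sketch under the assumption that the first step clears the low 16 bits, as movims does; the function names are made up for illustration.

```c
#include <stdint.h>

/* Value left in the register by the first step: upper 16 bits of the
   immediate, lower 16 bits cleared (the movims behavior). */
uint32_t high_part(uint32_t imm)
{
    return imm & 0xffff0000u;
}

/* The 16 bits merged in by the second step (the movil behavior). */
uint32_t low_part(uint32_t imm)
{
    return imm & 0x0000ffffu;
}

/* Build a full 32-bit immediate in two steps. */
uint32_t load_imm32(uint32_t imm)
{
    uint32_t reg = high_part(imm);  /* step 1: set high half, clear low half */

    reg |= low_part(imm);           /* step 2: ior is exact because the low
                                       half is known to be zero here */
    return reg;
}
```

The ior in step 2 only matches movil when the low half is already zero, so the order of the two steps matters in this model.
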
I read the descriptions of (high:m exp) and (lo_sum:m x y) in the GCC
internals manual (Sections 10.7 and 10.9).
The last line of each description confused me, because it says "m should be
Pmode".
Is that a strict rule?
The RTX "(set (reg:SI xx) (high:m yy))" seems to allow m to be any integer mode.
Couldn't that lead to undefined behavior in the back-end passes?


Re: The subreg question

2005-04-23 Thread Ling-hua Tseng
I implemented my 4 `movi' cases in the following forms.
The result of each should be a 32-bit integer, because `movi' exists to
generate SImode immediates.
Case 4 is special: it is also used to generate HImode and QImode immediates.
Could you help me confirm them? (RTX semantics and the use of `HI'
mode)
   1. set the MSB 16 bits and clear the LSB 16 bits to zero
   [(set (match_operand:SI 0 "register_operand" "=r")
         (high:SI (match_operand:SI 1 "immediate_operand" "i")))]
   (Does (high:SI ...) have the semantics of clearing the LSB 16 bits?)
   2. set the MSB 16 bits and keep the LSB 16 bits unchanged
   [(parallel
      [(set (high:SI (match_operand:SI 0 "register_operand" "=r"))
            (high:SI (match_operand:SI 1 "immediate_operand" "i")))
       (set (strict_low_part (subreg:HI (match_dup 0) 0))
            (match_operand:HI 2 "immediate_operand" "i"))])]
   (I know this is incorrect if (high:SI ...) changes the LSB
   16 bits.)
   3. set the LSB 16 bits and clear the MSB 16 bits to zero
   [(set (match_operand:SI 0 "register_operand" "=r")
         (match_operand:HI 1 "immediate_operand" "i"))]
   4. set the LSB 16 bits and keep the MSB 16 bits unchanged
   [(set (strict_low_part (subreg:HI (match_operand:SI 0 "register_operand" "=r") 0))
         (match_operand:HI 1 "immediate_operand" "i"))]
   (Would this be better than using (lo_sum:SI ...)?)
Thanks.


Can gcc select vector mode instruction patterns itself?

2005-05-10 Thread Ling-hua Tseng
I'm porting gcc to a new VLIW architecture.
There are 11 function units in the chip, and 4 of them are DSPs.
I am now designing the SIMD instruction patterns, and I would prefer not to use
built-in functions to support them.
If I write instruction patterns involving V4QI
packing/unpacking/arithmetic operations,
can gcc select them automatically?
(Of course, I have not written any define_expand/define_split that generates V4QI
operations myself.)
For example:
1. my packing instruction patterns ('D' means DSP register):
   (define_insn "*packqi_from_mem"
     [(set (vec_select:QI (match_operand:V4QI 0 "register_operand" "D")
                          (parallel [(match_operand:SI 2 "const_int_operand" "i")]))
           (match_operand:QI 1 "memory_operand" "m"))]
     ""
     "ldub.b%2\\t%0, %1"
   )
2. my V4QI + V4QI SIMD operation
   (define_insn "*SIMD_addqi3"
     [(set (match_operand:V4QI 0 "register_operand" "=D")
           (plus:V4QI (match_operand:V4QI 1 "register_operand" "%D")
                      (match_operand:V4QI 2 "register_operand" "D")))]
     ""
     "add.ub\\t%0, %1, %2"
   )
Can gcc, on its own, load 4 QImode values into a register using the pattern
"*packqi_from_mem"
and then perform the V4QI + V4QI SIMD add using the pattern "*SIMD_addqi3"?


Why the V4QImode vector operations are expanded into many SImode at "oplower" pass?

2005-05-18 Thread Ling-hua Tseng
I looked at the ARM port and saw that ARM supports V8QI SIMD operations.
I'm porting to another platform that also supports SIMD
operations.
I am now implementing the V4QI SIMD add operation
(with gcc version 4.0.1 20050514).
I did the following:
   1. added VECTOR_MODES(INT, 4); to my -modes.def
   2. implemented the "movv4qi" and "addv4qi3" expanders and the
   corresponding
   instruction patterns in the machine description file
   3. made the hook "TARGET_VECTOR_MODE_SUPPORTED_P" always return true when
   the mode is V4QImode (written in the .c)
Then I wrote the following test program:
=[top]===
typedef char v4qi __attribute__((vector_size(4)));
v4qi foo();
v4qi a = { 0x11, 0x22, 0x33, 0x44 };
int main()
{
   volatile v4qi x;
   x = foo();
   return 0;
}
v4qi foo()
{
   v4qi x = (v4qi)0xaabbccdd, y = a, z;
   z = x + y + a;
   return z;
}
=[end]===
It didn't work.
I passed the option '-fdump-tree-all' to gcc and got the following contents in 
".t13.cfg":
=[top]===
;; Function foo (foo)
Merging blocks 0 and 1
foo ()
{
 v4qi z;
 v4qi y;
 v4qi x;
 v4qi D.1238;
 v4qi a.0;
 v4qi D.1236;
 # BLOCK 0
 # PRED: ENTRY (fallthru)
 x = (vector char) 0aabbccdd;
 y = a;
 D.1236 = x + y;
 a.0 = a;
 z = D.1236 + a.0;
 D.1238 = z;
 return D.1238;
 # SUCC: EXIT
}
=[end]===
(I omitted the 'main' function because only the function 'foo' matters
here.)
In the dump file of the next optimization pass, ".t14.oplower", I got:
=[top]===
;; Function foo (foo)
foo ()
{
 unsigned int D.1262;
 unsigned int D.1261;
 unsigned int D.1260;
 unsigned int D.1259;
 unsigned int D.1258;
 unsigned int D.1257;
 unsigned int D.1256;
 unsigned int D.1255;
 unsigned int D.1254;
 unsigned int D.1253;
 unsigned int D.1252;
 unsigned int D.1251;
 unsigned int D.1250;
 unsigned int D.1249;
 unsigned int D.1248;
 unsigned int D.1247;
 v4qi z;
 v4qi y;
 v4qi x;
 v4qi D.1238;
 v4qi a.0;
 v4qi D.1236;
:
 x = (vector char) 0aabbccdd;
 y = a;
 D.1247 = VIEW_CONVERT_EXPR(x);
 D.1248 = VIEW_CONVERT_EXPR(y);
 D.1249 = D.1247 ^ D.1248;
 D.1250 = D.1248 & 2139062143;
 D.1251 = D.1247 & 2139062143;
 D.1252 = D.1249 & 080808080;
 D.1253 = D.1251 + D.1250;
 D.1254 = D.1253 ^ D.1252;
 D.1236 = VIEW_CONVERT_EXPR(D.1254);
 a.0 = a;
 D.1255 = VIEW_CONVERT_EXPR(D.1236);
 D.1256 = VIEW_CONVERT_EXPR(a.0);
 D.1257 = D.1255 ^ D.1256;
 D.1258 = D.1256 & 2139062143;
 D.1259 = D.1255 & 2139062143;
 D.1260 = D.1257 & 080808080;
 D.1261 = D.1259 + D.1258;
 D.1262 = D.1261 ^ D.1260;
 z = VIEW_CONVERT_EXPR(D.1262);
 D.1238 = z;
 return D.1238;
}
=[end]===
The vector operations are expanded into many XOR, AND, and ADD operations,
so the RTL expansion pass never generates any vector operations.
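
For readers wondering what the lowered code computes: the XOR/AND/ADD sequence is the usual word-parallel trick for adding four bytes inside one 32-bit integer without letting carries cross byte boundaries. The constant 2139062143 in the dump is 0x7f7f7f7f; the other mask is presumably 0x80808080. A minimal C sketch of the same computation follows (the function names are mine, not GCC's):

```c
#include <stdint.h>

/* Add four packed 8-bit lanes using only 32-bit scalar operations,
   mirroring the sequence in the .t14.oplower dump. */
uint32_t add_v4qi_swar(uint32_t x, uint32_t y)
{
    /* Add the low 7 bits of every lane; a carry can reach bit 7 of its
       own lane but can never spill into the next lane. */
    uint32_t low7 = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);

    /* Sum of the top bits of every lane, without carry out. */
    uint32_t top = (x ^ y) & 0x80808080u;

    return low7 ^ top;
}

/* Straightforward per-lane reference, for comparison only. */
uint32_t add_v4qi_ref(uint32_t x, uint32_t y)
{
    uint32_t r = 0;

    for (unsigned i = 0; i < 4; i++) {
        uint8_t a = (uint8_t)(x >> (8u * i));
        uint8_t b = (uint8_t)(y >> (8u * i));

        r |= (uint32_t)(uint8_t)(a + b) << (8u * i);
    }
    return r;
}
```

Both functions give the same result for every input, which is why the compiler can use the scalar form when it believes no vector add instruction is available.
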
I changed the program to a 'V8QI' version and compiled it with ARM's iWMMXt
port.
The problem did not appear there.
So I guess something is misconfigured in my port, but I can't find it
(maybe I missed some target machine hooks or macros).
Could anyone help me solve this problem?
Thanks a lot.


Re: Why the V4QImode vector operations are expanded into many SImode at "oplower" pass?

2005-05-18 Thread Ling-hua Tseng
On Wed, 18 May 2005 17:25:35 +0200, Paolo Bonzini wrote
> > Now I'm implementing the V4QI SIMD add operation.
> 
> Maybe there is no register that can store a V4QI.
> 
> Paolo
Doesn't register allocation run during the RTL optimization passes?
How could it affect a tree-level optimization pass?

BTW,
I tried changing the constraints to 'r' (general registers) for
the "movv4qi" and "addv4qi" insn patterns,
but I got the same problem.

Thanks.


Re: Why the V4QImode vector operations are expanded into many SImode at "oplower" pass?

2005-05-18 Thread Ling-hua Tseng
On 18 May 2005 12:54:03 -0400, Ian Lance Taylor wrote
> "Ling-hua Tseng" <[EMAIL PROTECTED]> writes:
> 
> > I have tried to adjust the constraints to 'r' (general registers) for 
> > the "movv4qi" and "addv4qi" insn patterns,
> > but I got the same problem.
> 
> What about HARD_REGNO_MODE_OK?
> 
> Ian
If the register number is less than FIRST_PSEUDO_REGISTER,
it will always return 1.



Re: Why the V4QImode vector operations are expanded into many SImode at "oplower" pass?

2005-05-18 Thread Ling-hua Tseng
On Wed, 18 May 2005 12:19:47 -0700, Richard Henderson wrote
> On Wed, May 18, 2005 at 11:10:42PM +0800, Ling-hua Tseng wrote:
> > So I guess that there are some miss-configured in my ports, but I can't 
> > find it.
> 
> Put a breakpoint at tree-complex.c line 962.  Examine the conditions
> leading up to
> 
>   if ((GET_MODE_CLASS (compute_mode) == MODE_VECTOR_INT
>        || GET_MODE_CLASS (compute_mode) == MODE_VECTOR_FLOAT)
>       && op != NULL
>       && op->handlers[compute_mode].insn_code != CODE_FOR_nothing)
>     return;
> 
> to find out why the return isn't taken.  There aren't really very
> many options.
> 
> The one that jumps first to my mind is that the "addv4qi3" 
> instruction pattern doesn't actually exist because you have a typo 
> in the name.
> 
> r~
Something very strange happened.
I put the breakpoint there (in my tree-complex.c, that is line 904)
and found that the first part of the condition is always false.
That is because compute_mode is SImode. (I have not modified any source code.)

To confirm this, I added this line before the if statement:
printf("%s\n", GET_MODE_NAME(compute_mode));
So that part of the program looks like this:
===[top]
  if (compute_type == type)
    {
      printf("%s\n", GET_MODE_NAME(compute_mode));
      if ((GET_MODE_CLASS (compute_mode) == MODE_VECTOR_INT
           || GET_MODE_CLASS (compute_mode) == MODE_VECTOR_FLOAT)
          && op != NULL
          && op->handlers[compute_mode].insn_code != CODE_FOR_nothing)
        return;
      else
        {
          /* There is no operation in hardware, so fall back to scalars.  */
          compute_type = TREE_TYPE (type);
          compute_mode = TYPE_MODE (compute_type);
        }
    }
===[end]

Then I rebuilt gcc and recompiled my test program...
It printed 4 lines of "SI".
(In the ARM iWMMXt V8QI test, it printed "V8QI".)

I set the breakpoint again in GDB
and typed "print (((rhs)->common.type)->common.code)".
I got '$1 = VECTOR_TYPE', and then '$2 = VECTOR_TYPE', '$3 = VECTOR_TYPE',
and '$4 = VECTOR_TYPE' in the next 3 iterations.

I'm confused.
Are there any target machine macros or hooks that would give a VECTOR_TYPE tree
node SImode?
I checked the .t13.cfg again and found no difference from what I
posted earlier.


Re: Why the V4QImode vector operations are expanded into many SImode at "oplower" pass?

2005-05-18 Thread Ling-hua Tseng
On Wed, 18 May 2005 14:17:59 -0700, Richard Henderson wrote
> On Thu, May 19, 2005 at 04:58:32AM +0800, Ling-hua Tseng wrote:
> > I got 4 lines "SI".
> > (In the ARM's iWMMXt V8QI testing, I got the message: "V8QI")
> 
> Then you need to debug your targetm.vector_mode_supported_p.
> 
> Starting around stor-layout.c line 1609:
> 
>     for (; mode != VOIDmode ; mode = GET_MODE_WIDER_MODE (mode))
>       if (GET_MODE_NUNITS (mode) == nunits
>           && GET_MODE_INNER (mode) == innermode
>           && targetm.vector_mode_supported_p (mode))
>         break;
>
>     /* For integers, try mapping it to a same-sized scalar mode.  */
>     if (mode == VOIDmode
>         && GET_MODE_CLASS (innermode) == MODE_INT)
>       mode = mode_for_size (nunits * GET_MODE_BITSIZE (innermode),
>                             MODE_INT, 0);
> 
> The compiler has passed through the first loop without finding
> a supported vector mode that matches nunits=4 && inner=QImode.
> 
> r~
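
The effect of the quoted loop can be imitated with a toy model: look for a supported vector mode with the right element count and element size, and fall back to a scalar integer mode of the same total size when the support hook rejects every candidate. Everything below (the mode table, the string names, the hook type) is a simplified stand-in made up for illustration, not GCC's real data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a vector machine mode. */
struct vec_mode {
    const char *name;
    unsigned nunits;       /* number of elements */
    unsigned inner_bits;   /* size of one element in bits */
};

static const struct vec_mode vec_modes[] = {
    { "V2HI", 2, 16 }, { "V4QI", 4, 8 }, { "V8QI", 8, 8 },
};

/* Stand-in for targetm.vector_mode_supported_p. */
typedef bool (*supported_fn)(const char *mode_name);

const char *pick_mode(unsigned nunits, unsigned inner_bits, supported_fn ok)
{
    /* First try: a supported vector mode with matching shape. */
    for (size_t i = 0; i < sizeof vec_modes / sizeof vec_modes[0]; i++)
        if (vec_modes[i].nunits == nunits
            && vec_modes[i].inner_bits == inner_bits
            && ok(vec_modes[i].name))
            return vec_modes[i].name;

    /* Fallback: a scalar integer mode of the same total size. */
    switch (nunits * inner_bits) {
    case 8:  return "QI";
    case 16: return "HI";
    case 32: return "SI";
    case 64: return "DI";
    default: return "BLK";
    }
}

/* A hook that rejects everything, like hook_bool_mode_false. */
bool hook_always_false(const char *mode_name)
{
    (void)mode_name;
    return false;
}

/* A hook that accepts V4QI only. */
bool hook_accepts_v4qi(const char *mode_name)
{
    return strcmp(mode_name, "V4QI") == 0;
}
```

With the rejecting hook, a 4 x 8-bit vector type ends up as "SI" in this model, which matches the four "SI" lines printed above.
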
targetm.vector_mode_supported_p points to the general
hook "hook_bool_mode_false".
But I have already put the following lines in my .c:
===[top]
...
#include "target-def.h"
...
static bool unicore_vector_mode_supported_p(enum machine_mode mode);
...
struct gcc_target targetm = TARGET_INITIALIZER;
...
#undef  TARGET_VECTOR_MODE_SUPPORTED_P
#define TARGET_VECTOR_MODE_SUPPORTED_P  unicore_vector_mode_supported_p
...
static bool unicore_vector_mode_supported_p(enum machine_mode mode)
{
    switch(mode) {
    case V4QImode:
    case V2HImode:
        return true;
    default:
        return false;
    }
}
...
===[end]

Isn't that enough to make targetm.vector_mode_supported_p point to
my unicore_vector_mode_supported_p()?



Re: Why the V4QImode vector operations are expanded into many SImode at "oplower" pass?

2005-05-18 Thread Ling-hua Tseng
On Wed, 18 May 2005 15:56:27 -0700, Richard Henderson wrote
> On Thu, May 19, 2005 at 06:02:39AM +0800, Ling-hua Tseng wrote:
> > struct gcc_target targetm = TARGET_INITIALIZER;
> > ...
> > #undef  TARGET_VECTOR_MODE_SUPPORTED_P
> > #define TARGET_VECTOR_MODE_SUPPORTED_P  unicore_vector_mode_supported_p
> 
> This is your bug.  The TARGET_INITIALIZER needs to come last.
> 
> r~
I'm sorry for making such a foolish mistake.
I'll always keep this in mind.
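
For anyone hitting the same problem: the ordering matters because the C preprocessor expands the initializer macro at the point where it is used, with whatever hook definitions are in effect at that point. The miniature below reproduces the effect with made-up names (DEMO_*, struct demo_target); it is a sketch of the mechanism, not the real target-def.h.

```c
#include <stdbool.h>

/* Miniature stand-in for struct gcc_target. */
struct demo_target {
    bool (*vector_mode_supported_p)(int mode);
};

bool demo_default_false(int mode) { (void)mode; return false; }
bool demo_my_hook(int mode)       { return mode == 4; }

/* Defaults, as a header like target-def.h would provide them. */
#define DEMO_VECTOR_MODE_SUPPORTED_P demo_default_false
#define DEMO_INITIALIZER { DEMO_VECTOR_MODE_SUPPORTED_P }

/* WRONG ORDER: the initializer is expanded here, while the hook macro
   still names the default, so the override below comes too late. */
struct demo_target too_early = DEMO_INITIALIZER;

#undef  DEMO_VECTOR_MODE_SUPPORTED_P
#define DEMO_VECTOR_MODE_SUPPORTED_P demo_my_hook

/* RIGHT ORDER: the override is in effect when the initializer expands. */
struct demo_target fixed = DEMO_INITIALIZER;
```

In this model `too_early' still points at the default hook, while `fixed' points at the override, which is the same situation as placing TARGET_INITIALIZER before or after the #undef/#define pair.
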


The VLIW bundle output questions

2005-05-19 Thread Ling-hua Tseng
I'm porting gcc to a uni-core architecture (i.e., only one core).
There are 10 function units:
   (1) 2 RISCs: the two RISCs have the same capabilities; they can do load/store,
       full-word arithmetic/logic operations, register moves, ...
   (2) 4 DSPs (2 MACs, 1 BSU, and 1 VFU):
       * MAC: multiply-accumulate and SIMD arithmetic operations
       * BSU: packing/unpacking, absolute value, average, ...
       * VFU: packing/unpacking, swap, bit reverse, min/max, ...
   (3) 4 CFUs (Customized Function Units, which perform some MPEG4 decoding operations):
       * VLD CFU: MV/DC/AC decoder
       * DCT/IDCT CFU: some instructions for DCT/IDCT
       * MC CFU: some instructions for motion compensation/estimation
       * (not implemented yet)

There are 8 slots in a VLIW instruction bundle (i.e., at most 8
instructions can be issued in 1 cycle),
and the assembly language syntax looks like:
   " , , , ..."
For example:
===[top]
   mov    .risc0 r1, #25        \\
   ldw    .risc0 r2, [fp, #30]  \\
   addub  .mac0  d0, d4, d3     \\
   subub  .mac1  d11, d7, d4
   add    .risc0 r3, r1, r5
===[end]
(The symbol "\\" means "parallel": the next instruction is issued in the
same cycle.)
The first 4 instructions are in the same VLIW bundle (issued in the first
cycle),
and the last instruction is in another VLIW bundle (issued in the next cycle).
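
The output rule described above (every instruction in a bundle except the last one carries the "\\" marker) can be sketched in C as follows. The function name and the buffer-based interface are assumptions made for illustration; this is not how GCC's final pass is actually hooked up.

```c
#include <stdio.h>
#include <stddef.h>

/* Write the instructions of one bundle into `out', one per line, and
   append " \\" to every instruction except the last.
   Returns the number of characters written, or -1 if `out' is too small. */
int format_bundle(char *out, size_t size,
                  const char *const insns[], size_t count)
{
    size_t used = 0;

    if (size == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        /* In C source, " \\\\" is a space followed by two backslashes. */
        const char *marker = (i + 1 < count) ? " \\\\" : "";
        int n = snprintf(out + used, size - used, "%s%s\n", insns[i], marker);

        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling it once per bundle with the instructions of that bundle gives the layout of the example: the marker appears on all lines but the last.
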
I plan to schedule the instructions using the "pipeline description".
After reading Ch. 10-13 of the GCC internals manual, I currently have three
questions:
   (1) How can I output the parallel symbol "\\" in the final pass?
       Obviously I should append "\\" to the instructions that share a bundle,
       but I could not find the target machine macros/hooks for doing so.
   (2) How can I fill the  field?

       (Could questions (1) and (2) be solved by using the macro PRINT_OPERAND?)

   (3) Should I put only one machine instruction in each instruction
       pattern?
       In other ports, I have seen output templates containing more than one
       machine instruction,
       for example: "add\\t%Q0, %Q0, %Q2\;adc\\t%R0, %R0, %R2".
       Some output templates call a C function to output several
       instructions that do not share
       the same characteristics in the function unit pipeline.
       I'm worried that such "multi-instruction" output templates will
       confuse the DFA
       and put several instructions into a single VLIW bundle slot.
       Should I split them with define_split and write correspondingly
       refined instruction patterns?


A constant pool and addressing mode question

2005-05-21 Thread Ling-hua Tseng

I'm porting GCC 4.0.1 to a new architecture.
Its load/store instructions are similar to ARM's.

The RTL always uses a symbol_ref RTX to access a global variable,
and the symbol_ref is an immediate whose value is determined at
assembly/link time.
My architecture's addressing modes do not include a "direct addressing
mode";
that is, the syntax "load, " isn't allowed.
I can only use these two forms:
   "load, [#]" and
   "load, []".

Because the immediate field has limited length, I cannot generate the
following instructions:
   mov, 
   load, [#0]

I noticed that the ARM port uses the hook "TARGET_MACHINE_DEPENDENT_REORG"
to solve this.
It generates a "mini constant pool" at a suitable location by computing the
insn length attributes,
and replaces "ldr, " with "ldr, ".
The mini constant pool looks like:
.L3:
   .word
.L4:
   .word
...

A long time ago I came across this comment, which sits just before the
implementation of that hook:
/* Gcc puts the pool in the wrong place for ARM, since we can only
   load addresses a limited distance around the pc.  We do some
   special munging to move the constant pool values to the correct
   point in the code.  */

In my port, I have no idea how to make GCC generate a constant pool with
symbol names without using
the TARGET_MACHINE_DEPENDENT_REORG hook.
(Currently, only string literal constant pools appear in my
port.)
So I cannot understand what the comment "Gcc puts the pool in the wrong place for
ARM" means.

I have three questions:
   1. Can GCC generate constant pools that contain symbol
   names if I set some
   target machine macros? (even if the pool ends up in the wrong place)
   2. Is there a general way to tell GCC how far a load/store
   instruction can reach?
   3. To reduce memory accesses, I want to use a
   special assembly syntax:
   movim  r1, /highpart    @ move the MSB 16 bits of
                           @ the symbol address to r1 [31..16]
   movil  r1, /lowpart     @ move the LSB 16 bits
                           @ of the symbol address to r1 [15..0]
   instead of ARM's solution:
   ldr  r1,  .L3
   ...
   .L3:
   .word 
   Of course, this would require modifying the assembler, linker, and loader.
   Is it a good idea?

Thanks a lot.


question about match_operand and vec_select

2005-06-19 Thread Ling-hua Tseng

I noticed that (vec_select:m ...) cannot be matched by (match_operand:m
...).
For example:
  (set (vec_select:HI (reg:V4QI r3)
  (parallel [(const_int 0) (const_int 1)]))
   (const_int 0x1122))
couldn't be matched by:
 [(set (match_operand:HI 0 "register_operand" "=R")
   (match_operand 1 "const_int_operand" "i"))]

Only RTL templates that contain an explicit (vec_select:HI ...) match.

Is this behavior expected and correct?

Thanks a lot.


Question of `internal consistency failure' in the backend pass 32 (sched2)

2005-07-06 Thread Ling-hua Tseng

My GCC version is 4.0.1 20050630 (prerelease).

I got an `internal consistency failure' error in backend pass 32.
The error was raised by flow.c:verify_local_live_at_start().
The RTL dump, .c.32.sched2, shows:
[begin]---
live_at_start mismatch in bb 0, aborting
New:

first = 0x84053cc current = 0x84053cc indx = 0
   0x84053cc next = (nil) prev = (nil) indx = 0
   bits = { 2 11 14 15 }
Old:
;; basic block 0, loop depth 0, count 0
;; prev block -1, next block -2
;; pred:   ENTRY [100.0%]  (fallthru)
;; succ:   EXIT [100.0%] 
;; Registers live at start:  2 [r2] 3 [r3] 11 [r11] 14 [r14] 15 [r15]

-[end]

r2 and r3 together hold a V8QI value (double-word data).
Below I describe the main steps that r2 and r3 go through in the backend.

1. In .c.00.expand, it was expanded to:
(insn 12 11 13 1 (set (reg:V8QI 107)
   (const_vector:V8QI [
   (const_int -86 [0xffaa])
   (const_int -69 [0xffbb])
   (const_int -52 [0xffcc])
   (const_int -35 [0xffdd])
   (const_int -18 [0xffee])
   (const_int -1 [0x])
   (const_int 18 [0x12])
   (const_int 52 [0x34])
   ])) -1 (nil)
   (nil))

2. In .c.23.sched, it was split into 4 RTXs:
(insn 33 10 34 0 (set (vec_select:HI (subreg:V4QI (reg:V8QI 107) 0)
   (parallel [
   (const_int 0 [0x0])
   (const_int 1 [0x1])
   ]))
   (const_int 48042 [0xbbaa])) 1 {*movv4qi_lowpart} (nil)
   (nil))

(insn 34 33 35 0 (set (vec_select:HI (subreg:V4QI (reg:V8QI 107) 0)
   (parallel [
   (const_int 2 [0x2])
   (const_int 3 [0x3])
   ]))
   (const_int 56780 [0xddcc])) 2 {*movv4qi_highpart} (nil)
   (nil))

(insn 35 34 36 0 (set (vec_select:HI (subreg:V4QI (reg:V8QI 107) 4)
   (parallel [
   (const_int 0 [0x0])
   (const_int 1 [0x1])
   ]))
   (const_int 65518 [0xffee])) 1 {*movv4qi_lowpart} (nil)
   (nil))

(insn 36 35 11 0 (set (vec_select:HI (subreg:V4QI (reg:V8QI 107) 4)
   (parallel [
   (const_int 2 [0x2])
   (const_int 3 [0x3])
   ]))
   (const_int 13330 [0x3412])) 2 {*movv4qi_highpart} (nil)
   (nil))

3. In the .c.23.greg, the pass did `reload_in_reg: (reg:V8QI 2 r2 [107])':
(insn 33 11 34 0 (set (vec_select:HI (reg:V4QI 2 r2 [107])
   (parallel [
   (const_int 0 [0x0])
   (const_int 1 [0x1])
   ]))
   (const_int 48042 [0xbbaa])) 1 {*movv4qi_lowpart} (nil)
   (nil))

(insn 34 33 35 0 (set (vec_select:HI (reg:V4QI 2 r2 [107])
   (parallel [
   (const_int 2 [0x2])
   (const_int 3 [0x3])
   ]))
   (const_int 56780 [0xddcc])) 2 {*movv4qi_highpart} (nil)
   (nil))

(insn 35 34 36 0 (set (vec_select:HI (reg:V4QI 3 r3 [+4 ])
   (parallel [
   (const_int 0 [0x0])
   (const_int 1 [0x1])
   ]))
   (const_int 65518 [0xffee])) 1 {*movv4qi_lowpart} (nil)
   (nil))

(insn 36 35 13 0 (set (vec_select:HI (reg:V4QI 3 r3 [+4 ])
   (parallel [
   (const_int 2 [0x2])
   (const_int 3 [0x3])
   ]))
   (const_int 13330 [0x3412])) 2 {*movv4qi_highpart} (nil)
   (nil))

Because it is a double-word pseudo register,
the register allocator assigned it 2 hard registers (r2 and r3).
These 4 RTXs are not changed by any later pass.
Finally, the error occurred in pass 32.

The .c.32.sched2 output seems to tell me that
something is wrong with the register live bitmap in the CFG.
The bitmap is { 2 11 14 15 },
and I believe it should be { 2 3 11 14 15 }.
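
To spell out the expectation: with 32-bit registers, an 8-byte V8QI value that starts in r2 occupies both r2 and r3, so both bits should be set. The C sketch below only illustrates that arithmetic; the helper names and the 4-byte word size are assumptions taken from the description above, and it does not model how flow.c actually computes liveness.

```c
#include <stdint.h>

#define WORD_BYTES 4u   /* assumption: 32-bit hard registers */

/* Number of consecutive hard registers needed for a value of this size. */
unsigned nregs_for_size(unsigned size_bytes)
{
    return (size_bytes + WORD_BYTES - 1u) / WORD_BYTES;
}

/* Set one bit per hard register occupied by a value of `size_bytes'
   bytes that starts in register `regno' (regno + nregs must be < 32). */
uint32_t mark_live(uint32_t live, unsigned regno, unsigned size_bytes)
{
    unsigned n = nregs_for_size(size_bytes);

    for (unsigned i = 0; i < n; i++)
        live |= UINT32_C(1) << (regno + i);
    return live;
}
```

Marking an 8-byte value in register 2 sets bits 2 and 3, while a 4-byte value sets only bit 2, which corresponds to the difference between the two bitmaps.
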

So I guess I missed some configuration in the target hooks or target
machine macros.
Could anyone tell me which hooks/macros can affect the live bitmaps
in the CFG?

Thanks a lot.


Question of vector type extending

2005-07-19 Thread Ling-hua Tseng

I am porting gcc to a new platform that supports vector arithmetic
operations.
(I'm using the latest 4.0.x snapshot and upgrading it every week.)

Currently, we can write the following multiply-accumulate RTL template for
non-vector types:
 [(set (match_operand:DI 0 "register_operand" "=r")
       (plus:DI
         (mult:DI (sign_extend:DI (match_operand:SI 2 "register_operand" "r"))
                  (sign_extend:DI (match_operand:SI 3 "register_operand" "r")))
         (match_operand:DI 1 "register_operand" "r")))]

That instruction pattern matches the following C code:
   long long x, y;
   int a = 10, b = 9;

   y = x + (long long)a * (long long)b;

On this platform, a multiply-accumulate operation has 2 steps:
   1. load a 64-bit value into the accumulator (e.g. macld d0, d1)
   2. perform the multiply-accumulate N times with two 32-bit registers (e.g.
   madd d2, d3)
The machine evaluates (d0, d1) + d2 * d3.
That is why the RTL template above uses (sign_extend:DI (reg:SI x)).
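
As a quick illustration of why both operands are sign-extended before the multiply, the scalar operation can be written in C like this (a sketch; the function name is made up):

```c
#include <stdint.h>

/* 64-bit accumulate of a 32 x 32 -> 64 bit signed product.
   Both operands are widened first, so the product itself is computed
   in 64 bits and cannot overflow. */
int64_t macc(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}
```

For example, 65536 * 65536 does not fit in 32 bits, but the widened product is exact.
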

Now I am trying to apply the same approach to vector modes, such as:
 [(set (match_operand:V2SI 0 "register_operand" "=r")
       (plus:V2SI
         (mult:V2SI (sign_extend:V2SI (match_operand:V2HI 2 "register_operand" "r"))
                    (sign_extend:V2SI (match_operand:V2HI 3 "register_operand" "r")))
         (match_operand:V2SI 1 "register_operand" "r")))]

And I wrote the following C code with GNU C extensions:
   typedef short v2hi __attribute__((vector_size(4)));
   typedef int v2si __attribute__((vector_size(8)));

   v2si x, y;
   v2hi a = {1, 2}, b = {3, 4};

   y = x + (v2si)a * (v2si)b;

I know a vector type cannot be cast/converted to a type with a different vector
size,
but I want these `casting/converting' operations to be treated as `extending'
operations.

Is there any solution for this situation?
Thanks a lot.
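
The intended element-wise meaning of the vector pattern can be written out in portable C as below. This is only a sketch of the semantics: the struct types v2hi_t and v2si_t and the function names are stand-ins for the vector modes, not GNU vector types, and they say nothing about how to make GCC generate the pattern.

```c
#include <stdint.h>

/* Stand-ins for V2HI (two 16-bit lanes) and V2SI (two 32-bit lanes). */
typedef struct { int16_t lane[2]; } v2hi_t;
typedef struct { int32_t lane[2]; } v2si_t;

/* Lane-wise sign extension, the meaning of (sign_extend:V2SI ...). */
v2si_t extend_v2hi(v2hi_t v)
{
    v2si_t r = { { v.lane[0], v.lane[1] } };

    return r;
}

/* acc + extend(a) * extend(b), lane by lane.  A 16 x 16 bit product
   always fits in 32 bits; the accumulation is assumed not to overflow. */
v2si_t vmacc(v2si_t acc, v2hi_t a, v2hi_t b)
{
    v2si_t wa = extend_v2hi(a), wb = extend_v2hi(b);

    for (int i = 0; i < 2; i++)
        acc.lane[i] += wa.lane[i] * wb.lane[i];
    return acc;
}
```

Each lane is widened on its own, which is different from reinterpreting the bits of the whole vector as a cast between same-sized vector types would do.
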



Can I use SCHED_GROUP_P to make the VLIW bundle in the final pass?

2005-07-22 Thread Ling-hua Tseng

I'm porting the GCC 4.0.x snapshots to a VLIW architecture.
Currently, I need to bundle the instructions.

I want to use "%P" (meaning parallel execution with the next insn) in the
output templates of
(define_insn ...) in the MD, and use SCHED_GROUP_P to determine
whether
the next insn should be bundled with the current insn (by filling in %P in
PRINT_OPERAND).

Is this a correct approach?
Thanks a lot.



Question of the DFA scheduler

2005-08-10 Thread Ling-hua Tseng

I'm porting gcc 4.0.1 to a new VLIW architecture.
Some of its function units have no internal hardware pipeline forwarding,
so I need to insert "nop" instructions to resolve data hazards.

I use the automaton-based pipeline description for my port,
describe the data latencies with `define_insn_reservation',
and am trying to insert the "nop"s in the hook TARGET_MACHINE_DEPENDENT_REORG.

The implementation of this hook is simple.
I just run the DFA scheduler again manually
and issue the insns the same way the 2nd sched pass does.
Here is my implementation:
[TOP]---
void unicore_reorg(void)
{
   bool func_start = true;
   int stalls;
   rtx insn = NULL, new_insn;
   state_t dfa_state = alloca(state_size());

   dfa_start();

   state_reset(dfa_state);

   for(insn = get_insns(); insn ; insn = NEXT_INSN(insn)) {
      if(!executable_insn_p(insn)) continue;

      if(!func_start && GET_MODE(insn) == TImode) state_transition(dfa_state, NULL);

      stalls = state_transition(dfa_state, insn);

      if(stalls == 1) {
         state_transition(dfa_state, NULL);
         state_transition(dfa_state, insn);
      }
      if(stalls > 1) {
         while(--stalls) {
            new_insn = emit_insn_before(gen_nop(), insn);
            if(flag_schedule_insns_after_reload) PUT_MODE(new_insn, TImode);
            recog_memoized(new_insn);
            state_transition(dfa_state, NULL);
         }
         state_transition(dfa_state, NULL);
         state_transition(dfa_state, insn);
      }

      func_start = false;
   }
   dfa_finish();
}
[END]---


But I still see these two instructions being issued in consecutive
cycles:
[TOP]---
@(insn:TI 48 50 49 (set (reg:SI 32 d0 [134])
@(minus:SI (reg:SI 6 r6 [orig:135 FLAG ] [135])
@(reg:SI 33 d1))) 48 {*subsi3} (insn_list:REG_DEP_TRUE 175 (nil))
@(expr_list:REG_EQUAL (neg:SI (reg:SI 3 r3 [orig:122 D.1804 ] [122]))
@(nil)))  
   sub .m0 d0, r6, d1  @ 48*subsi3/4   [length = 4]

@(insn:TI 49 48 176 (set (reg:SI 32 d0 [136])
@(smax:SI (reg:SI 3 r3 [orig:122 D.1804 ] [122])
@(reg:SI 32 d0 [134]))) 66 {smaxsi3} (insn_list:REG_DEP_TRUE 48 
(insn_list:REG_DEP_TRUE 42 (nil)))
@(expr_list:REG_DEAD (reg:SI 3 r3 [orig:122 D.1804 ] [122])
@(nil)))  
   max .m0 d0, r3, d0  @ 49smaxsi3/2   [length = 4]

[END]---
The destination operand of the `sub' instruction, d0, is written back in
the 4th cycle,
and the `max' instruction uses it as a source operand (i.e., there is a true
data dependency).

I found that state_transition() returns -1 when I issue the `max'
instruction,
and that it only returns > 0 when a "hardware structural hazard" occurs.

Is there any way for me to insert 4 nops between the 2 insns?
Thanks a lot.




Re: Question of the DFA scheduler

2005-08-11 Thread Ling-hua Tseng

I found that insn_latency(insn1, insn2) always returns a 
constant value in any state
(i.e., it is a static value, not determined dynamically).
Implementing the `nop insertion' in the reorg pass seems to have a high 
time complexity,
because the `nop insertion' algorithm may be written as the following pseudo 
code:
   if(LOG_LINKS(insn) is not empty) {
      foreach (dep_insn in LOG_LINKS(insn)) {
         if(dep type is not true dependency) continue;
         stalls = insn_latency(dep_insn, insn);
         count = the distance between insn and dep_insn;
         if(stalls > count) emit (stalls - count - 1) NOPs before insn;
      }
   }
The time complexity of this algorithm is O(n squared), which may greatly 
increase the compilation time.
Is there any better solution for the nop insertion?
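
One possible way to avoid measuring the distance for every dependency is a
register scoreboard: walk the insns once and remember, for each register, the
first cycle at which it may be read. The following is only a minimal sketch in
C++ and is not GCC code. The `Insn' struct, the function `insert_nops', and
the model of one insn issued per cycle are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified instruction record (not GCC's rtx).
struct Insn {
    std::string name;        // mnemonic, used only for the output
    int dest;                // register written, or -1 if none
    std::vector<int> srcs;   // registers read
    int latency;             // cycles after issue until dest may be read
};

// Single forward pass: keep, per register, the first cycle at which it may
// be read, and pad with NOPs until all sources of the next insn are ready.
// Assumes one instruction issues per cycle (no VLIW bundling).
std::vector<std::string> insert_nops(const std::vector<Insn>& insns) {
    std::unordered_map<int, int> ready;  // register -> first readable cycle
    std::vector<std::string> out;
    int cycle = 0;
    for (const Insn& insn : insns) {
        assert(insn.latency >= 1);
        int earliest = cycle;
        for (int reg : insn.srcs) {
            auto it = ready.find(reg);
            if (it != ready.end() && it->second > earliest)
                earliest = it->second;
        }
        for (; cycle < earliest; ++cycle)
            out.push_back("nop");
        out.push_back(insn.name);
        if (insn.dest >= 0)
            ready[insn.dest] = cycle + insn.latency;
        ++cycle;
    }
    return out;
}
```

Each insn is visited once and each source register is looked up once, so the
work grows with the number of insns rather than with its square. The sketch
only covers true dependencies through registers; memory dependencies and the
exact meaning of "latency" on a real pipeline would need to be adapted.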

Thanks a lot.

- Original Message - 
From: "Richard Sandiford" <[EMAIL PROTECTED]>

To: "Ling-hua Tseng" <[EMAIL PROTECTED]>
Cc: 
Sent: Thursday, August 11, 2005 6:49 PM
Subject: Re: Question of the DFA scheduler



"Ling-hua Tseng" <[EMAIL PROTECTED]> writes:

The destination operand of the `sub' instruction, d0, will be written
back in the 4th cycle, and the instruction `max' will use it as source
operand (i.e., there is a true data dependency).

I figured out that the state_transition() returns -1 when I issuing
the `max' instruction, and I figured out it only returns > 0 when
"hardware structural hazard" occured.


Right.  state_transition just checks for unit hazards, not data hazards.

Instruction dependencies are detected by sched-deps.c and stored in the
instructions' LOG_LINKS.  insn_latency (in insn-attrtab.c) gives the
latency for two dependent instructions.

Richard



Question of 2nd instruction scheduling pass

2005-08-12 Thread Ling-hua Tseng

I'm porting gcc-4.0.1 to a new VLIW architecture.
I found that an `insn' and a `jump_insn' were grouped together in the 2nd 
sched pass
even though there is a `structural hazard' between them,
as in the following code generated by gcc -O3 -dP -S code.c:
@(insn:TI 319 315 474 (set (reg/v:SI 7 r7 [orig:107 j.162 ] [107])
@(const_int 0 [0x0])) 14 {*movsi_const} (nil)
@(nil))
   mov .r0 r7, #0  \\  @ 319   *movsi_const/1  [length = 4]
@(jump_insn 474 319 475 (set (pc)   
@(label_ref 229)) 81 {jump} (nil)

@(nil))
   b .L37@ 474   jump[length = 4]

I think that the jump_insn must have TImode, but it doesn't.

The insn attribute `type' of "*movsi_const" is "r_dp" (RISC data 
processing).
The insn attribute `type' of "jump" is "r_branch" (RISC branch).
Both types of instructions are handled by the same `RISC' function unit 
(called `r0').

My automata-based pipeline description contains the following code:
(define_insn_reservation "risc_data_processing" 4
 (eq_attr "type" "r_dp")
 "r0")
(define_insn_reservation "risc_branch" 0
 (eq_attr "type" "r_branch")
 "r0")

Both should reserve `r0' (RISC 0) when they're issued,
so they shouldn't be issued in the same cycle.

Nevertheless, sometimes the jump_insn has the correct mode:
@(insn/f:TI 58 57 59 (set (reg/f:SI 11 r11)
@(plus:SI (reg:SI 13 r13)
@(const_int -4 [0xfffc]))) 45 {*addsi3} (insn_list:REG_DEP_ANTI 
57 (insn_list:REG_DEP_TRUE 56 (nil)))
@(expr_list:REG_DEAD (reg:SI 13 r13)
@(nil)))  
   sub .r0 r11, r13, #4@ 58*addsi3/2   [length = 4]

@(jump_insn:TI 17 59 19 (set (pc)
@(if_then_else (le (reg/v:SI 2 r2 [orig:111 size ] [111])
@(const_int 0 [0x0]))
@(label_ref 37)
@(pc))) 83 {cbranchsi4} (insn_list:REG_DEP_ANTI 56 
(insn_list:REG_DEP_ANTI 58 (insn_list:REG_DEP_ANTI 57 (nil
@(expr_list:REG_BR_PROB (const_int 5000 [0x1388])
@(nil)))  
   b {!C}  .r0 .L11@ 17cbranchsi4/1[length = 4]


I discovered that the `unconditional branch' is always issued together with other insns 
and the `conditional branch' isn't.
Is there any way to tell GCC not to group a jump_insn with other insns 
when a structural hazard occurs?

Thanks a lot.


Question of the suitable time to call `free_bb_for_insn()'

2005-08-13 Thread Ling-hua Tseng

I'm porting GCC 4.0.2 (2005-08-11 snapshot) to a new VLIW architecture.

I found that `free_bb_for_insn()' is called before the reorg pass,
and I would like to use the CFG in the reorg pass.

The reason is:
   I would like to set flag_schedule_insns_after_reload to 0 in the macro 
OVERRIDE_OPTIONS if it was set,
   and then call the sched2 pass somewhere in the hook 
TARGET_MACHINE_DEPENDENT_REORG.
   Perhaps I will manually do some instruction scheduling in the reorg pass in 
the future.

So I have two questions:
   1. Is it safe to move the line `free_bb_for_insn ();' to just after 
`rest_of_handle_machine_reorg ();'?
   2. If it is safe, would the GCC team be willing to move it there so that 
others can use the CFG info
   in the reorg pass?

By the way, I noticed that the ia64 port does something similar to mine.
But it makes some effort to recode something before the reorg pass.
Moreover, it is forced to call `schedule_ebbs()' (I'd like to call 
`schedule_insns()').

Thanks a lot.


Re: Question of the suitable time to call `free_bb_for_insn()'

2005-08-14 Thread Ling-hua Tseng
I'm sorry that I didn't trace cfgrtl.c before posting the question.
Now I see that I can get the info again by calling compute_bb_for_insn().

On Sun, 14 Aug 2005 09:15:49 +0800, Ling-hua Tseng wrote
> I'm porting the GCC 4.0.2 (2005-08-11 snapshot) to a new VLIW architecture.
> 
> I figured out the `free_bb_for_insn()' is called before the reorg 
> pass, and I would like to use the CFG in the reorg pass for a reason.
> 
> The reason is:
> I would like to change flag_schedule_insns_after_reload to 0 by 
> the macro OVERRIDE_OPTIONS if it was set, and then I would like 
> to call the sched2 pass in some location of the hook 
> TARGET_MACHINE_DEPENDENT_REORG.
> Perhaps I will manually do some instruction scheduling in the 
> reorg pass in the future.
> 
> So I have two questions:
> 
> 1. Is it safe to move the line `free_bb_for_insn ();' to the 
> next line of `rest_of_handle_machine_reorg ();' ?
> 2. If it is safe, would the GCC team like to move it to there 
> for allowing other ones can use CFG info in the reorg pass?
> 
> By the way, I noticed that the ia64 port did something which is 
> similar to mine. But it do some effort for recoding something before 
> the reorg pass. Moreover, it's forced to call the `schedule_ebbs()' 
> however(I'd like to call `schedule_insns()').
> 
> Thanks a lot.



Question of pipeline description

2005-08-19 Thread Ling-hua Tseng

I'm porting GCC 4.0.2 to a new VLIW architecture.
There are 10 function units (2 RISCs and 8 DSPs) in the architecture.
The pipeline stages are: IS, ID (fetch operand), E1 (ALU), E2, E3, E4 (write back 
to register).
To save circuit area, the pipeline forwarding mechanism is not available 
across two different function units.

For example, these two instructions can use pipeline forwarding to 
reduce the stall cycles:
   add .r0 r2, r3, r4  @ the result is generated at the E1 stage
   nop .r0             @ stall 1 cycle
   add .r0 r5, r6, r2  @ E1 can forward to ID because the two 
instructions use the same function unit

These two instructions cannot use pipeline forwarding because they use 
different function units
(.r0 means that the instruction uses RISC0, and .r1 means that the instruction 
uses RISC1):
   add .r0 r2, r3, r4  @ write back to register at the E4 stage
   nop .r0             @ stall 1 cycle
   nop .r0             @ stall 1 cycle
   nop .r0             @ stall 1 cycle
   add .r1 r5, r6, r2  @ no forwarding mechanism between two different 
function units

The pipeline description can trivially be written as follows:
(define_query_cpu_unit "r0, r1, d0, d1, d2, d3, d4, d5, d6, d7")

(define_insn_reservation "risc_data_processing" 4
   (and (eq_attr "type" "dp")
   (eq_attr "fu" "risc"))
   "(r0 | r1)")

I set the latency to 4 for the general case (without pipeline forwarding).
Then I set a bypass rule for the pipeline forwarding case:
(define_bypass 1
   "risc_data_processing" "risc_data_processing, risc_load_word, ...")

This is only correct if the two RISC insns reserve the same RISC function unit.
If the two insns reserve RISC0 and RISC1 respectively, the bypass rule does not 
apply.
I know that we can use a "guard function" in (define_bypass ...), but it 
does not seem to help in this situation,
because the "guard function" cannot know which function units the two insns reserved.

Is there any other solution for this situation?
Thanks a lot.


NOPs inserting problem in GCC 4.1.x

2006-03-19 Thread Ling-hua Tseng

I'm porting GCC 4.1.1 to a VLIW processor.
The processor cannot resolve any hazards itself, so we have to insert explicit 
NOPs after insn scheduling.
I have implemented this functionality in the hook 
`TARGET_MACHINE_DEPENDENT_REORG' (pass 52: mach).
Then I noticed that pass 56 (split3) eliminates some insns and thereby creates 
new structural/data hazards.
For example, I have the following insns after pass 54 (barriers):
(insn 3489 3488 3515 (set (reg:SI 32 d0 [orig:143 D.4405 ] [143])
   (mem/s/j:SI (reg/f:SI 3 r3 [2801]) [0 M0 S4 A32])) 19 {*movsi} 
(insn_list:REG_DEP_TRUE 3488 (insn_list:REG_DEP_ANTI 3534 (n
   (nil))

(insn 3515 3489 3522 (set (reg:QI 5 r5 [2825])
   (reg:QI 5 r5 [2824])) 22 {movqi} (insn_list:REG_DEP_TRUE 3514 (nil))
   (nil))

(insn 3522 3515 5167 (set (reg:QI 8 r8 [2831])
   (reg:QI 4 r4 [2830])) 22 {movqi} (insn_list:REG_DEP_TRUE 3521 (nil))
   (nil))

(insn 5167 3522 3496 (set (reg:SI 11 r11)
   (reg:SI 32 d0 [orig:143 D.4405 ] [143])) 19 {*movsi} (nil)
   (nil))

The pipeline description for insn 3489 is (r0, nothing*4, cross_write).
The pipeline description for insn 5167 is (r0, cross_read, cross_write).
They have been scheduled to avoid reserving `cross_write' at the same time.
Unfortunately, insn 3515 will be eliminated by a later pass (pass 56: split3), and then they will reserve `cross_write' at the same 
time.


Because I need the `length' attribute (i.e., I use 
get_attr_length() in the machine description),
I have to insert NOPs explicitly before pass 58 (shorten) so that the shorten pass can calculate the insn lengths 
exactly.

Can I directly move the reorg pass down to the shorten pass by modifying 
passes.c?

Thanks a lot.



Re: NOPs inserting problem in GCC 4.1.x

2006-03-19 Thread Ling-hua Tseng

Sorry.

The example in my previous post was wrong.
I have corrected it in this post.

I'm porting GCC 4.1.1 to a VLIW processor.
The processor cannot resolve any hazards itself, so we have to insert explicit 
NOPs after insn scheduling.
I have implemented this functionality in the hook 
`TARGET_MACHINE_DEPENDENT_REORG' (pass 52: mach).
Then I noticed that pass 56 (split3) eliminates some insns and thereby creates 
new structural/data hazards.
For example, I have the following insns after pass 54 (barriers):

(insn 3489 3488 3515 (set (reg:SI 32 d0 [orig:143 D.4405 ] [143])
   (mem/s/j:SI (reg/f:SI 3 r3 [2801]) [0 M0 S4 A32])) 19 {*movsi} (insn_list:REG_DEP_TRUE 3488 (insn_list:REG_DEP_ANTI 3534 
(nil)))

   (nil))

(insn 3515 3489 3522 (set (reg:QI 5 r5 [2825])
   (reg:QI 5 r5 [2824])) 22 {movqi} (insn_list:REG_DEP_TRUE 3514 (nil))
   (nil))

(insn 3522 3515 7984 (set (reg:QI 8 r8 [2831])
   (reg:QI 4 r4 [2830])) 22 {movqi} (insn_list:REG_DEP_TRUE 3521 (nil))
   (nil))

(insn 7984 3522 5167 (const_int 0 [0x0]) 147 {nop} (nil)
   (nil))

(insn 5167 7984 7683 (set (reg:SI 11 r11)
   (reg:SI 32 d0 [orig:143 D.4405 ] [143])) 19 {*movsi} (nil)
   (nil))


The pipeline description for insn 3489 is (r0, nothing*4, cross_write).
The pipeline description for insn 5167 is (r0, cross_read, cross_write).
They have been scheduled to avoid reserving `cross_write' at the same time.
Unfortunately, insn 3515 will be eliminated by a later pass (pass 56: split3), and then they will reserve `cross_write' at the same 
time.


Because I need the `length' attribute (i.e., I use 
get_attr_length() in the machine description),
I have to insert NOPs explicitly before pass 58 (shorten) so that the shorten pass can calculate the insn lengths 
exactly.

Can I directly move the reorg pass down to the shorten pass by modifying 
passes.c?

Thanks a lot.





Question of the LOG_LINKS field

2006-07-15 Thread Ling-hua Tseng

I'm porting GCC 4.1.1 to a VLIW architecture.
I have to insert NOP instructions when data dependencies occur.
So I wrote the following algorithm:
foreach(insn in all real insns) {
 foreach(dep_insn in LOG_LINKS(insn)) {
  if(INSN_DELETED_P(dep_insn)) continue;

  stalls = insn_latency(dep_insn, insn);
  distance = cycle_distance(dep_insn, insn);

  if(stalls > distance)
   emit proper NOP instructions before insn;
 }
}
(This algorithm is performed in the hook `TARGET_ASM_FUNCTION_PROLOGUE')

The algorithm depends heavily on the information in LOG_LINKS(insn).
But I found that there is no dependency info for `reload instructions',
because the register allocation and reload passes are performed after
the first insn scheduling pass.

For example, here are two insns which have a true data dependency:
@(insn 25 2343 895 (set (reg:SI 25 rd1)
@(const_int 0 [0x0])) 15 {*movsi_const_dsp} (nil)
@(nil))
   movilc  .m0 rd1, #0 @ 25*movsi_const_dsp/1  [length = 4]
@(insn 895 25 33 (set (mem/c:SI (plus:SI (reg/f:SI 12 fp)
@(const_int -1204 [0xfb4c])) [0 S4 A32])
@(reg:SI 25 rd1)) 16 {*movsi_dsp} (nil)
@(nil))
   stw .r0 rd1, *-fp[#1204]@ 895   *movsi_dsp/13   [length = 4]

`insn 895' is inserted by the global register allocation pass.
So its LOG_LINKS field is empty, because the insn was not processed by the first 
scheduling pass.

Should I write a brute-force algorithm to scan for these data dependencies?
Is there any better solution for this problem?

Thanks a lot.



problem about generating data section in g++ 4.2

2007-07-19 Thread Ling-hua Tseng

Here is the example program:
==
// test.cxx
#include 
#include 

namespace {
 template 
 struct transformValue {
   size_t operator()(const T &x) const
   {
   return x + 10;
   }
 };
}

extern std::map > *test;
std::map > *test;
==

If I compile this file with g++ 4.2 using the following command:
 g++ -c test.cxx
and then use this command to check the symbols:
 nm test.o
I cannot find the global variable `test' in the symbol table:
U _ZNKSs4sizeEv
U _ZNKSsixEj
 t _ZSt17__verify_groupingPKcjRKSs
 W _ZSt3minIjERKT_S2_S2_
U __gxx_personality_v0

I can find the symbol `test' when using an older version of g++, such as g++ 4.1.x or 
3.x:
 r _ZN9__gnu_cxx16__stl_prime_listE
U _ZNKSs4sizeEv
U _ZNKSsixEj
 t _ZSt17__verify_groupingPKcjRKSs
 W _ZSt3minIjERKT_S2_S2_
U __gxx_personality_v0
 B test

The key points of this problem are:
 1. struct transformValue<> is defined in an anonymous namespace
 2. transformValue is passed as one of the template arguments of std::map<> 
to instantiate it
 3. the global variable `test' (which should be a pointer) is declared as an object 
with external linkage
 4. the global variable `test' (which should be a pointer) is defined

If the variable `test' isn't declared/defined with a pointer type, such as:
 extern std::map > test;
 std::map > test;
You can see a similar problem: the symbol `test' becomes a `static' local 
variable.
This can be observed with the tool `nm':
  b test
Its attribute should be `B', but it became `b'.


Re: problem about generating data section in g++ 4.2

2007-07-19 Thread Ling-hua Tseng

Thanks a lot.

I also get a warning message about this when I use the `bridge pattern' (aka 
pimpl idiom) in my code.
For example, here are 2 files:
==
// interface.hxx
#include 

struct TestImpl;

class Test {
public:
   Test();
   ~Test();
   void foo();
private:
   std::auto_ptr pImpl;
};
==
and
==
// TestImpl.cxx
#include 
#include 
#include 
#include "interface.hxx"

using namespace std;

namespace {
 template 
 struct CompareValue {
   size_t operator()(const T &lhs, const T &rhs) const
   {
   return lhs < rhs;
   }
 };
}

struct TestImpl {
   TestImpl() : test(new map >) { }
   void foo();
   map > *test;
};

void TestImpl::foo()
{
   (*test)[1] = 2;
}

Test::Test() : pImpl(new TestImpl)
{

}

Test::~Test()
{
}

void Test::foo()
{
   pImpl->foo();
}
==

When I compiled TestImpl.cxx, I got the following warning message:
 warning: 'TestImpl' has a field 'TestImpl::test' whose type uses the anonymous namespace

Although my program works normally, I'm still concerned about this warning 
message.
Since the bridge pattern is widely used in my huge software project to reduce 
compile time,
I worry that this change in GCC will break it in the future.

I know GCC provides the visibility attribute, but it's not helpful in this case.
Should I move the definition of CompareValue<> to another namespace?

Brian Dessent wrote:

Ling-hua Tseng wrote:


If I compile this file with g++ 4.2 by the following command:
  g++ -c test.cxx
and then use this command to check symbol:
  nm test.o
I cannot find the global varible `test' in symbol table:


This was an intentional change as part of the overhaul of C++ visibility
semantics in 4.2.  The motivation for this aspect of the change comes
about from the realization that anonymous namespaces are implemented by
adding a randomly-generated string to the mangled name so that they're
guaranteed to be unique to their translation unit.  So it would be
impossible or at least extremely cumbersome to actually refer to such a
symbol from another module, and thus giving them hidden visibility just
cuts down on useless indirection and overhead.

http://gcc.gnu.org/gcc-4.2/changes.html
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21581
http://gcc.gnu.org/ml/gcc-patches/2006-06/msg01511.html
http://gcc.gnu.org/ml/gcc-patches/2006-03/msg01322.html

Brian



Re: Templates + Inheritance problem

2007-07-19 Thread Ling-hua Tseng
It's not a bug.
When you are using a class template, a `total template specialization' may 
be written by someone.
Since the C++ compiler expect anything, you should use one of the 
following 3 solutions:
  1. use `this->Baz()' instead of `Baz()'.
  2. write `using Foo<T>::Baz;' in the derived class template.
  3. use `Foo<T>::Baz()' instead of `Baz()'.
(Here `T' stands for the template parameter of the derived class template.)

It's described in Item 43 of the book `Effective C++ 3/e'.
(Its title is "Know how to access names in templatized base classes".)
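
The three solutions can be sketched as follows. This is only an illustration:
the parameter name `T', the three derived class names, the `int' return value
of `Baz()' (the code in the question returns void) and the helper
`check_solutions()' are assumptions, chosen so that the result can be checked.

```cpp
#include <cassert>

template <typename T>
class Foo {
public:
    int Baz() const { return 1; }
};

// Solution 1: `this->' makes the name dependent, so lookup is postponed
// until the template is instantiated and the base class is known.
template <typename T>
class BarThis : public Foo<T> {
public:
    int Bum() const { return this->Baz(); }
};

// Solution 2: a using-declaration brings the base member into scope.
template <typename T>
class BarUsing : public Foo<T> {
public:
    using Foo<T>::Baz;
    int Bum() const { return Baz(); }
};

// Solution 3: qualify the call with the dependent base class.
template <typename T>
class BarQualified : public Foo<T> {
public:
    int Bum() const { return Foo<T>::Baz(); }
};

// Small self-check: all three variants reach Foo<T>::Baz().
void check_solutions() {
    assert(BarThis<int>().Bum() == 1);
    assert(BarUsing<int>().Bum() == 1);
    assert(BarQualified<int>().Bum() == 1);
}
```

Note that solution 3 turns off virtual dispatch if `Baz()' is virtual, so
solutions 1 and 2 are usually the safer choice.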

On Thu, 19 Jul 2007 10:26:25 -0500, John Gateley wrote
> Hi, I've found strange behavior, possibly a bug, but more likely a
> problem with my understanding. Here's the code:
> 
> template class Foo {
> public:
>   void Baz() {}
> };
> template class Bar : public Foo {
> public:
>   void Bum() { Baz(); }
> };
> 
> (I know it doesn't make much sense, I reduced the problem
> down to simplest terms). I get an error compiling:
> 
> [EMAIL PROTECTED]:~/tmp$ g++ -c Foo.C 
> Foo.C: In member function 'void Bar::Bum()':
> Foo.C:11: error: there are no arguments to 'Baz' that depend on a 
> template parameter, so a declaration of 'Baz' must be available 
> Foo.C:11: error: (if you use '-fpermissive', G++ will accept your 
> code, but allowing the use of an undeclared name is deprecated)
> 
> Can someone tell me why Baz is undeclared here?
> 
> version:
> [EMAIL PROTECTED]:~/tmp$ g++ -v
> Using built-in specs.
> Target: i486-linux-gnu
> Configured with: ../src/configure -v --enable-languages=c,c++,
> fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-
> system-zlib --libexecdir=/usr/lib --without-included-gettext --
> enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-
> __cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-
> mpfr --enable-checking=release i486-linux-gnu Thread model: posix 
> gcc version 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)
> 
> Thanks,
> 
> j
> 
> -- 
> John Gateley <[EMAIL PROTECTED]>



Re: Templates + Inheritance problem

2007-07-19 Thread Ling-hua Tseng
On Thu, 19 Jul 2007 23:38:50 +0800, Ling-hua Tseng wrote
> It's not a bug.
> When you are using a class template, a `total template 
> specialization' may be written by someone. Since the C++ compiler 
> expect anything, you should use one of the following 3 solutions:
  ^^^ cannot expect anything
It's a typo.
Sorry.
>   1. use `this->Baz()' instead of `Baz()'.
>   2. write `using Foo<T>::Baz;' in the derived class template.
>   3. use `Foo<T>::Baz()' instead of `Baz()'.
> 
> It's described in item 43 of this book `Effective C++ 3/e'.
> 
> (Its title is "Know how to access names in templatized base classes".)



Re: Overload resolution compilation error

2007-07-19 Thread Ling-hua Tseng
On Thu, 19 Jul 2007 12:59:09 -0300, Rodolfo Schulz de Lima wrote
> Hi, the code below doesn't compile with gcc-4.2, with the following error:
> 
> test.cpp: In function ‘int main()’:
> test.cpp:19: error: no matching function for call to 
> ‘call()’
> 
> It compiles and runs fine with Visual Studio 2005. I think the 
> compiler should see that if I'm calling the non-templated 'print' 
> function and there's no other non-templated 'print' overload, it 
> should use 'void print()', with no ambiguity.
> 
Since the sub-expression `&print' is not a call expression,
the overload resolution mechanism will not select the non-template version 
first.
And the function templates can be instantiated without <>,
so the C++ compiler is confused.

This problem can be reduced to:
==
void foo() { }
template void foo() { }

int main()
{
&foo;
}
==

It may help you to understand what happened.



Re: Overload resolution compilation error

2007-07-19 Thread Ling-hua Tseng
On Thu, 19 Jul 2007 14:45:31 -0300, Rodolfo Schulz de Lima wrote
> &print is not a call expression the same way &print<5> isn't, but 
> the latter is resolved correctly.
It's because you have specified it explicitly.
> I cannot see how a template function can be instantiated without <>, 
> since its instantiation needs the template parameters. That's why I 
> think that the compiler shouldn't even consider this situation in 
> overload resolution. With this said, in your exemple the only 
> overload for '&foo' should be 'void foo()'.
The function template `std::make_pair<>()' is an example.
You can call it directly without <>.
Since &print is not a call expression, C++ compilers cannot resolve it from 
function arguments.

This problem can also be reduced to this one:
==
void foo() { }
void foo(int) { }

int main()
{
&foo;
}
==

It's the same problem.
The instantiations of a function template can coexist with a non-template 
function, and C++ compilers treat them as overloaded functions.
In fact, the overload resolution mechanism is never performed here since 
it's not a call expression.
You can only specify it explicitly.


Re: Overload resolution compilation error

2007-07-19 Thread Ling-hua Tseng
On Thu, 19 Jul 2007 19:25:38 -0300, Rodolfo Lima wrote
> If I understand this correctly, when we have the following declarations:
> 
> template  void foo() {}
> void foo() {}
> 
> The overload set for "&foo" at first contains all "void foo()"  
> and "void foo()". Then, because of the presence of the latter, the 
> former should be eliminated. In the end, only "void foo()" remains,
>  and we have no ambiguity.
You forgot the premise in paragraph 1:
"The function
selected is the one whose type matches the target type required in the 
context. The target can be
— an object or reference being initialized (8.5, 8.5.3),
— the left side of an assignment (5.17),
— a parameter of a function (5.2.2),
— a parameter of a user-defined operator (13.5),
— the return value of a function, operator function, or conversion (6.6.3),
— an explicit type conversion (5.2.3, 5.2.9, 5.4), or
— a non-type template-parameter (14.3.2)."

What is the `target' in your program?
The answer is NOTHING.
So the set of overloaded functions is empty from the beginning.

The following case compiles with g++:
  void (*fptr)() = &print;
That's because it has a `target'.
Your reasoning applies to this case, but not to your example 
code.
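
The case with a `target' can be sketched like this. It is only an
illustration: the functions return `int' instead of printing, and the names
`plain', `five' and `check_targets()' are assumptions added so that the
selected overload can be checked.

```cpp
#include <cassert>

// Return values replace the printing of the original example so that the
// chosen overload can be observed; this is an assumption of the sketch.
int print() { return 0; }                   // non-template overload
template <int i> int print() { return i; }  // function template

// The pointer being initialized supplies the target type `int (*)()'.
// `i' cannot be deduced from that type, so only the non-template fits.
int (*plain)() = &print;

// An explicit template argument names one specialization directly.
int (*five)() = &print<5>;

// Small self-check of which function each pointer refers to.
void check_targets() {
    assert(plain() == 0);
    assert(five() == 5);
}
```

In both initializations the type of the pointer is known before the
overloaded name is examined, which is what a bare `&print' passed to a
deduced parameter lacks.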


Re: Overload resolution compilation error

2007-07-19 Thread Ling-hua Tseng
On Thu, 19 Jul 2007 21:19:09 -0300, Rodolfo Lima wrote
> In my first example, the target type is the type of the address 
> expression, 
It cannot be treated as the target in paragraph 1 of section 13.4 (ISO/IEC 
14882:2003).
Again, here is the list of possible targets:
1. an object or reference being initialized (8.5, 8.5.3),
2. the left side of an assignment (5.17),
3. a parameter of a function (5.2.2),
4. a parameter of a user-defined operator (13.5),
5. the return value of a function, operator function, or conversion (6.6.3),
6. an explicit type conversion (5.2.3, 5.2.9, 5.4), or
7. a non-type template-parameter (14.3.2).

Obviously, {1, 2, 4, 5, 6, 7} are not matched.
Maybe you think that item 3 is matched.
Unfortunately, it refers to non-template functions.

BTW, here are two important sentences after the 7 items:
"The overloaded function name can be preceded by the & operator.
An overloaded function name shall not be used without arguments in contexts 
other than those listed."

Here is your original example code:
==
#include <iostream>

using namespace std;

void print() { cout << "null" << endl; }
template <int i> void print() { cout << i << endl; }
template <int i, int j> void print() { cout << i << ' ' << j << endl; }

template <typename F> void call(F f)
{
    f();
}

int main()
{
    //  proper way (according to g++) to call non-templated print
    //  call(static_cast<void (*)()>(&print));

    call(&print);
    call(&print<5>);
    call(&print<7,6>);
    return 0;
}
==

If you want to match item 3, you have to replace the definition of call<>() 
with a non-template function:
==
void call(void (*f)())
{
f();
}
==
Then g++ accepts it.

The 2nd line of main(), which you commented out, matches item 6.
Hence g++ can also compile it.
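
Both ways of supplying a target can be sketched side by side. This is only an
illustration: the functions return `int' instead of printing, and
`call_fixed' and `check_calls()' are hypothetical names introduced here.

```cpp
#include <cassert>

// Return values replace the printing of the original example so that the
// chosen overload can be observed; this is an assumption of the sketch.
int print() { return 0; }                   // non-template overload
template <int i> int print() { return i; }  // function template

// Item 3: the parameter has a fixed function pointer type, which acts as
// the target for the overloaded name passed as the argument.
int call_fixed(int (*f)()) { return f(); }

// Template version: F has to be deduced from the argument, so the caller
// must pick the overload first, for example with a cast (item 6).
template <typename F> int call(F f) { return f(); }

// Small self-check of the three calls that have a usable target.
void check_calls() {
    assert(call_fixed(&print) == 0);
    assert(call(static_cast<int (*)()>(&print)) == 0);
    assert(call(&print<5>) == 5);
}
```

The call `call(&print)' is deliberately left out, because it is the one
under discussion that g++ 4.2 rejects.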

Again, the set of overloaded functions never contains anything in your 
original example,
because you don't have any target that matches any item of the 
list in paragraph 1 of section 13.4.
Maybe you think that it contained the non-template one at the beginning.
No, that one is not in the set either.



Re: Overload resolution compilation error

2007-07-20 Thread Ling-hua Tseng
Rodolfo Schulz de Lima wrote
> Ling-hua Tseng escreveu:
> > Obviously, {1, 2, 4, 5, 6, 7} are not matched.
> > Maybe you think that the item 3 is matched.
> > Unfortunately, it stands for the non-template functions.
> 
> Are you sure that it doesn't include template functions? Because I think 
> it makes sense to consider them too (as Visual Studio does). The point 
> is that non template functions arguments have higher priority than 
> template functions (as specified in paragraph 4), and IMO there's no 
> reason to differentiate between template and non-template functions' 
> argument target, making the latter work and the former not.
Even if they are accepted, we still have a problem.
The target type is `F',
and it cannot be deduced by the template argument deduction mechanism,
since this mechanism needs to know the type of `&print' for the deduction.
Meanwhile, the overload resolution mechanism needs to know what `F' is.

Of course, the C++ standard doesn't allow this infinite loop.
Template argument deduction is done first (and may succeed or 
fail).
This is described in paragraph 2 of section 13.4 and in section 14.8.2.2.

Template argument deduction fails in GCC,
since I cannot find the following text in .cxx.003t.original
after removing the two lines `call(&print<5>);' and `call(&print<7,6>);'
from your original example:
from your original example:
==
  ;; Function void call(F) [with F = void (*)()] (_Z4callIPFvvEEvT_)
  ;; enabled by -tree-original

  <>>
  >>;
==

Since the target type cannot be deduced by the template argument deduction 
mechanism,
the overload resolution mechanism is not able to select any function.
That's why I said the overload set is empty.

However, I have tested your example with the Comeau C++ compiler in strict C++03 mode.
It compiled without any problems,
so I guess that perhaps you're right.

I'll stop discussing the topic after this reply,
since it will move to comp.std.c++ once the moderator approves it.
I hope that we will be able to get a good answer there.