Extracting Function Pointer Information??

2009-03-03 Thread Seema Ravandale
Hi.

Given a function pointer in GIMPLE IR, is there any way to find the
address/offset of the function it points to?

e.g. I have written the following code:
/** C code **/
void foo()
{
 . . . .
}

void (*fptr)(void) = foo;

int main()
{
  . . . . .
}

The GIMPLE tree node for fptr would be:  VAR_DECL --*--> POINTER_TYPE
--*--> FUNCTION_TYPE --*-->
(the star indicates dereferencing a few more fields inside a tree_node).
In the GIMPLE code I won't see any assignment statement for fptr = foo,
as it is initialized globally.

I was trying to trace the GIMPLE data structures for the place where the
information that fptr points to foo would be stored.

Is there any way to find it out?

- Seema Ravandale

Note: I am working on GCC-4.3.0


Re: Extracting Function Pointer Information??

2009-03-03 Thread Andrew Haley
Seema Ravandale wrote:
> Hi.
> 
> Given a function pointer in GIMPLE IR, is there any way to find the
> address/offset of the function it points to?
> 
> e.g. I have written the following code:
> /** C code **/
> void foo()
> {
>  . . . .
> }
> 
> void (*fptr)(void) = foo;
> 
> int main()
> {
>   . . . . .
> }
> 
> The GIMPLE tree node for fptr would be:  VAR_DECL --*--> POINTER_TYPE
> --*--> FUNCTION_TYPE --*-->
> (the star indicates dereferencing a few more fields inside a tree_node).
> In the GIMPLE code I won't see any assignment statement for fptr = foo,
> as it is initialized globally.
> 
> I was trying to trace the GIMPLE data structures for the place where the
> information that fptr points to foo would be stored.

Isn't it in the DECL_INITIAL ?
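
For a statically initialized global like this one, something along these
lines should work (an illustrative sketch only, not from the original mail;
the helper name is made up, while DECL_INITIAL, ADDR_EXPR and friends are
the standard tree accessors):

/* Sketch: the static initializer of a global VAR_DECL is reachable via
   DECL_INITIAL, typically an ADDR_EXPR wrapping the FUNCTION_DECL.  */
static tree
pointed_to_function (tree var_decl)
{
  tree init = DECL_INITIAL (var_decl);        /* static initializer, if any */
  if (init && TREE_CODE (init) == ADDR_EXPR)  /* i.e. &foo */
    {
      tree fn = TREE_OPERAND (init, 0);
      if (TREE_CODE (fn) == FUNCTION_DECL)
        return fn;                            /* the FUNCTION_DECL for foo */
    }
  return NULL_TREE;
}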

Andrew.


define_peephole2 insn

2009-03-03 Thread M R Swami Reddy

Hello,

I have ported gcc to a 16-bit target. The problem now is that gcc generates
wrong code at -O1 and above for move and load/store instructions, by using
32-bit registers with 16-bit instructions. For example:
===
move r13, r1 // move bits 0-15 to register r1
move r13, r0 // intended to move bits 16-31 to register r0, but this move
             // only copies bits 0-15 to r0, which is NOT INTENDED.

To solve the above issue, can I use the "define_peephole2" insn pattern?

[like: movd r13, (r1,r0) // move bits 0-31 of r13 to the register pair r1,r0.
Note: r1 and r0 are 16-bit registers]

Please advise.

Any comments or suggestions are most welcome.

Thanks in advance.

Thanks
Swami



Matrix multiplication: performance drop

2009-03-03 Thread Yury Serdyuk

Hi !

I have a simple matrix multiplication code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main ( int argc, char *argv[] )
{
int   i, j, k;
clock_t t1, t2;
double elapsed_time;

int N = 10;

float *a, *b, *c;

if ( argc > 1 )
 N = atoi ( argv [ 1 ] );

a = (float *) malloc ( N * N * sizeof ( float ) );
b = (float *) malloc ( N * N * sizeof ( float ) );
c = (float *) malloc ( N * N * sizeof ( float ) );

for ( i = 0; i < N; i++ )
 for ( j = 0; j < N; j++ ) {
  a [ i * N + j ] = 0.99;
  b [ i * N + j ] = 0.33;
  c [ i * N + j ] = 0.0;
 }

t1 = clock();

for ( i = 0; i < N; i++ )
 for ( j = 0; j < N; j++ )
  for ( k = 0; k < N; k++ )
   c [ i * N + j ] += a [ i * N + k ] * b [ k * N + j ];

t2 = clock();

elapsed_time = ( (double) ( t2 - t1 ) ) / CLOCKS_PER_SEC;
printf ( "%f\n", c [ (N - 1) * N + ( N - 1 ) ] );

printf ( "N = %d Elapsed time = %lf\n", N, elapsed_time );

}

I compile it as
> gcc -O3 ...
and then run it for different N (Intel Xeon, 3.00 GHz, 8 GB memory).
Here are the results:
   N       Time (secs.)
   --------------------
   500      0.25
   512      0.86
   520      0.29
   1000     4.46
   1024     12.5
   1100     6.48
   1500     20.42
   1536     30.43
   1600     21.04
   2000     46.75
   2048     446.61 ( !!! )
   2100     59.80

So for N a multiple of 512 there is a very sharp drop in performance.
The question is: why, and how can it be avoided?

In fact, the effect is present on several platforms (Intel Xeon,
Pentium, AMD Athlon, IBM PowerPC) and for gcc 4.1.2 and 4.3.0.
Moreover, the effect is also present with the Intel icc compiler,
but only up to -O2.  With -O3, performance is smooth.
Turning on -ftree-vectorize does nothing:

$ gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=5 -o mm_float2 
mm_float2.c


mm_float2.c:25: note: not vectorized: number of iterations cannot be 
computed.
mm_float2.c:35: note: not vectorized: number of iterations cannot be 
computed.

mm_float2.c:35: note: vectorized 0 loops in function.



Please, help.

Yury


Re: define_peephole2 insn

2009-03-03 Thread Joern Rennecke

> To solve the above issue, can I use the "define_peephole2" insn pattern?


No.  At most you could abuse it to hide the issue some of the time.
You probably have one or more of your target macros / hooks wrong,
e.g. HARD_REGNO_NREGS.
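
For reference, the usual word-based definition of that macro for a target
with 16-bit registers looks roughly like this (illustrative only, not taken
from this port):

/* Report how many hard registers a value of mode MODE occupies: with
   16-bit registers (UNITS_PER_WORD == 2), an SImode value needs two
   consecutive registers, so the compiler knows a single 16-bit move
   cannot copy it.  */
#define HARD_REGNO_NREGS(REGNO, MODE) \
  ((GET_MODE_SIZE (MODE) + UNITS_PER_WORD - 1) / UNITS_PER_WORD)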


> Any comments or suggestions are most welcome.


read and understand all the documentation on porting gcc in the doc
directory.  Or work with someone who has.


Re: Matrix multiplication: performance drop

2009-03-03 Thread Michael Meissner
On Tue, Mar 03, 2009 at 03:44:35PM +0300, Yury Serdyuk wrote:
> So for N multiple of 512 there is very strong drop of performance.
> The question is - why and how to avoid it ?
> 
> In fact, given effect is present for any platforms ( Intel Xeon, 
> Pentium, AMD Athlon, IBM Power PC)
> and for gcc 4.1.2, 4.3.0.
> Moreover, that effect is present for Intel icc compiler also,
> but only till to -O2 option. For -O3, there is good smooth performance.
> Trying to turn on -ftree-vectorize do nothing:

Basically at higher N's you are thrashing the cache.  Computers tend to have
prefetching for sequential access, but accessing one matrix on a column basis
does not fit into that prefetching.  Since caches are a fixed size, sooner or
later the whole matrix will not fit in the cache.  If you have to go out to
main memory, it can cause the processor to take a long time as it waits for
the memory to be fetched.  It may be that the Intel compiler has better
support for handling matrix multiply.

The usual way is to recode your multiply so that it is more cache friendly.
This is an active research topic, so google or another search engine is
your friend.  For instance, this was one of the first links I found when
looking for 'matrix multiple cache':
http://www.cs.umd.edu/class/fall2001/cmsc411/proj01/cache/index.html
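
One standard cache-friendly rewrite is loop blocking (tiling), so that one
tile of each matrix stays resident in cache while it is reused; when N is a
power of two the column walk through b also keeps hitting the same cache
sets, which is why those sizes are so much worse.  A rough sketch, not from
this thread (c is assumed zero-initialized as in the original program, and
the tile size 64 is only a guess to be tuned):

void matmul_blocked (int N, const float *a, const float *b, float *c)
{
  int i, j, k, ii, jj, kk;
  int bs = 64;  /* illustrative tile size; tune for your cache */

  for (ii = 0; ii < N; ii += bs)
    for (kk = 0; kk < N; kk += bs)
      for (jj = 0; jj < N; jj += bs)
        /* Multiply one bs x bs tile of a by one tile of b,
           accumulating into the corresponding tile of c.  */
        for (i = ii; i < ii + bs && i < N; i++)
          for (k = kk; k < kk + bs && k < N; k++) {
            float aik = a[i * N + k];
            for (j = jj; j < jj + bs && j < N; j++)
              c[i * N + j] += aik * b[k * N + j];
          }
}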

-- 
Michael Meissner, IBM
4 Technology Place Drive, MS 2203A, Westford, MA, 01886, USA
meiss...@linux.vnet.ibm.com


Re: Matrix multiplication: performance drop

2009-03-03 Thread Joern Rennecke

Gcc usage questions should go to the gcc-help mailing list.

Questions about computer architecture should go to another forum,
like comp.arch, but you should check first if they explain
caches, DRAM pages, and memory hierarchies in general in an FAQ.

If you had a constructive proposal how to improve loop
optimizations, that would be on-topic for this list.


Re: query automaton

2009-03-03 Thread Vladimir Makarov

Alex Turjan wrote:
> Dear Vladimir,
>
>> Not really.  There is no requirement for "the units
>> part of the alternatives of a reservation must belong to the
>> same automaton".  Querying should also work in this
>> case because function cpu_unit_reservation_p checks all
>> automata for a unit reservation.
>
> Indeed it checks all automata, but I'm afraid that according to my pipeline
> description this check is not enough to guarantee a correct scheduling
> decision.  E.g., suppose the following reservation:
>
> (define_reservation "move"
>   "(unit1_aut1, unit1_aut2) | (unit2_aut1, unit2_aut2)")        (*)
>
> where unitN_autM refers to unit N of automaton M; in this case there are
> two automata.
> Now suppose a scheduling state S made of the individual states of the two
> automata, S = <S_aut1, S_aut2>.  According to what I see happening in
> insn-automata.c (and target.dfa), from S_aut1 there is a transition for
> unit1_aut1 and from S_aut2 there is a transition for unit2_aut2.

> It seems that the automata do not communicate with each other.  As a
> consequence, a scheduling decision which results in the resource
> reservation (unit1_aut1, unit2_aut2) would not be rejected, while it
> should be.
>
> In my opinion, the current implementation treats the reservation defined
> in (*) as equivalent to the following one:
>
> (define_reservation "move"
>   "(unit1_aut1 | unit2_aut1), (unit1_aut2 | unit2_aut2)")
>
> which does not seem true to me.
>
> Is there a way for the automata to communicate so that the alternative
> (unit1_aut1, unit2_aut2) would be rejected?

For the last two days I've been working on this issue.  I found that you
are right: genautomata permits generating incorrect automata, although I
did not find this to be the case for the major current descriptions which
I checked.

I found the issue is complicated.  Genautomata already has a check for
correct automata generation in check_regexp_units_distribution, but that
is not enough.  I am working on a formulation of general rules for correct
automata generation and an implementation of their check.  I think it will
take a week or two to do this and to check how it works for all current
automata descriptions.

Alex, thanks for pointing out this issue to me.



Why are these two functions compiled differently?

2009-03-03 Thread Bingfeng Mei
Hello,
I came across the following example and its .final_cleanup dump. To me, both
functions should produce the same code, but tst1 actually requires two extra
sign_extend instructions compared with tst2. Is this a C semantics thing, or
does GCC mis-compile (over-conservatively) the first case?

Cheers,
Bingfeng Mei
Broadcom UK

 
#define A  255

int tst1(short a, short b){
  if(a > (b - A))
return 0;
  else
return 1;  

}


int tst2(short a, short b){
  short c = b - A;
  if(a > c)
return 0;
  else
return 1;  

}


.final_cleanup
;; Function tst1 (tst1)

tst1 (short int a, short int b)
{
<bb 2>:
  return (int) b + -254 > (int) a;

}



;; Function tst2 (tst2)

tst2 (short int a, short int b)
{
<bb 2>:
  return (short int) ((short unsigned int) b + 65281) >= a;

}





Re: Why are these two functions compiled differently?

2009-03-03 Thread Richard Guenther
On Tue, Mar 3, 2009 at 4:06 PM, Bingfeng Mei  wrote:
> Hello,
> I came across the following example and its .final_cleanup dump. To me,
> both functions should produce the same code, but tst1 actually requires
> two extra sign_extend instructions compared with tst2. Is this a C
> semantics thing, or does GCC mis-compile (over-conservatively) the first case?

Both transformations are already done by the frontend (or fold); likely
shorten_compare is guilty for tst1 and fold_unary for tst2 (which
folds (short)((int)b - (int)A)).

Richard.

> Cheers,
> Bingfeng Mei
> Broadcom UK
>
>
> #define A  255
>
> int tst1(short a, short b){
>  if(a > (b - A))
>    return 0;
>  else
>    return 1;
>
> }
>
>
> int tst2(short a, short b){
>  short c = b - A;
>  if(a > c)
>    return 0;
>  else
>    return 1;
>
> }
>
>
> .final_cleanup
> ;; Function tst1 (tst1)
>
> tst1 (short int a, short int b)
> {
> <bb 2>:
>  return (int) b + -254 > (int) a;
>
> }
>
>
>
> ;; Function tst2 (tst2)
>
> tst2 (short int a, short int b)
> {
> <bb 2>:
>  return (short int) ((short unsigned int) b + 65281) >= a;
>
> }
>
>
>
>


RE: Why are these two functions compiled differently?

2009-03-03 Thread Bingfeng Mei
Should I file a bug report? If it is not a C semantics thing, GCC certainly 
produces unnecessarily big code. 

.file   "tst.c"
.text
.p2align 4,,15
.globl tst1
.type   tst1, @function
tst1:
.LFB0:
.cfi_startproc
movswl  %si, %esi
movswl  %di, %edi
xorl    %eax, %eax
subl    $254, %esi
cmpl    %edi, %esi
setg    %al
ret
.cfi_endproc
.LFE0:
.size   tst1, .-tst1
.p2align 4,,15
.globl tst2
.type   tst2, @function
tst2:
.LFB1:
.cfi_startproc
subw    $255, %si
xorl    %eax, %eax
cmpw    %di, %si
setge   %al
ret
.cfi_endproc
.LFE1:
.size   tst2, .-tst2
.ident  "GCC: (GNU) 4.4.0 20090218 (experimental) [trunk revision 143368]"
.section        .note.GNU-stack,"",@progbits

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com] 
> Sent: 03 March 2009 15:16
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org; John Redford
> Subject: Re: Why are these two functions compiled differently?
> 
> On Tue, Mar 3, 2009 at 4:06 PM, Bingfeng Mei 
>  wrote:
> > Hello,
> > I came across the following example and its .final_cleanup dump. To me,
> > both functions should produce the same code, but tst1 actually requires
> > two extra sign_extend instructions compared with tst2. Is this a C
> > semantics thing, or does GCC mis-compile (over-conservatively) the
> > first case?
> 
> Both transformations are already done by the frontend (or fold); likely
> shorten_compare is guilty for tst1 and fold_unary for tst2 (which
> folds (short)((int)b - (int)A)).
> 
> Richard.
> 
> > Cheers,
> > Bingfeng Mei
> > Broadcom UK
> >
> >
> > #define A  255
> >
> > int tst1(short a, short b){
> >  if(a > (b - A))
> >    return 0;
> >  else
> >    return 1;
> >
> > }
> >
> >
> > int tst2(short a, short b){
> >  short c = b - A;
> >  if(a > c)
> >    return 0;
> >  else
> >    return 1;
> >
> > }
> >
> >
> > .final_cleanup
> > ;; Function tst1 (tst1)
> >
> > tst1 (short int a, short int b)
> > {
> > <bb 2>:
> >  return (int) b + -254 > (int) a;
> >
> > }
> >
> >
> >
> > ;; Function tst2 (tst2)
> >
> > tst2 (short int a, short int b)
> > {
> > <bb 2>:
> >  return (short int) ((short unsigned int) b + 65281) >= a;
> >
> > }
> >
> >
> >
> >
> 
> 


Re: Why are these two functions compiled differently?

2009-03-03 Thread Ian Lance Taylor
"Bingfeng Mei"  writes:

> #define A  255
>
> int tst1(short a, short b){
>   if(a > (b - A))
> return 0;
>   else
> return 1;  
>
> }
>
>
> int tst2(short a, short b){
>   short c = b - A;
>   if(a > c)
> return 0;
>   else
> return 1;  
>
> }

These computations are different.  Assume short is 16 bits and int is 32
bits.  Consider the case of b == 0x8000.  The tst1 function will sign
extend that to 0xffff8000, subtract 255, sign extend a to int, and
compare the sign-extended a with 0xffff7f01.  The tst2 function will
compute 0x8000 - 255 as a 16 bit value, getting 0x7f01.  It will sign
extend that to 0x00007f01, and compare the sign-extended a with 0x00007f01.

It may seem that there is an undefined signed overflow when using this
value, but there actually isn't.  All computations in C are done in type
int.  So the computations have no overflow.  The truncation from int to
short is not undefined, it is implementation defined.
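
A small self-contained check of that difference (illustrative only, not
part of the original mail; it assumes 16-bit short, 32-bit int and the
usual modular truncation to short):

#include <stdio.h>

#define A 255

int tst1 (short a, short b) { if (a > (b - A)) return 0; else return 1; }
int tst2 (short a, short b) { short c = b - A; if (a > c) return 0; else return 1; }

int main (void)
{
  short a = 0, b = -32768;   /* b is 0x8000 as a 16-bit short */
  /* tst1 compares 0 with -33023 as ints, so a > (b - A) holds and it
     returns 0; tst2 first truncates -33023 to the short 32513, so
     a > c fails and it returns 1.  */
  printf ("tst1 = %d, tst2 = %d\n", tst1 (a, b), tst2 (a, b));
  return 0;
}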

Ian


Re: Constant folding and Constant propagation

2009-03-03 Thread Adam Nemet
Adam Nemet  writes:
> I am actually looking at something similar for PR33699 for MIPS.  My plan is
> to experiment extending cse.c with putting "anchor" constants to the available
> expressions along with the original constant and then querying those later for
> constant expressions.

See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00161.html.

Adam


Re: load large immediate

2009-03-03 Thread daniel tian
2009/3/2 daniel tian :
> 2009/2/27 daniel tian :
>> 2009/2/27 Dave Korn :
>>> daniel tian wrote:
>>>
>>>> That seems to be solving an addressing mode problem. My problem is that
>>>> when loading a large immediate or a SYMBOL_REF, the destination is a
>>>> specific general register (register 0: R0). So I don't know how to make
>>>> the define_expand "movsi" pattern generate the destination register in R0.
>>>
>>>  Well, the RTL that you emit in your define_expand has to match an insn
>>> pattern in the end, so you could make an insn for it that uses a predicate 
>>> and
>>> matching constraint to enforce only accepting r0.  If you use a predicate 
>>> that
>>> only accepts r0 you'll get better codegen than if you use a predicate that
>>> accepts general regs and use an r0-only constraint to instruct reload to 
>>> place
>>> the operand in r0.
>>
>> Well, I have already done this. There is insn pattern that the
>> predicate limits the operand in R0. But if in define_expand "movsi" do
>> not put the register in R0, the compiler will  crashed because of the
>> unrecognized RTL(load big immediate or Symbol). Like the below:
>> (define_insn "load_imm_big"
>>        [(set (match_operand:SI 0 "zero_register_operand" "=r")
>>              (match_operand:SI 1 "rice_imm32_operand" "i"))
>>              (clobber (reg:SI 0))]
>>        "TARGET_RICE"
>>        {
>>                return rice_output_move (operands, SImode);
>>        }
>> )
>>
>> PS:rice_output_move  is function to output assemble code.
>>
>> Thanks.
>>
>
> Hello, Dave, Rennecke :
>      I defined the PREFERRED_RELOAD_CLASS macro, but this cc1 doesn't
> go through it.
>      #define PREFERRED_RELOAD_CLASS(X, CLASS)
> rice_preferred_reload_class(X, CLASS)
>
>      And rice_preferred_reload_class(X, CLASS) is:
>
>      enum reg_class rice_preferred_reload_class (rtx x, enum reg_class class)
>     {
>        printf("Come to rice_preferred_reload_class! \n");
>
>        if((GET_CODE(x) == SYMBOL_REF) ||
>           (GET_CODE(x) == LABEL_REF) ||
>           (GET_CODE(x) == CONST) ||
>           ((GET_CODE(x) == CONST_INT) && (!Bit10_constant_operand(x, GET_MODE(x)))))
>        {
>                return R0_REG;
>        }
>        return class;
>      }
>
>     I run the cc1 file with command "./cc1 test_reload.c  -Os". But
> gcc never goes through this function. Did I miss something?
>     Thank you very much.
>

Hello, I resolved the problem. The keys are the macros
LEGITIMIZE_RELOAD_ADDRESS, GO_IF_LEGITIMATE_ADDRESS and
PREFERRED_RELOAD_CLASS, plus the predicate "zero_register_operand".
Here is what I learnt; if I got something wrong, let me know. :)

LEGITIMIZE_RELOAD_ADDRESS: in this macro I should push the invalid
rtx (large immediates, SYMBOL_REFs, LABEL_REFs) into the function
push_reload, which will consult the macro PREFERRED_RELOAD_CLASS.

GO_IF_LEGITIMATE_ADDRESS: this macro has two modes, strict and non-strict;
the former is used during reload, the latter before reload. I made a
mistake here and defined the two modes in exactly the same way (I treated
large immediates, SYMBOL_REF, LABEL_REF and CONST as valid addresses in
both modes). So LEGITIMIZE_RELOAD_ADDRESS was never called: the function
find_reloads_address in reload.c calls GO_IF_LEGITIMATE_ADDRESS first, and
if the rtx is already a valid address it never reaches
LEGITIMIZE_RELOAD_ADDRESS.

PREFERRED_RELOAD_CLASS: this macro is used in the function push_reload
mentioned above.

As for the predicate "zero_register_operand": I originally accepted only
an rtx whose register number is 0, so cc1 would always abort with an
"unrecognizable insn" error, because before reload the operand is still
a pseudo register. I revised the predicate so that the register number
may be 0 or a pseudo.
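
A rough sketch of what the revised predicate body might look like
(illustrative only; the function name is made up and the real port's
details will differ):

/* Accept hard register 0, or any pseudo that register allocation /
   reload can later replace with r0.  REG_P, REGNO and
   FIRST_PSEUDO_REGISTER are the standard rtl accessors.  */
static bool
zero_or_pseudo_reg_p (rtx op)
{
  return REG_P (op)
         && (REGNO (op) == 0 || REGNO (op) >= FIRST_PSEUDO_REGISTER);
}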

Now it is OK, though the generated code still has some redundancy. I will keep going.
You guys are so kind. Thank you very much.
Thanks again for your help!

Best Regards!


  Daniel.Tian


Re: define_peephole2 insn

2009-03-03 Thread M R Swami Reddy

>> To solve the above issue, can I use the "define_peephole2" insn pattern?


> No.  At most you could abuse it to hide the issue some of the time.
> You probably have one or more of your target macros / hooks wrong,
> e.g. HARD_REGNO_NREGS.


Thank you very much for your reply. In my case, code generation is correct
for all test cases without optimization. Wrong code is generated (as
mentioned in my previous mail) only in a rare case with optimization and
PIC enabled (i.e. -O1 -fPIC).


Thanks
Swami




ANNOUNCEMENT: Generic Data Flow Analyzer for GCC

2009-03-03 Thread Seema Ravandale
Announcement: gdfa - Generic data flow analyzer for GCC.
Developed by: GCC resource center, IITB

The patch and the documentation can be found at the link below:

http://www.cse.iitb.ac.in/grc/gdfa.html


Ms. Seema S. Ravandale
Project Engg,
GCC Resource Center
Department of Computer Science & Engg.
IIT Bombay, Powai, Mumbai 400 076, India.
email - se...@cse.iitb.ac.in