GIV optimizations

2005-02-28 Thread Canqun Yang
Hi, all

The new loop unroller causes performance degradation 
due to the unimplemented giv (general induction 
variable) optimizations. 

When will it be implemented? 

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: GIV optimizations

2005-02-28 Thread Canqun Yang
Hi, Steven,

Many thanks for your reply.

Induction variables are variables whose successive 
values form an arithmetic progression over a loop. 
Induction variables are often divided into bivs (basic 
induction variables), which are explicitly modified by 
the same constant amount during each iteration of a 
loop, and givs (general induction variables), which are 
computed as a linear function of a basic induction 
variable. 

There are three important transformations that apply 
to them: strength reduction, induction-variable 
removal, and linear-function test replacement. For 
example, we can apply strength reduction to address 
givs, which are usually used for the address 
calculation of array elements. On platforms with 
post-increment load and store instructions, this 
creates opportunities to combine a load or store with 
the following address-calculation instruction.
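
To make the terms concrete, here is a hedged C sketch (written only for
illustration; the function names are invented and nothing below is taken
from the original mail):

/* Before strength reduction: i is the biv (stepped by the constant 1
   each iteration); the address expression &a[i], i.e. a + 8*i for
   8-byte doubles, is a giv -- a linear function of the biv.  */
double sum_before (const double *a, long n)
{
  double s = 0.0;
  for (long i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* After strength reduction the address giv becomes a pointer that is
   simply incremented by 8 each iteration; on IA-64 that increment can
   be folded into a post-increment load such as ldfd.  */
double sum_after (const double *a, long n)
{
  double s = 0.0;
  const double *p = a;
  for (long i = 0; i < n; i++)
    s += *p++;
  return s;
}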

Induction-variable splitting is also an effective 
optimization during loop unrolling.
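
A hedged sketch of what address-giv splitting means after unrolling by
two (again invented for illustration; n is assumed to be a multiple of 2):

/* Unrolled by two but with a single shared pointer: the second load
   waits for the first pointer update, so the updates form a serial
   chain through the unrolled body.  */
double sum_unrolled_shared (const double *a, long n)
{
  double s = 0.0;
  const double *p = a;
  for (long i = 0; i < n; i += 2)
    {
      s += *p; p++;
      s += *p; p++;
    }
  return s;
}

/* With the giv split into two independent pointers, each advancing by
   two elements, the loads no longer depend on each other and can issue
   in parallel on a wide machine such as IA-64.  */
double sum_unrolled_split (const double *a, long n)
{
  double s = 0.0;
  const double *p0 = a, *p1 = a + 1;
  for (long i = 0; i < n; i += 2)
    {
      s += *p0; p0 += 2;
      s += *p1; p1 += 2;
    }
  return s;
}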

The new loop optimizer only supports limited giv 
analysis (see loop-iv.c), and has not yet implemented 
giv splitting (see 'loop-unroll.c', 
'analyze_iv_to_split_insn', and the comment in that 
function: "For now we just split the basic induction 
variables. Later this may be extended for example by 
selecting also addresses of memory references.")

I tested 171.swim on an IA-64 system with a 1GHz 
Itanium 2 CPU. After implementing or improving several 
compiler optimizations, such as Fortran alias analysis 
(a very simple one), loop unrolling (the old one), loop 
array prefetching, and giv optimizations, it takes just 
9.1s (versus 28s for GCC-4.0.0) to run the train input 
of 171.swim. But applying those changes to the current 
GCC-4.0.0 (where the old loop unroller was removed), it 
takes 13.4s to run this benchmark, and I found that the 
missing giv optimizations are the major factor in this 
performance degradation.

Giv optimizations are simply features that are not yet 
implemented in the new loop unroller, so I think 
putting this in Bugzilla is not appropriate.

Steven Bosscher <[EMAIL PROTECTED]>:

> On Feb 28, 2005 02:35 PM, Canqun Yang 
<[EMAIL PROTECTED]> wrote:
>
> > Hi, all
> >
> > The new loop unroller causes performance degradation
> > due to the unimplemented giv (general induction
> > variable) optimizations.
> >
> > When will it be implemented?
>
> Will you be more specific so we can have a clue what you are
> talking about? Filing bugs in bugzilla showing the problem
> would help.
>
> Gr.
> Steven
>
> 



Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: GIV optimizations

2005-03-03 Thread Canqun Yang
Zdenek Dvorak <[EMAIL PROTECTED]>:

> Hello,
>
> > Giv optimizations are simply features that are not yet
> > implemented in the new loop unroller, so I think
> > putting this in Bugzilla is not appropriate.
>
> it most definitely is appropriate.  This is a performance
> regression.  Even if it would not be, feature requests
> can be put to Bugzilla.
>
Ok, thanks.

> The best of course would be if you could create a small testcase
> demonstrating what you would like the compiler to achieve.
>
> Zdenek
> 

I attached a testcase with two assembly-code versions: 
one generated with address-giv splitting in the loop 
unroller, the other without.


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


giv.f90
Description: Binary data


giv_no_opt.s
Description: Binary data


giv_opt.s
Description: Binary data


[rtl-optimization] Improve Data Prefetch for IA-64

2005-03-25 Thread Canqun Yang
Hi, all

Currently, GCC simply ignores all data prefetches within 
a loop when the number of prefetches exceeds 
SIMULTANEOUS_PREFETCHES. That is not advisable. 

Also, the values of the macros defined in ia64.h for 
data prefetching are too small.

This patch modifies the data prefetch algorithm 
defined in loop.c and redefines some macros in ia64.h 
accordingly. Testing shows a 2.5 percent performance 
improvement for the SPEC CFP2000 benchmarks on IA-64. 
If the new loop unroller were implemented as completely 
as the old one that was removed, much larger 
performance improvements could be expected.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
2005-03-25  Canqun Yang  <[EMAIL PROTECTED]>

* config/ia64/ia64.h (SIMULTANEOUS_PREFETCHES): Redefine as 18.
(PREFETCH_BLOCK): Redefine as 64.
(PREFETCH_BLOCKS_BEFORE_LOOP_MAX): New definition.

2005-03-25  Canqun Yang  <[EMAIL PROTECTED]>

* loop.c (PREFETCH_BLOCKS_BEFORE_LOOP_MAX): Defined conditionally.
(scan_loop): Change extra_size from 16 to 128.
(emit_prefetch_instructions): Don't ignore all prefetches within loop.

Index: loop.c
===
RCS file: /cvs/gcc/gcc/gcc/loop.c,v
retrieving revision 1.522
diff -c -3 -p -r1.522 loop.c
*** loop.c  17 Jan 2005 08:46:15 -  1.522
--- loop.c  25 Mar 2005 12:03:44 -
*** struct loop_info
*** 434,440 
--- 434,442 
  #define MAX_PREFETCHES 100
  /* The number of prefetch blocks that are beneficial to fetch at once before
 a loop with a known (and low) iteration count.  */
+ #ifndef PREFETCH_BLOCKS_BEFORE_LOOP_MAX
  #define PREFETCH_BLOCKS_BEFORE_LOOP_MAX  6
+ #endif
  /* For very tiny loops it is not worthwhile to prefetch even before the loop,
 since it is likely that the data are already in the cache.  */
  #define PREFETCH_BLOCKS_BEFORE_LOOP_MIN  2
*** scan_loop (struct loop *loop, int flags)
*** 1100,1106 
/* Allocate extra space for REGs that might be created by load_mems.
   We allocate a little extra slop as well, in the hopes that we
   won't have to reallocate the regs array.  */
!   loop_regs_scan (loop, loop_info->mems_idx + 16);
insn_count = count_insns_in_loop (loop);
  
if (loop_dump_stream)
--- 1102,1108 
/* Allocate extra space for REGs that might be created by load_mems.
   We allocate a little extra slop as well, in the hopes that we
   won't have to reallocate the regs array.  */
!   loop_regs_scan (loop, loop_info->mems_idx + 128);
insn_count = count_insns_in_loop (loop);
  
if (loop_dump_stream)
*** emit_prefetch_instructions (struct loop 
*** 4398,4406 
{
  if (loop_dump_stream)
fprintf (loop_dump_stream,
!"Prefetch: ignoring prefetches within loop: ahead is zero; 
%d < %d\n",
 SIMULTANEOUS_PREFETCHES, num_real_prefetches);
! num_real_prefetches = 0, num_real_write_prefetches = 0;
}
  }
/* We'll also use AHEAD to determine how many prefetch instructions to
--- 4400,4411 
{
  if (loop_dump_stream)
fprintf (loop_dump_stream,
!"Prefetch: ignoring some prefetches within loop: ahead is 
zero; %d < %d\n",
 SIMULTANEOUS_PREFETCHES, num_real_prefetches);
! num_real_prefetches = MIN (num_real_prefetches,
!SIMULTANEOUS_PREFETCHES);
! num_real_write_prefetches = MIN (num_real_write_prefetches,
!  SIMULTANEOUS_PREFETCHES);
}
  }
/* We'll also use AHEAD to determine how many prefetch instructions to
Index: config/ia64/ia64.h
===
RCS file: /cvs/gcc/gcc/gcc/config/ia64/ia64.h,v
retrieving revision 1.194
diff -c -3 -p -r1.194 ia64.h
*** config/ia64/ia64.h  17 Mar 2005 17:35:16 -  1.194
--- config/ia64/ia64.h  25 Mar 2005 12:05:05 -
*** do {
\
*** 1993,2004 
 ??? This number is bogus and needs to be replaced before the value is
 actually used in optimizations.  */
  
! #define SIMULTANEOUS_PREFETCHES 6
  
  /* If this architecture supports prefetch, define this to be the size of
 the cache line that is prefetched.  */
  
! #define PREFETCH_BLOCK 32
  
  #define HANDLE_SYSV_PRAGMA 1
  
--- 1993,2008 
 ??? This number is bogus and needs to be replaced before the value is
 actually used in optimizations.  */
  
! #define SIMULTANEOUS_PREFETCHES 18
  
  /* If this architecture supports prefetch, define this to be the size of
 the cache line that is prefetched.  */
  
! #define PREFETCH_BLOCK 64 
! 
! /* The number of prefetch blo

Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-03-26 Thread Canqun Yang
The last ChangeLog of rtlopt-branch was written in 
2003. After more than one year, many improvements in 
this branch haven't been put into the GCC HEAD. Why? 

Quoting Steven Bosscher <[EMAIL PROTECTED]>:

> On Saturday 26 March 2005 02:22, Canqun Yang wrote:
> >         * loop.c (PREFETCH_BLOCKS_BEFORE_LOOP_MAX): Defined conditionally.
> >         (scan_loop): Change extra_size from 16 to 128.
> >         (emit_prefetch_instructions): Don't ignore all prefetches within
> > loop.
>
> OK, so I know this is not a popular subject, but can we *please* stop
> working on loop.c and focus on getting the new RTL and tree loop passes
> to do what we want?  All this loop.c patching is a typical example of
> why free software development does not always work: always going for
> the low-hanging fruit.  In this case, there have been several attempts
> to replace the prefetching stuff in loop.c with something better.  On
> the rtl-opt branch there is a new RTL loop-prefetch.c, and on the LNO
> branch there is a re-use analysis based prefetching pass.  Why don't
> you try to finish and improve those passes, instead of making it yet
> again harder to remove loop.c.  This one file is a *huge* problem for
> just about the entire RTL optimizer path.  It is, for example, the
> reason why there is no profile information available before this old
> piece of, if I may say, junk runs, and it is the only reason why a great
> many functions in for example jump.c and the various cfg*.c files can
> still not be removed.
>
> Gr.
> Steven
>
> 



Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-03-26 Thread Canqun Yang
Quoting Steven Bosscher <[EMAIL PROTECTED]>:

> On Sunday 27 March 2005 03:53, Canqun Yang wrote:
> > The last ChangeLog of rtlopt-branch was written in
> > 2003. After more than one year, many improvements in
> > this branch haven't been put into the GCC HEAD. Why?
>
> Almost all of the rtlopt branch was merged.  Prefetching is one
> of the few things that was not, probably Zdenek knows why.
>

Another question is why the new RTL loop unroller does 
not support giv splitting. According to my tests with 
the old unroller, it is very useful. Does anyone plan 
to implement it? The writer of the new loop unroller, 
or someone familiar with that part, would implement it 
better and faster, I think.

> Gr.
> Steven
> 


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-03-27 Thread Canqun Yang
Quoting Zdenek Dvorak <[EMAIL PROTECTED]>:

> Hello,
>
> > On Sunday 27 March 2005 03:53, Canqun Yang wrote:
> > > The last ChangeLog of rtlopt-branch was written in
> > > 2003. After more than one year, many improvements in
> > > this branch haven't been put into the GCC HEAD. Why?
> >
> > Almost all of the rtlopt branch was merged.  Prefetching is one
> > of the few things that was not, probably Zdenek knows why.
>
> because I never made it work as well as the current version,
> basically.  At least no matter how much I tried, I never produced
> any benchmark numbers that would justify the change.  Then I went
> to tree-ssa (and tried to write prefetching there, and got stuck on
> the same issue, after which I simply forgot due to loads of other work
> :-( ).
>
> Zdenek
> 

It should be for a similar reason to the comments in my 
previously supplied patch: 

http://gcc.gnu.org/ml/gcc-patches/2005-03/msg02400.html

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-03-27 Thread Canqun Yang
Quoting Zdenek Dvorak <[EMAIL PROTECTED]>:

> Hello,
>
> > On Sunday 27 March 2005 04:45, Canqun Yang wrote:
> > > Another question is why the new RTL loop-unroller does
> > > not support giv splitting.
> >
> > Apparently because for most people it is not a problem that it does
> > not do it, and while you have indicated earlier that it may be useful
> > for you, you have neither tried to implement it yourself, nor provided
> > a test case to PR20376.
> >
> > FWIW you could try -fweb and see if it does what you want.  And if it
> > does, you could write a limited webizer patch that works on just loop
> > bodies after unrolling.
>
> from what I understood, a change to analyze_iv_to_split_insn that would
> detect memory references should suffice.  Also maybe this patch might be
> relevant:
>
> http://gcc.gnu.org/ml/gcc-patches/2004-10/msg01176.html
>
> Zdenek
>

I'll try this. Thanks a lot.


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


RE: SMS in gcc4.0

2005-03-31 Thread Canqun Yang
Hi, all

This patch fixes doloop_register_get defined in 
modulo-sched.c, and lets the pi-calculation program 
on IA-64 be successfully modulo scheduled. On a 1GHz 
Itanium-2, it takes just 3.128 seconds to execute when 
compiled with "-fmodulo-sched -O3" turned on, versus 
5.454 seconds without "-fmodulo-sched".


2005-03-31  Canqun Yang  <[EMAIL PROTECTED]>

* modulo-sched.c (doloop_register_get): Deal 
with if_then_else pattern.  


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


pi.f90
Description: Binary data
*** /home/ycq/mainline/gcc/gcc/modulo-sched.c   Mon Mar 21 10:49:23 2005
--- modulo-sched.c  Thu Mar 31 21:11:08 2005
*** static rtx
*** 263,269 
  doloop_register_get (rtx insn, rtx *comp)
  {
rtx pattern, cmp, inc, reg, condition;
! 
if (!JUMP_P (insn))
  return NULL_RTX;
pattern = PATTERN (insn);
--- 263,270 
  doloop_register_get (rtx insn, rtx *comp)
  {
rtx pattern, cmp, inc, reg, condition;
!   rtx src;
!   
if (!JUMP_P (insn))
  return NULL_RTX;
pattern = PATTERN (insn);
*** doloop_register_get (rtx insn, rtx *comp
*** 293,303 
  
/* Extract loop counter register.  */
reg = SET_DEST (inc);
  
/* Check if something = (plus (reg) (const_int -1)).  */
!   if (GET_CODE (SET_SRC (inc)) != PLUS
!   || XEXP (SET_SRC (inc), 0) != reg
!   || XEXP (SET_SRC (inc), 1) != constm1_rtx)
  return NULL_RTX;
  
/* Check for (set (pc) (if_then_else (condition)
--- 294,315 
  
/* Extract loop counter register.  */
reg = SET_DEST (inc);
+   src = SET_SRC (inc);
  
+   /* On IA-64, the RTL pattern of SRC is just like this 
+ (if_then_else:DI (ne (reg:DI 332 ar.lc)
+ (const_int 0 [0x0]))
+ (plus:DI (reg:DI 332 ar.lc)
+ (const_int -1 [0x]))
+ (reg:DI 332 ar.lc))  */
+ 
+   if (GET_CODE (src) == IF_THEN_ELSE)
+ src = XEXP (src, 1);
+   
/* Check if something = (plus (reg) (const_int -1)).  */
!   if (GET_CODE (src) != PLUS
!   || XEXP (src, 0) != reg
!   || XEXP (src, 1) != constm1_rtx)
  return NULL_RTX;
  
/* Check for (set (pc) (if_then_else (condition)
*** doloop_register_get (rtx insn, rtx *comp
*** 318,324 
   if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
 || GET_CODE (XEXP (condition, 1)) != CONST_INT).  */
if (GET_CODE (condition) != NE
!   || XEXP (condition, 1) != const1_rtx)
  return NULL_RTX;
  
if (XEXP (condition, 0) == reg)
--- 330,337 
   if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
 || GET_CODE (XEXP (condition, 1)) != CONST_INT).  */
if (GET_CODE (condition) != NE
!   || (XEXP (condition, 1) != const1_rtx
! && XEXP (condition, 1) != const0_rtx))
  return NULL_RTX;
  
if (XEXP (condition, 0) == reg)


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-04-05 Thread Canqun Yang
>On Mon, 28 Mar 2005, James E Wilson wrote:
>> Steven Bosscher wrote:
>>> OK, so I know this is not a popular subject, but can we *please* stop
>>> working on loop.c and focus on getting the new RTL and tree loop passes
>>> to do what we want?
>> I don't think anyone is objecting to this. [...]
>> I would however make a distinction here between new development work and
>> maintenance.  It would be better if new development work happened in the new
>> loop optimizer.  However, we still need to do maintenance work in loop.c.
>
>...and since Canqun reported 2.5% improvement on SPEC CFP2000 on ia64 with
>his current patch, I really think we should consider it.
>

Besides this, I've got another patch for improving the 
general induction variable optimizations defined in 
loop.c. With these two patches and properly set loop 
unrolling parameters, the tests of both the NAS and 
SPEC CPU2000 benchmarks on an IA-64 1GHz system show 
good results.

1. The following table shows the test results of the NAS 
benchmarks:

            Gcc-20050404    Gcc-20050404     Ratio
                            + Optimized
   Bt.W     22.16s          22.68s           0.98
   Cg.A      9.23s           7.45s           1.24
   Ep.W     12.3s           11.97s           1.03
   Ft.A     38.41s          25.98s           1.48
   Is.B     34.94s          33.47s           1.04
   Lu.W     32.93s          31.59s           1.04
   Mg.A     21.91s          14.64s           1.50
   Sp.W     59.71s          55.67s           1.07
   Geomean                                   1.16

"Gcc-20050404" is the GCC mainline version dated on 
April 4, 2005. It includes my previous patch of 
RECORD_TYPE for COMMON blocks without equivalence 
objects. The compile options for ¡°Gcc-20050404¡± is 
¡°-O3 -funroll-loops -fprefetch-loop-arrays¡±, and ¡°-
O3 -funroll-loops -fprefetch-loop-arrays --param max-
unrolled-insns=600 --param max-average-unrolled-
insns=320¡± for¡°Gcc-20050404+Optimized¡±. 

2. The SPEC CFP2000 test uses the same options as above.
   "Gcc-20050404" got a 426 SPEC ratio, and "Gcc-20050404 
+ Optimized" got a 459 SPEC ratio. You can download the 
attachments to see more details. And if address giv 
splitting were not missing in the new loop unroller, a 
SPEC ratio of up to 513 could be expected. 

>We all know how hard it is to get this kind of improvement on any of the
>SPECs -- and in fact improving the current optimizers will raise the
>bar for the new ones. ;-)
>
>Question is: who is going to review/potentially approve this patch?
>
>Gerald


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


CFP2000.154.pdf
Description: Adobe PDF document


CFP2000.155.pdf
Description: Adobe PDF document


Re: [rtl-optimization] Improve Data Prefetch for IA-64

2005-04-05 Thread Canqun Yang
Steven Bosscher <[EMAIL PROTECTED]>:

>
> What happens if you use the memory address unrolling patch, turn on
> -fweb, and set the unrolling parameters properly?
>

The memory address unrolling patch doesn't work on 
IA-64, and -fweb can improve the unroller, but it is 
still far from the old one. So I plan to port my work 
to the new loop optimizer after Zdenek has committed 
his patches. 

>
> Gr.
> Steven
>
> 

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Inline round for IA64

2005-04-07 Thread Canqun Yang
Hi, all

Gfortran translates the Fortran 95 intrinsic DNINT to a 
round operation with a double-precision argument and 
return value. Inlining the round operation speeds up 
the SPEC CFP2000 benchmark 189.lucas, which contains 
calls to the intrinsic DNINT, from 706 (SPEC ratio) to 
783 on an IA64 1GHz system. 

I have implemented the double-precision version of 
inline round. If it is worth doing, I can go on to 
finish the other precision modes.
 
2005-04-07  Canqun Yang  <[EMAIL PROTECTED]>

* config/ia64/ia64.md (UNSPEC_ROUND): New 
constant.
(floatxfxf2, fix_truncxf2): New instruction 
patterns.
(rounddf2): New expander.
(rounddf2_internal): New 
define_insn_and_split implementing inline
calculation of DFmode round.
* config/ia64/ia64.opt (-minline-round, -mno-
inline-round): Add new
IA64 options.
* doc/invoke.texi: Ditto.


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


ia64.md.diff
Description: Binary data


invoke.texi.diff
Description: Binary data


Re: Inline round for IA64

2005-04-08 Thread Canqun Yang
Geert Bosch <[EMAIL PROTECTED]>:

> As far as I can see from this patch, it rounds incorrectly.
> This is a problem with the library version as well, I believe.
>
> The issue is that one cannot round a positive float to int
> by adding 0.5 and truncating. (Same issues with negative values
> and subtracting 0.5, of course). This gives an error for the
> predecessor of 0.5. The gap between Pred (0.5) and 0.5 is half that of
> pred (1.0) and 1.0. So the value of Pred (0.5) + 0.5 lies exactly
> halfway between Pred (1.0) and 1.0. The CPU rounds this halfway value to
> even, or 1.0 in this case.
>
> So try rounding .4999444888487687421729788184165954589843750
> using IEEE double on a non-x86 platform, and you'll see it gets rounded
> to 1.0.

Do you mean the correct value should be 0.0 ?

> A similar problem exists with large odd integers between 2^52+1 and
> 2^53-1, where adding 0.5 results in a value exactly halfway between
> two integers, rounding up to the nearest even integer. So, for IEEE
> double, 4503599627370497 would round to 4503599627370498.

Do you mean 4503599627370498 is a wrong result?

>
> These issues can be fixed by not adding/subtracting 0.5, but Pred (0.5).
> As shown above, this rounds to 1.0 correctly for 0.5. For larger values
> halfway between two integers, the gap with the next higher representable
> number will only decrease, so the result will always be rounded up to the
> next higher integer. For this technique to work, however, it is necessary
> that the addition will be rounded to the target precision according to
> IEEE round-to-even semantics. On platforms such as x86, where GCC
> implicitly widens intermediate results for IEEE double, the rounding to
> integer should be performed entirely in long double mode, using the long
> double predecessor of 0.5.
>
> See ada/trans.c around line 5340 for an example of how Ada does this.
>
>-Geert
>
> On Apr 7, 2005, at 05:38, Canqun Yang wrote:
> > Gfortran translates the Fortran 95 intrinsic DNINT 
to
> > round operation with double precision type argument
> > and return value. Inline round operation will 
speed up
> > the SPEC CFP2000 benchmark 189.lucas which contains
> > function calls of intrinsic DNINT from 706 (SPEC
> > ratio) to 783 on IA64 1GHz system.
> >
> > I have implemented the double precison version of
> > inline round. If it is worth doing, I can go on to
> > finish the other precision mode versions.
>
> 

I attached an example for the intrinsic DNINT with its 
output. Would you please check it and tell me whether 
the result is correct?

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
! Test case for inline round
subroutine dnint_ex (a, b, n)
   real*8 a(n), b(n)
   integer n
   do i = 1, n
  b(i) = dnint (a(i))
   enddo
end

program round_test
   real*8 a(2), b(2)

   a(:) = (/.4999444888487687421729788184165954589843750_8,&
4503599627370497.0_8/)
   call dnint_ex (a, b, 2)
   write (*,*) b
end
  
The output is:
 
   0.00   4.503599627370497E+015
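
For concreteness, here is a minimal C check of the two cases Geert
describes, using the naive add-0.5-and-truncate scheme (illustrative only;
whether the attached IA-64 patch uses exactly this scheme is not shown in
the thread, and on x87 hardware the extended intermediate precision can
change the outcome):

#include <stdio.h>
#include <math.h>

/* Naive rounding for non-negative x: add 0.5 and truncate.  */
static double naive_round (double x)
{
  return trunc (x + 0.5);
}

int main (void)
{
  double p   = nextafter (0.5, 0.0);  /* predecessor of 0.5 */
  double big = 4503599627370497.0;    /* odd integer between 2^52 and 2^53 */

  /* p + 0.5 lies exactly halfway between pred(1.0) and 1.0, rounds to
     even, and the naive scheme yields 1.0 instead of the correct 0.0.  */
  printf ("%.20g -> %.1f (correct: 0.0)\n", p, naive_round (p));

  /* big + 0.5 is exactly halfway between two integers, rounds to even,
     and the naive scheme yields big + 1 instead of big.  */
  printf ("%.1f -> %.1f (correct: %.1f)\n", big, naive_round (big), big);
  return 0;
}

Compile with, for example, "cc -std=c99 round_check.c -lm".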


Re: SMS in gcc4.0

2005-04-21 Thread Canqun Yang
Steven Bosscher <[EMAIL PROTECTED]>:

> On Thursday 21 April 2005 17:37, Mostafa Hagog wrote:
> > The other thing is to analyze this problem more deeply but I don't have
> > IA64.
> ...and I don't care enough about it.  Canqun?
>
> Gr.
> Steven
>
> 

Ok, I'll try this.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


check_ext_dependent_givs

2005-05-05 Thread Canqun Yang
Hi, all,

Is anyone familiar with the check routine 
check_ext_dependent_givs defined in loop.c? Could you 
give me an example explaining why it is needed?

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: check_ext_dependent_givs

2005-05-12 Thread Canqun Yang
Hi, Bonzini,

Thank you for your response.
I do not want to modify the old loop optimizer defined 
in loop.c.  I am preparing to port some improvements 
done on gcc-3.5 to gcc-4.0, and the GIV optimizations 
are part of my concerns.

On IA-64, the GIV optimizations can hardly improve 
performance. The reason is that 
check_ext_dependent_givs cannot give an exact answer 
as to whether the BIVs will wrap around or not. In most 
cases, it only produces the conservative result that 
the BIVs may overflow and the corresponding GIVs cannot 
be reduced. 

I modified the code in check_ext_dependent_givs to let 
the BIVs always pass the check, then tested the example 
you gave me, but the result is the same as before.

Would you please give me another example which produces 
a wrong result if check_ext_dependent_givs is not 
called? A FORTRAN program would be nice, and my 
platform is a 64-bit system.

Best regards,

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: check_ext_dependent_givs

2005-05-13 Thread Canqun Yang
Hi, all,

I do not want to modify the old loop optimizer defined 
in loop.c. I am preparing to port some improvements 
done on gcc-3.5 to gcc-4.0, and the GIV optimizations 
are one of my concerns.

On IA-64, the GIV optimizations can hardly improve 
performance. The reason is that 
check_ext_dependent_givs cannot give an exact answer 
as to whether the BIVs will wrap around or not. Since 
check_ext_dependent_givs can only deal with BIVs in 
constant-iteration loops, or with BIVs that are the 
same as the loop iteration variable, and only a small 
fraction of BIVs satisfy this condition, in most cases 
only the conservative result is produced that the BIVs 
may overflow and the corresponding GIVs cannot be 
reduced. 

I modified the code in check_ext_dependent_givs to let 
the BIVs always pass the check, then tested the NAS 
benchmarks and the SPEC CFP2000 benchmarks; apart from 
significant performance improvements, no extra errors 
occurred.
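
A minimal C sketch of the kind of case check_ext_dependent_givs is meant
to guard against (written here for illustration, assuming a 64-bit target
with 32-bit int; not taken from the mail):

/* The 64-bit address giv for a[i] is derived from the 32-bit biv i by
   sign extension: addr = a + (long) i * 8.  Reducing this giv to
   "addr += 8 each iteration" is only valid if i never wraps around;
   if i overflowed from INT_MAX to INT_MIN, the sign-extended value
   would jump backwards while the reduced giv kept advancing.  The
   check conservatively rejects bivs whose non-wrapping behaviour it
   cannot prove, which is why so few givs end up being reduced.  */
void clear_array (double *a, long n)
{
  for (int i = 0; i < n; i++)  /* i: 32-bit biv on a 64-bit target */
    a[i] = 0.0;                /* &a[i]: 64-bit giv via sign extension */
}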

I have read the code in check_ext_dependent_givs and 
the mails about BIV overflow checking on GCC's mailing 
list written by Richard Henderson and Zdenek Dvorak, 
and also tested the example Paolo Bonzini sent to me. 
But I still have some questions about this.

1. There is an option '-fwrapv' to control the 
behavior of signed overflows. Can it also be used in 
check_ext_dependent_givs?

2. If check_ext_dependent_givs is not invoked, the 
program will give a wrong result; otherwise, a correct 
one. Would you please send me an example to show this? 
(FORTRAN programs are nicer.)

3. For FORTRAN programs, is there anything special? As 
far as I know, FORTRAN has only signed integers, and 
counted loops in FORTRAN are stricter than in C.

4. Is it reasonable to turn off this checking at some 
optimization level or with compile options like 
'-ffast-math' and '-fno-wrapv'?

5. Is there any way to extend check_ext_dependent_givs 
to handle non-iteration-variable BIVs in 
non-constant-iteration loops? I have tried but failed. 


Best regards,

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: SMS in gcc4.0

2005-06-01 Thread Canqun Yang
Hi, all

I've taken a look at modulo-sched.c recently, and found
that both new_cycles and orig_cycles are imprecise. The
reason is that kernel_number_of_cycles does not take the
data dependences of insns into account, as the DFA
scheduler does in haifa-sched.c.

On IA-64, three improvements are needed to let SMS work.
1) Modify doloop_register_get, or the similar function
defined in doloop.c, to recognize the loop count
register. I supplied a patch for this in April.

2) Use a more precise way to calculate the values of the
two kinds of cycles, or just ignore this benefit assertion.

3) The counted-loop register 'ar.lc' of IA-64 cannot be
updated directly. Another temporary register is needed
to evaluate the actual loop count after SMS scheduling,
and assign its value to 'ar.lc'.


Mostafa Hagog <[EMAIL PROTECTED]>:

> 
>
>
>
> Steven Bosscher <[EMAIL PROTECTED]> wrote on 22/04/2005 09:39:09:
>
>
> >
> > Thanks!
> > For the record, this refers to a patch I sent to Mostafa and Canqun to
> > do what Mostafa suggested last month to make SMS work for ia64, see
> > http://gcc.gnu.org/ml/gcc-patches/2005-03/msg02848.html.
>
> I have tested the patch on powerpc-apple-darwin and there are some tests
> that started failing. So I am going to debug it to see what causes the
> failures.
>
> Mostafa.
>
> >
> > Gr.
> > Steven
> >
> >
>
> 


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: SMS in gcc4.0

2005-06-01 Thread Canqun Yang
Steven Bosscher <[EMAIL PROTECTED]>:

> On Wednesday 01 June 2005 16:43, Canqun Yang wrote:
> > Hi, all
> >
> > I've taken a look on modulo-sched.c recently, and 
found
> > that both new_cycles and orig_cycles are 
imprecise. The
> > reason is that kernel_number_of_cycles does not 
take the
> > data dependences of insns into account as the DFA
> > scheduler does in haifa-sched.c.
>
> How does this affect the cycles computation?
>

An insn is ready for scheduling only when all the insns 
it depends on have already been scheduled. In 
haifa-sched.c, there is a queue to hold the insns which 
are ready for scheduling.

To see how the data dependences affect the cycle 
computation, the simplest way is to compare the two 
versions of assembly code generated by GCC, one with 
'-fmodulo-sched' turned on and the other without. 
Without SMS, the code in the loop has many stops ';;' 
to separate the instructions which have data 
dependences, while with SMS, although the kernel code 
of the loop has more instructions, it has fewer stops 
';;'. 

> > On IA-64, three improvements are needed to let SMS 
work.
> > 1) Modify doloop_register_get or the similar 
function
> > defined in doloop.c to recognize the loop count
> > register. I have supplied a patch about this in 
April.
>
> Mustafa and I have a patch that has a similar effect, see
> http://gcc.gnu.org/ml/gcc-patches/2005-06/msg00035.html.
>
> > 2) Use more precise way to calculate the values of 
the
> > two kind of cycles, or just ignore this benefit 
assertion.
>
> Probably need to be more precise :-/
>
> When I manually hacked modulo-sched.c to ignore this 
test, I
> did see loops getting scheduled, but I also ran into 
ICEs in
> cfglayout.

There are no ICEs for pi.f90, swim.f, and mgrid.f 
according to my test. But an internal compiler error of 
'unrecognizable insn' is produced by 'gen_sub2_insn', 
which explicitly subtracts from 'ar.lc', when swim.f 
and mgrid.f are being compiled.

>
> > 3) The counted loop register 'ar.lc' of IA-64 can 
not be
> > updated  directly. Another temporary register is 
needed
> > to evaluate the value of the actural loop count 
after
> > SMS schedule, and assign its value to 'ar.lc'.
>
> Actually, should SMS just not update the loop 
register in place?
> I never figured out why it tries to produce a sub 
insns (using
> gen_sub2_insn which is also wrong btw).
>

The current implementation of SMS does not use IA-64's 
epilog register (ar.ec). After SMS, the loop count is 
just used to control how many times the kernel code 
executes, and the kernel code will execute 
   loop_count - (stage_count - 1) times.
The sub insn generated by gen_sub2_insn is used to 
produce this value.
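
A small worked example of that formula (numbers invented purely for
illustration):

   loop_count  = 100 iterations
   stage_count = 4 pipeline stages
   kernel executions = 100 - (4 - 1) = 97;
   the remaining 3 iterations are completed by the prolog and epilog
   code that SMS emits around the kernel.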


> Gr.
> Steven
>
> 


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: SMS in gcc4.0

2005-06-01 Thread Canqun Yang
Canqun Yang <[EMAIL PROTECTED]>:

> Steven Bosscher <[EMAIL PROTECTED]>:
>
> > On Wednesday 01 June 2005 16:43, Canqun Yang wrote:
> > > Hi, all
> > >
> > > I've taken a look on modulo-sched.c recently, and
> found
> > > that both new_cycles and orig_cycles are
> imprecise. The
> > > reason is that kernel_number_of_cycles does not
> take the
> > > data dependences of insns into account as the DFA
> > > scheduler does in haifa-sched.c.
> >
> > How does this affect the cycles computation?
> >
>
> An insns is ready for schedule only when all the 
insns
> it dependent on have already be scheduled. In haifa-
> sched.c, there is a queue to hold the insns which are
> ready for schedule.
>
> To find how the data dependence affect the cycles
> computation, the more simple way is to compare the
> two versions of assembly code generated by GCC
> respectively, one is generated by turning on '-
fmodulo-
> sched', the other not. Without SMS, the code in loop
> has many stops ';;' to seperate the instrcutions 
which
> have data dependence, while with SMS, though the
> kernel code of the loop has more instructions, but
> less stops ';;'.
>
> > > On IA-64, three improvements are needed to let 
SMS
> work.
> > > 1) Modify doloop_register_get or the similar
> function
> > > defined in doloop.c to recognize the loop count
> > > register. I have supplied a patch about this in
> April.
> >
> > Mustafa and I have a patch that has a similar
> effect, see
> > http://gcc.gnu.org/ml/gcc-patches/2005-
> 06/msg00035.html.
> >
> > > 2) Use more precise way to calculate the values 
of
> the
> > > two kind of cycles, or just ignore this benefit
> assertion.
> >
> > Probably need to be more precise :-/
> >
> > When I manually hacked modulo-sched.c to ignore 
this
> test, I
> > did see loops getting scheduled, but I also ran 
into
> ICEs in
> > cfglayout.
>
> There are no ICEs for pi.f90, swim.f, and mgrid.f
> according to my test. But, an internal compile error
> of 'unrecognizable insn' is produced
> by 'gen_sub2_insn' which explicitly minus 'ar.lc' 
when
> swim.f and mgrid.f are being compiled.


There are no ICEs for pi.f90 according to my test. But 
ICEs of 'unrecognizable insn' are produced by 
'gen_sub2_insn', which explicitly subtracts from 
'ar.lc', when swim.f and mgrid.f are being compiled.


>
> >
> > > 3) The counted loop register 'ar.lc' of IA-64 can
> not be
> > > updated  directly. Another temporary register is
> needed
> > > to evaluate the value of the actural loop count
> after
> > > SMS schedule, and assign its value to 'ar.lc'.
> >
> > Actually, should SMS just not update the loop
> register in place?
> > I never figured out why it tries to produce a sub
> insns (using
> > gen_sub2_insn which is also wrong btw).
> >
>
> The current implementation of SMS does not use IA-
64's
> epilog register (ar.ec). After SMS, the loop count is
> just used to control the execution times of the 
kernel
> code, and the kernel code will execute
>loop_count - (stage_count - 1) times
> The sub insns generated by gen_sub2_insn is used to
> produce this value.
>
>
> > Gr.
> > Steven
> >
> >
>

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Function Inlining for FORTRAN

2005-07-20 Thread Canqun Yang
Hi, all

Function inlining for FORTRAN programs always fails. If no one is working on 
it, I will give it a try.
Would you please give me some clues?

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: Function Inlining for FORTRAN

2005-07-21 Thread Canqun Yang
Paul Brook <[EMAIL PROTECTED]>:

> On Wednesday 20 July 2005 15:35, Canqun Yang wrote:
> > Hi, all
> >
> > Function inlining for FORTRAN programs always fails. 
> 
> Not entirely true. Inlining of contained procedures works fine (or it did 
> last time I checked). This should include inlining of siblings within a 
> module.
> 
> > If no one is working on it, I will give it a try. Would you please give me
> > some clues? 
> 
> The problem is that each top level program unit (PU)[1] is compiled 
> separately. Each PU has its own "external" decls for all function calls, 
> even if the function happens to be in the same file. Thus each PU is an 
> isolated self-contained tree structure, and the callgraph doesn't know the 
> definition and declaration are actually the same thing.
> 
> Basically what you need to do is parse the whole file, then start 
> generating code.
> 
> Unfortunately this isn't simple (or it would have been fixed already!).
> Unlike C, Fortran doesn't have file-level scope. It makes absolutely no 
> difference whether two procedures are in the same file, or in different 
> files.  You get all the problems that multifile IPA in C experiences within 
> a single Fortran file. 
> 
> The biggest problem is type consistency and aliasing. Consider the following

I have several FORTRAN 77 programs. After inlining the small functions in them 
by hand, they show a great performance improvement. So I need a trial 
implementation of function inlining to verify its effectiveness.

Now, my question is: if we just take the FORTRAN 77 syntax into account (no 
derived types, no complex aliasing), might it be simpler to implement function 
inlining for FORTRAN 77?

> 
> Paul
> 


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


Re: IPA branch

2005-08-05 Thread Canqun Yang
Hi,

The patch from Michael Matz 
(http://gcc.gnu.org/ml/fortran/2005-07/msg00331.html) may partly fix the 
multiple-decls problem.

I've tested and tuned this patch. It works: small functions can be inlined 
after the DECL_INLINE flag (build_function_decl in trans-decl.c) has been set 
for them. The only regression is the FORTRAN 95 testcase 
function_modulo_1.f90, which produces a wrong result. 


Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.


[patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Canqun Yang
Hi, all

This patch results in performance increases of 4% for SPECfp2000 and 13% for 
the NAS benchmark suite on an Itanium-2 system, respectively. Further 
performance increases can be hoped for by tuning the parameters and improving 
the prefetch algorithm at the tree level. 

Details of NAS benchmarks are listed below.


GCC options: -O3 -fprefetch-loop-arrays
Target: Itanium-2 1.6GHz; L2 Cache 256K, L3 Cache 6M
Execution times in seconds

           -this patch   +this patch
bt.W        14.43         14.17
cg.A        13.76          6.86
ep.W         7.83          7.79
ft.A        18.73         20.15
is.B        11.85         10.94
lu.W        20.55         20.27
mg.A        15.09         11.86
sp.W        37.11         35.49
geomean     15.84         13.94
speedup                   13.68%


2006-06-02  Canqun Yang  <[EMAIL PROTECTED]>

 * config/ia64/ia64.h (SIMULTANEOUS_PREFETCHES): Define to 18.
 (PREFETCH_BLOCK): Define to 128.
 (PREFETCH_LATENCY): Define to 400.

Index: ia64.h
===
--- ia64.h (revision 114307)
+++ ia64.h (working copy)
@@ -1985,13 +1985,18 @@
??? This number is bogus and needs to be replaced before the value is
actually used in optimizations.  */
 
-#define SIMULTANEOUS_PREFETCHES 6
+#define SIMULTANEOUS_PREFETCHES 18
 
 /* If this architecture supports prefetch, define this to be the size of
the cache line that is prefetched.  */
 
-#define PREFETCH_BLOCK 32
+#define PREFETCH_BLOCK 128
 
+/* A number that should roughly correspond to the number of instructions
+   executed before the prefetch is completed.  */
+
+#define PREFETCH_LATENCY 400
+
 #define HANDLE_SYSV_PRAGMA 1
 
 /* A C expression for the maximum number of instructions to execute via


Canqun Yang




Re: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Canqun Yang
--- Andrey Belevantsev <[EMAIL PROTECTED]>:

> Canqun Yang wrote:
> > Hi, all
> > 
> > This patch results a performance increase of 4% for SPECfp2000 and 13% for 
> > NAS benchmark suite
> on
> > Itanium-2 system, respectively. More performance increase is hopeful by 
> > further tuning the
> > parameters and improving the prefetch algorithm at tree level. 
> 
> Hi Canqun,
> 
> It's great news that you continued to work on prefetching tuning for 
> ia64!  Do you plan to port your other changes for the old RTL 
> prefetching to the tree level?
> 

Yes. But I don't have much time to do it now; I am busy with other things.

> > @@ -1985,13 +1985,18 @@
> > ??? This number is bogus and needs to be replaced before the value is
> > actually used in optimizations.  */
> 
> I suggest to remove this comment as it has become outdated with your 
> patch.  Instead you might say how did you choose this particular value 
> (and PREFETCH_BLOCK too).  Just my 2c.
> 
> Andrey
> 
> 

Please refer to my previous mail and the attached paper.

Canqun Yang



RE: [patch] Improve loop array prefetch for IA-64

2006-06-02 Thread Canqun Yang

--- "Davis, Mark" <[EMAIL PROTECTED]>:

> Canqun,
> 
> Nice job getting this ready for the current version of gcc!
> 
> Question: does gcc now know the difference between prefetching to cache L1 
> via "lfetch", as
> opposed to prefetching only to level L2 via "lfetch.nt1"?  For floating point 
> data, the latter
> is the only interesting case because float loads only access the L2.  Thus 
> using "lfetch" for
> floating point arrays will unnecessarily wipe out the contents of L1.  (gcc 
> 3.2.3 only seems to
> generate "lfetch", which is why I ask...)
> 

Yes, GCC does. I have tried this on the old prefetch implementation at the 
RTL level and the new one at the tree level, but saw no significant 
performance difference for the SPECfp2000 and NAS benchmarks. Nevertheless, 
it is worth taking more time to inspect it.

Canqun Yang


> Thanks,
> Mark 
> 
> -Original Message-
> From: Canqun Yang [mailto:[EMAIL PROTECTED] 
> Sent: Friday, June 02, 2006 5:14 AM
> To: gcc@gcc.gnu.org; [EMAIL PROTECTED]
> Subject: [patch] Improve loop array prefetch for IA-64
> 
> Hi, all
> 
> This patch results a performance increase of 4% for SPECfp2000 and 13% for 
> NAS benchmark suite
> on
> Itanium-2 system, respectively. More performance increase is hopeful by 
> further tuning the
> parameters and improving the prefetch algorithm at tree level. 
> 
> 
> Canqun Yang
> 
> 



The execution times of each function call in call graph

2006-10-06 Thread Canqun Yang
Hi, all

Is there any way to get the (estimated) number of times each function call is 
executed during the IPA passes? Currently, in GCC, the loop information can 
only be formed after the tree-ssa passes by calling loop_optimizer_init, so 
it is impossible to estimate the execution count of a function call when the 
IPA optimizations, like inlining, are run. Am I right?

Canqun





relocation truncated to fit

2007-07-26 Thread Canqun Yang
Hi, all

Can anyone help me to resolve this problem?

When I compile a program with a .bss segment larger than 2.0GB, I get the
following error message from the GNU linker (binutils-2.15).

(.text+0x305): In function `sta_':
: relocation truncated to fit: R_X86_64_32S plot_
..

I upgraded the assembler and the linker to the ones from binutils-2.17, and
then get the message below. 

STA.o: In function `sta_':
STA.F:(.text+0x305): relocation truncated to fit: R_X86_64_32S against
symbol `plot_' defined in COMMON section in STA.o

So I modified binutils-2.17/bfd/elf64-x86-64.c and rebuilt the linker to
ignore the relocation errors. Though the executable was generated, a
segmentation fault occurred during execution.

Here is the configuration of my computer:

CPU: Intel(R) Xeon(R) CPU5150  @ 2.66GHz
OS: Linux mds 2.6.9-34.EL_lustre.1.4.6.1custom #3 SMP Fri Jul 13 15:27:27
CST 2007 x86_64 x86_64 x86_64 GNU/Linux
Compiler: Intel C++/Fortran compiler for linux 10.0

I also wrote a program with large uninitialized data -- more than 2.0GB.
It works after being linked with the modified linker. The source code is appended.

#include <stdio.h>

#define N 0x05fff

double a[N][N];

int
main ()
{
  int i, j;
  double sum;

  for (i = 0; i < N; i+=5)
for (j = 0; j < N; j+=5)
  a[i][j] = 2* i*j + i*i + j*j;


  sum = 0.0;   
  for (i = 0; i < N; i+=5)
for (j = 0; j < N; j+=5)
  sum += a[i][j];

  printf ("%f\n", sum);
}


Best regards,

Canqun Yang




Re: relocation truncated to fit

2007-07-26 Thread Canqun Yang
Hi, Guenther

It works. Thank you very much!

Canqun Yang

--- Richard Guenther <[EMAIL PROTECTED]>:

> On 7/26/07, Canqun Yang <[EMAIL PROTECTED]> wrote:
> > Hi, all
> >
> > Can anyone help me to resolve this problem?
> >
> > When I compile a program with .bss segement larger than 2.0GB, I get the
> > following error message from GNU linker (binutils-2.15).
> >
> > (.text+0x305): In function `sta_':
> > : relocation truncated to fit: R_X86_64_32S plot_
> > ..
> >
> > I upgrade the assembler and the linker from binutis-2.17, then get the
> > message below.
> >
> > STA.o: In function `sta_':
> > STA.F:(.text+0x305): relocation truncated to fit: R_X86_64_32S against
> > symbol `plot_' defined in COMMON section in STA.o
> >
> > So, I modified the binutils-2.17/bfd/elf64-x86-64.c and rebuild the linker
> > to ignore the relocation errors. Though the executable generated,
> > segementation fault occurred during execution.
> >
> > Here is the configuration of my computer:
> >
> > CPU: Intel(R) Xeon(R) CPU5150  @ 2.66GHz
> > OS: Linux mds 2.6.9-34.EL_lustre.1.4.6.1custom #3 SMP Fri Jul 13 15:27:27
> > CST 2007 x86_64 x86_64 x86_64 GNU/Linux
> > Compiler: Intel C++/Fortran compiler for linux 10.0
> >
> > I also wrote a program with large uninitialized data -- more than 2.0GB.
> > It passes after linked with the modified linker. The source code is 
> > appended.
> 
> Try using -mcmodel=medium
> 
> Richard.
> 
> > #define N 0x05fff
> >
> > double a[N][N];
> >
> > int
> > main ()
> > {
> >   int i, j;
> >   double sum;
> >
> >   for (i = 0; i < N; i+=5)
> > for (j = 0; j < N; j+=5)
> >   a[i][j] = 2* i*j + i*i + j*j;
> >
> >
> >   sum = 0.0;
> >   for (i = 0; i < N; i+=5)
> > for (j = 0; j < N; j+=5)
> >   sum += a[i][j];
> >
> >   printf ("%f\n", sum);
> > }
> >
> >
> > Best regards,
> >
> > Canqun Yang
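
For reference, the suggested fix amounts to building with the medium code
model, which keeps code and small data within 32-bit reach but addresses
large objects (such as the array of more than 2 GB above) with 64-bit
relocations. A sketch of the invocation (the exact command line is assumed
here, not quoted from the thread):

   gcc -O2 -mcmodel=medium bigbss.c -o bigbss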


