GIV optimizations
Hi, all

The new loop unroller causes performance degradation due to the unimplemented giv (general induction variable) optimizations. When will it be implemented?

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: GIV optimizations
Hi, Steven

Thank you very much for your reply.

Induction variables are variables whose successive values form an arithmetic progression over a loop. Induction variables are often divided into bivs (basic induction variables), which are explicitly modified by the same constant amount during each iteration of a loop, and givs (general induction variables), which may be modified or computed as a linear function of a basic induction variable. There are three important transformations that apply to them: strength reduction, induction-variable removal, and linear-function test replacement. For example, we can do strength reduction of address givs, which are usually used for address calculation of array elements. On platforms with post-increment load and store instructions, this creates opportunities to combine a load/store with the following address calculation instruction. Induction variable splitting is also an effective optimization during loop unrolling.

The new loop optimizer only supports a limited giv analysis (ref. loop-iv.c), and has not yet implemented giv splitting (ref. 'loop-unroll.c', 'analyze_iv_to_split_insn', and the comment in that function: "For now we just split the basic induction variables. Later this may be extended for example by selecting also addresses of memory references.")

I tested 171.swim on an IA64 system with a 1GHz Itanium2 CPU. After implementing or improving/adjusting several compiler optimizations, such as Fortran alias analysis (a very simple one), loop unrolling (the old one), loop array prefetching, and giv optimizations, it takes just 9.1s (28s for GCC-4.0.0) to execute the train mode of 171.swim. But applying those changes to the current GCC-4.0.0 (the old loop unroller was removed), it takes 13.4s to execute this benchmark program, and I found that the missing giv optimizations are the major factor in that performance degradation.
Giv optimizations are just features that are not implemented yet in the new loop unroller, so I do not think putting this in Bugzilla is appropriate.

Steven Bosscher <[EMAIL PROTECTED]>:
> On Feb 28, 2005 02:35 PM, Canqun Yang <[EMAIL PROTECTED]> wrote:
> > Hi, all
> >
> > The new loop unroller causes performance degradation
> > due to the unimplemented giv (general induction
> > variable) optimizations.
> >
> > When will it be implemented?
>
> Will you be more specific so we can have a clue what you are
> talking about? Filing bugs in bugzilla showing the problem
> would help.
>
> Gr.
> Steven

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: GIV optimizations
Zdenek Dvorak <[EMAIL PROTECTED]>:
> Hello,
>
> > Giv optimizations are just features which not
> > implemented yet in the new loop unroller, so I think
> > put it in bugzilla is not appropriate.
>
> it most definitely is appropriate. This is a performance
> regression. Even if it would not be, feature requests
> can be put to Bugzilla.

Ok, thanks.

> The best of course would be if you could create a small testcase
> demonstrating what you would like the compiler to achieve.
>
> Zdenek

I attached a testcase with two assembly code versions, one with address giv splitting done by the loop unroller, the other without.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.

giv.f90        Description: Binary data
giv_no_opt.s   Description: Binary data
giv_opt.s      Description: Binary data
[rtl-optimization] Improve Data Prefetch for IA-64
Hi, all

Currently, GCC just ignores all data prefetches within a loop when the number of prefetches exceeds SIMULTANEOUS_PREFETCHES. That isn't advisable. Also, the macros defined in ia64.h for data prefetching are too small. This patch modifies the data prefetch algorithm defined in loop.c and redefines some macros in ia64.h accordingly. The test shows a 2.5 percent performance improvement for the SPEC CFP2000 benchmarks on IA-64. If the new loop unroller were as complete as the old one that was removed, much more performance improvement could be gained.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.

2005-03-25  Canqun Yang  <[EMAIL PROTECTED]>

	* config/ia64/ia64.h (SIMULTANEOUS_PREFETCHES): Redefine as 18.
	(PREFETCH_BLOCK): Redefine as 64.
	(PREFETCH_BLOCKS_BEFORE_LOOP_MAX): New definition.

2005-03-25  Canqun Yang  <[EMAIL PROTECTED]>

	* loop.c (PREFETCH_BLOCKS_BEFORE_LOOP_MAX): Defined conditionally.
	(scan_loop): Change extra_size from 16 to 128.
	(emit_prefetch_instructions): Don't ignore all prefetches within
	loop.

Index: loop.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/loop.c,v
retrieving revision 1.522
diff -c -3 -p -r1.522 loop.c
*** loop.c	17 Jan 2005 08:46:15 -0000	1.522
--- loop.c	25 Mar 2005 12:03:44 -0000
*************** struct loop_info
*** 434,440 ****
--- 434,442 ----
  #define MAX_PREFETCHES 100
  /* The number of prefetch blocks that are beneficial to fetch at once before
     a loop with a known (and low) iteration count.  */
+ #ifndef PREFETCH_BLOCKS_BEFORE_LOOP_MAX
  #define PREFETCH_BLOCKS_BEFORE_LOOP_MAX 6
+ #endif
  /* For very tiny loops it is not worthwhile to prefetch even before the loop,
     since it is likely that the data are already in the cache.  */
  #define PREFETCH_BLOCKS_BEFORE_LOOP_MIN 2
*************** scan_loop (struct loop *loop, int flags)
*** 1100,1106 ****
    /* Allocate extra space for REGs that might be created by load_mems.
       We allocate a little extra slop as well, in the hopes that we won't
       have to reallocate the regs array.  */
!   loop_regs_scan (loop, loop_info->mems_idx + 16);
    insn_count = count_insns_in_loop (loop);
    if (loop_dump_stream)
--- 1102,1108 ----
    /* Allocate extra space for REGs that might be created by load_mems.
       We allocate a little extra slop as well, in the hopes that we won't
       have to reallocate the regs array.  */
!   loop_regs_scan (loop, loop_info->mems_idx + 128);
    insn_count = count_insns_in_loop (loop);
    if (loop_dump_stream)
*************** emit_prefetch_instructions (struct loop
*** 4398,4406 ****
      {
        if (loop_dump_stream)
	  fprintf (loop_dump_stream,
!		   "Prefetch: ignoring prefetches within loop: ahead is zero; %d < %d\n",
		   SIMULTANEOUS_PREFETCHES, num_real_prefetches);
!       num_real_prefetches = 0, num_real_write_prefetches = 0;
      }
  }
  /* We'll also use AHEAD to determine how many prefetch instructions to
--- 4400,4411 ----
      {
        if (loop_dump_stream)
	  fprintf (loop_dump_stream,
!		   "Prefetch: ignoring some prefetches within loop: ahead is zero; %d < %d\n",
		   SIMULTANEOUS_PREFETCHES, num_real_prefetches);
!       num_real_prefetches = MIN (num_real_prefetches,
!				  SIMULTANEOUS_PREFETCHES);
!       num_real_write_prefetches = MIN (num_real_write_prefetches,
!				       SIMULTANEOUS_PREFETCHES);
      }
  }
  /* We'll also use AHEAD to determine how many prefetch instructions to
Index: config/ia64/ia64.h
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/ia64/ia64.h,v
retrieving revision 1.194
diff -c -3 -p -r1.194 ia64.h
*** config/ia64/ia64.h	17 Mar 2005 17:35:16 -0000	1.194
--- config/ia64/ia64.h	25 Mar 2005 12:05:05 -0000
*************** do { \
*** 1993,2004 ****
     ??? This number is bogus and needs to be replaced before the value is
     actually used in optimizations.  */
! #define SIMULTANEOUS_PREFETCHES 6
  /* If this architecture supports prefetch, define this to be the size of
     the cache line that is prefetched.  */
! #define PREFETCH_BLOCK 32
  #define HANDLE_SYSV_PRAGMA 1
--- 1993,2008 ----
     ??? This number is bogus and needs to be replaced before the value is
     actually used in optimizations.  */
! #define SIMULTANEOUS_PREFETCHES 18
  /* If this architecture supports prefetch, define this to be the size of
     the cache line that is prefetched.  */
! #define PREFETCH_BLOCK 64
! 
! /* The number of prefetch blo
Re: [rtl-optimization] Improve Data Prefetch for IA-64
The last ChangeLog entry of the rtlopt-branch was written in 2003. After more than a year, many improvements in this branch haven't been put into the GCC HEAD. Why?

Quote: Steven Bosscher <[EMAIL PROTECTED]>:
> On Saturday 26 March 2005 02:22, Canqun Yang wrote:
> > 	* loop.c (PREFETCH_BLOCKS_BEFORE_LOOP_MAX): Defined conditionally.
> > 	(scan_loop): Change extra_size from 16 to 128.
> > 	(emit_prefetch_instructions): Don't ignore all prefetches within
> > 	loop.
>
> OK, so I know this is not a popular subject, but can we *please* stop
> working on loop.c and focus on getting the new RTL and tree loop passes
> to do what we want? All this loop.c patching is a typical example of
> why free software development does not always work: always going for
> the low-hanging fruit. In this case, there have been several attempts
> to replace the prefetching stuff in loop.c with something better. On
> the rtl-opt branch there is a new RTL loop-prefetch.c, and on the LNO
> branch there is a re-use analysis based prefetching pass. Why don't
> you try to finish and improve those passes, instead of making it yet
> again harder to remove loop.c. This one file is a *huge* problem for
> just about the entire RTL optimizer path. It is, for example, the
> reason why there is no profile information available before this old
> piece of, if I may say, junk runs, and it is the only reason why a great
> many functions in for example jump.c and the various cfg*.c files can
> still not be removed.
>
> Gr.
> Steven

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: [rtl-optimization] Improve Data Prefetch for IA-64
Quote: Steven Bosscher <[EMAIL PROTECTED]>:
> On Sunday 27 March 2005 03:53, Canqun Yang wrote:
> > The last ChangeLog of rtlopt-branch was written in
> > 2003. After more than one year, many improvements in
> > this branch haven't been put into the GCC HEAD. Why?
>
> Almost all of the rtlopt branch was merged. Prefetching is one
> of the few things that was not, probably Zdenek knows why.

Another question is why the new RTL loop unroller does not support giv splitting. It is very useful according to my tests with the old one. Does anyone plan to implement it? The writer of the new loop unroller, or someone who is familiar with that part, would carry it out better and faster, I think.

> Gr.
> Steven

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: [rtl-optimization] Improve Data Prefetch for IA-64
Quote: Zdenek Dvorak <[EMAIL PROTECTED]>:
> Hello,
>
> > On Sunday 27 March 2005 03:53, Canqun Yang wrote:
> > > The last ChangeLog of rtlopt-branch was written in
> > > 2003. After more than one year, many improvements in
> > > this branch haven't been put into the GCC HEAD. Why?
> >
> > Almost all of the rtlopt branch was merged. Prefetching is one
> > of the few things that was not, probably Zdenek knows why.
>
> because I never made it work as well as the current version,
> basically. At least no matter how much I tried, I never produced
> any benchmark numbers that would justify the change. Then I went
> to tree-ssa (and tried to write prefetching there, and got stuck on
> the same issue, after which I simply forgot due to loads of other work
> :-( ).
>
> Zdenek

It should be for a similar reason as the comments in my previously supplied patch: http://gcc.gnu.org/ml/gcc-patches/2005-03/msg02400.html

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: [rtl-optimization] Improve Data Prefetch for IA-64
Quote: Zdenek Dvorak <[EMAIL PROTECTED]>:
> Hello,
>
> > On Sunday 27 March 2005 04:45, Canqun Yang wrote:
> > > Another question is why the new RTL loop unroller does
> > > not support giv splitting.
> >
> > Apparently because for most people it is not a problem that it does
> > not do it, and while you have indicated earlier that it may be useful
> > for you, you have neither tried to implement it yourself, nor provided
> > a test case to PR20376.
> >
> > FWIW you could try -fweb and see if it does what you want. And if it
> > does, you could write a limited webizer patch that works on just loop
> > bodies after unrolling.
>
> from what I understood, a change to analyze_iv_to_split_insn that would
> detect memory references should suffice. Also maybe this patch might be
> relevant:
>
> http://gcc.gnu.org/ml/gcc-patches/2004-10/msg01176.html
>
> Zdenek

I'll try this. Thanks a lot.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
RE: SMS in gcc4.0
Hi, all

This patch fixes doloop_register_get defined in modulo-sched.c, and lets a PI calculation program on IA-64 be successfully modulo scheduled. On a 1GHz Itanium-2, it takes just 3.128 seconds to execute when compiled with "-fmodulo-sched -O3" turned on, versus 5.454 seconds without "-fmodulo-sched".

2005-03-31  Canqun Yang  <[EMAIL PROTECTED]>

	* modulo-sched.c (doloop_register_get): Deal with if_then_else
	pattern.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.

pi.f90   Description: Binary data

*** /home/ycq/mainline/gcc/gcc/modulo-sched.c	Mon Mar 21 10:49:23 2005
--- modulo-sched.c	Thu Mar 31 21:11:08 2005
*************** static rtx
*** 263,269 ****
  doloop_register_get (rtx insn, rtx *comp)
  {
    rtx pattern, cmp, inc, reg, condition;
! 
    if (!JUMP_P (insn))
      return NULL_RTX;
    pattern = PATTERN (insn);
--- 263,270 ----
  doloop_register_get (rtx insn, rtx *comp)
  {
    rtx pattern, cmp, inc, reg, condition;
!   rtx src;
! 
    if (!JUMP_P (insn))
      return NULL_RTX;
    pattern = PATTERN (insn);
*************** doloop_register_get (rtx insn, rtx *comp
*** 293,303 ****
    /* Extract loop counter register.  */
    reg = SET_DEST (inc);
  
    /* Check if something = (plus (reg) (const_int -1)).  */
!   if (GET_CODE (SET_SRC (inc)) != PLUS
!       || XEXP (SET_SRC (inc), 0) != reg
!       || XEXP (SET_SRC (inc), 1) != constm1_rtx)
      return NULL_RTX;
  
    /* Check for (set (pc) (if_then_else (condition)
--- 294,315 ----
    /* Extract loop counter register.  */
    reg = SET_DEST (inc);
+   src = SET_SRC (inc);
+ 
+   /* On IA-64, the RTL pattern of SRC is just like this
+      (if_then_else:DI (ne (reg:DI 332 ar.lc)
+                           (const_int 0 [0x0]))
+                       (plus:DI (reg:DI 332 ar.lc)
+                                (const_int -1 [0x]))
+                       (reg:DI 332 ar.lc))  */
+ 
+   if (GET_CODE (src) == IF_THEN_ELSE)
+     src = XEXP (src, 1);
+ 
    /* Check if something = (plus (reg) (const_int -1)).  */
!   if (GET_CODE (src) != PLUS
!       || XEXP (src, 0) != reg
!       || XEXP (src, 1) != constm1_rtx)
      return NULL_RTX;
  
    /* Check for (set (pc) (if_then_else (condition)
*************** doloop_register_get (rtx insn, rtx *comp
*** 318,324 ****
       if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
	 || GET_CODE (XEXP (condition, 1)) != CONST_INT).  */
    if (GET_CODE (condition) != NE
!       || XEXP (condition, 1) != const1_rtx)
      return NULL_RTX;
  
    if (XEXP (condition, 0) == reg)
--- 330,337 ----
       if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
	 || GET_CODE (XEXP (condition, 1)) != CONST_INT).  */
    if (GET_CODE (condition) != NE
!       || (XEXP (condition, 1) != const1_rtx
!	   && XEXP (condition, 1) != const0_rtx))
      return NULL_RTX;
  
    if (XEXP (condition, 0) == reg)
Re: [rtl-optimization] Improve Data Prefetch for IA-64
> On Mon, 28 Mar 2005, James E Wilson wrote:
> > Steven Bosscher wrote:
> > > OK, so I know this is not a popular subject, but can we *please* stop
> > > working on loop.c and focus on getting the new RTL and tree loop passes
> > > to do what we want?
> > I don't think anyone is objecting to this. [...]
> > I would however make a distinction here between new development work and
> > maintenance. It would be better if new development work happened in the new
> > loop optimizer. However, we still need to do maintenance work in loop.c.
>
> ...and since Canqun reported 2.5% improvement on SPEC CFP2000 on ia64 with
> his current patch, I really think we should consider it.

Besides this, I've got another patch improving the general induction variable optimizations defined in loop.c. With these two patches and properly set loop unrolling parameters, the tests of both the NAS and SPEC CPU2000 benchmarks on an IA-64 1GHz system show good results.

1. The following table shows the test results of the NAS benchmarks:

          Gcc-20050404   Gcc-20050404+Optimized   Ratio
  Bt.W    22.16s         22.68s                   0.98
  Cg.A     9.23s          7.45s                   1.24
  Ep.W    12.3s          11.97s                   1.03
  Ft.A    38.41s         25.98s                   1.48
  Is.B    34.94s         33.47s                   1.04
  Lu.W    32.93s         31.59s                   1.04
  Mg.A    21.91s         14.64s                   1.50
  Sp.W    59.71s         55.67s                   1.07
  Geomean                                         1.16

"Gcc-20050404" is the GCC mainline version dated April 4, 2005. It includes my previous patch of RECORD_TYPE for COMMON blocks without equivalence objects. The compile options for "Gcc-20050404" are "-O3 -funroll-loops -fprefetch-loop-arrays", and "-O3 -funroll-loops -fprefetch-loop-arrays --param max-unrolled-insns=600 --param max-average-unrolled-insns=320" for "Gcc-20050404+Optimized".

2. The SPEC CFP2000 test uses the same options as above. "Gcc-20050404" got a 426 SPEC ratio, and "Gcc-20050404+Optimized" got a 459 SPEC ratio. You can download the attachments to see more details. And if address giv splitting were not missing from the new loop unroller, a SPEC ratio of up to 513 could be expected.

> We all know how hard it is to get this kind of improvement on any of the
> SPECs -- and in fact improving the current optimizers will raise the
> bar for the new ones. ;-)
>
> Question is: who is going to review/potentially approve this patch?
>
> Gerald

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.

CFP2000.154.pdf   Description: Adobe PDF document
CFP2000.155.pdf   Description: Adobe PDF document
Re: [rtl-optimization] Improve Data Prefetch for IA-64
Steven Bosscher <[EMAIL PROTECTED]>:
> What happens if you use the memory address unrolling patch, turn on
> -fweb, and set the unrolling parameters properly?

The memory address unrolling patch doesn't work on IA-64, and -fweb improves the unroller, but it is still far from matching the old one. So, I plan to port my work to the new loop optimizer after Zdenek has committed his patches.

> Gr.
> Steven

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Inline round for IA64
Hi, all

Gfortran translates the Fortran 95 intrinsic DNINT to a round operation with a double precision argument and return value. Inlining the round operation speeds up the SPEC CFP2000 benchmark 189.lucas, which contains calls of the intrinsic DNINT, from 706 (SPEC ratio) to 783 on an IA64 1GHz system.

I have implemented the double precision version of inline round. If it is worth doing, I can go on to finish the versions for the other precision modes.

2005-04-07  Canqun Yang  <[EMAIL PROTECTED]>

	* config/ia64/ia64.md (UNSPEC_ROUND): New constant.
	(floatxfxf2, fix_truncxf2): New instruction patterns.
	(rounddf2): New expander.
	(rounddf2_internal): New define_insn_and_split implementing
	inline calculation of DFmode round.
	* config/ia64/ia64.opt (-minline-round, -mno-inline-round): Add
	new IA64 options.
	* doc/invoke.texi: Ditto.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.

ia64.md.diff      Description: Binary data
invoke.texi.diff  Description: Binary data
Re: Inline round for IA64
Geert Bosch <[EMAIL PROTECTED]>:
> As far as I can see from this patch, it rounds incorrectly.
> This is a problem with the library version as well, I believe.
>
> The issue is that one cannot round a positive float to int
> by adding 0.5 and truncating. (Same issues with negative values
> and subtracting 0.5, of course). This gives an error for the
> predecessor of 0.5. The gap between Pred (0.5) and 0.5 is half that
> between Pred (1.0) and 1.0. So the value of Pred (0.5) + 0.5 lies
> exactly halfway between Pred (1.0) and 1.0. The CPU rounds this
> halfway value to even, or 1.0 in this case.
>
> So try rounding 0.499999999999999944488848768742172978818416595458984375
> using IEEE double on a non-x86 platform, and you'll see it gets rounded
> to 1.0.

Do you mean the correct value should be 0.0?

> A similar problem exists with large odd integers between 2^52+1 and
> 2^53-1, where adding 0.5 results in a value exactly halfway between two
> integers, rounding up to the nearest even integer. So, for IEEE double,
> 4503599627370497 would round to 4503599627370498.

Do you mean 4503599627370498 is a wrong result?

> These issues can be fixed by not adding/subtracting 0.5, but Pred (0.5).
> As shown above, this rounds to 1.0 correctly for 0.5. For larger values
> halfway between two integers, the gap with the next higher representable
> number will only decrease, so the result will always be rounded up to the
> next higher integer. For this technique to work, however, it is necessary
> that the addition be rounded to the target precision according to IEEE
> round-to-even semantics. On platforms such as x86, where GCC implicitly
> widens intermediate results for IEEE double, the rounding to integer
> should be performed entirely in long double mode, using the long double
> predecessor of 0.5.
>
> See ada/trans.c around line 5340 for an example of how Ada does this.
>
> -Geert
>
> On Apr 7, 2005, at 05:38, Canqun Yang wrote:
> > Gfortran translates the Fortran 95 intrinsic DNINT to
> > round operation with double precision type argument
> > and return value. Inline round operation will speed up
> > the SPEC CFP2000 benchmark 189.lucas which contains
> > function calls of intrinsic DNINT from 706 (SPEC
> > ratio) to 783 on IA64 1GHz system.
> >
> > I have implemented the double precison version of
> > inline round. If it is worth doing, I can go on to
> > finish the other precision mode versions.

I attached an example for the intrinsic DNINT with its output. Would you please check it and tell me whether the result is correct?

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.

! Test case for inline round
subroutine dnint_ex (a, b, n)
  real*8 a(n), b(n)
  integer n
  do i = 1, n
    b(i) = dnint (a(i))
  enddo
end

program round_test
  real*8 a(2), b(2)
  a(:) = (/0.499999999999999944488848768742172978818416595458984375_8,&
           4503599627370497.0_8/)
  call dnint_ex (a, b, 2)
  write (*,*) b
end

The output is:
  0.00  4.503599627370497E+015
Re: SMS in gcc4.0
Steven Bosscher <[EMAIL PROTECTED]>:
> On Thursday 21 April 2005 17:37, Mostafa Hagog wrote:
> > The other thing is to analyze this problem more deeply but I don't have
> > IA64.
>
> ...and I don't care enough about it. Canqun?
>
> Gr.
> Steven

Ok, I'll try this.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
check_ext_dependent_givs
Hi, all,

Is there anyone familiar with the check routine check_ext_dependent_givs defined in loop.c who could give me an example explaining why it is needed?

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: check_ext_dependent_givs
Hi, Bonzini,

Thank you for your response. I do not want to modify the old loop optimizer defined in loop.c. I am preparing to port some improvements done on gcc-3.5 to gcc-4.0, and the GIV optimizations are part of my concerns.

On IA-64, the GIV optimization can hardly improve the performance. The reason is that check_ext_dependent_givs cannot give an exact answer as to whether the BIVs will wrap around or not. In most cases, it only produces the conservative result that the BIVs may overflow and the corresponding GIVs cannot be reduced.

I modified the code in check_ext_dependent_givs to let the BIVs always pass the check, then tested the example you gave me, but the result is the same as before. Would you please give me another example which will lead to a wrong result if check_ext_dependent_givs has not been called? A FORTRAN program would be nice, and my platform is a 64-bit system.

Best regards,

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: check_ext_dependent_givs
Hi, all,

I do not want to modify the old loop optimizer defined in loop.c. I am preparing to port some improvements done on gcc-3.5 to gcc-4.0, and the GIV optimizations are one of my concerns.

On IA-64, the GIV optimizations can hardly improve the performance. The reason is that check_ext_dependent_givs cannot give an exact answer as to whether the BIVs will wrap around or not. Since check_ext_dependent_givs can only deal with BIVs in constant-iteration loops, or BIVs that are the same as the loop iteration variable, and only a small fraction of BIVs satisfy this condition, in most cases only the conservative result is produced that the BIVs may overflow and the corresponding GIVs cannot be reduced.

I modified the code in check_ext_dependent_givs to let the BIVs always pass the check, then tested the NAS benchmarks and the SPEC CFP2000 benchmarks; apart from significant performance improvements, no extra errors occurred.

I have read the code in check_ext_dependent_givs and the mails about BIV overflow checking on GCC's mailing list written by Richard Henderson and Zdenek Dvorak, and also tested the example Paolo Bonzini sent to me. But I still have some questions about this:

1. There is an option '-fwrapv' to control the behavior of signed overflow. Can it also be used in check_ext_dependent_givs?

2. If check_ext_dependent_givs has not been invoked, a program will give a wrong result; otherwise, a correct one. Would you please send me an example to show this? (FORTRAN programs are nicer.)

3. For FORTRAN programs, is there anything special? As far as I know, there are only signed integers in FORTRAN, and the counted loops in FORTRAN are stricter than in C.

4. Is it reasonable to turn off this checking at some optimization level, or with compile options like '-ffast-math' and '-fno-wrapv'?

5. Is there any way to extend check_ext_dependent_givs to handle non-iteration-variable BIVs in non-constant-iteration loops? I have tried but failed.

Best regards,

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: SMS in gcc4.0
Hi, all

I've taken a look at modulo-sched.c recently, and found that both new_cycles and orig_cycles are imprecise. The reason is that kernel_number_of_cycles does not take the data dependences of insns into account, as the DFA scheduler in haifa-sched.c does.

On IA-64, three improvements are needed to make SMS work:
1) Modify doloop_register_get, or the similar function defined in doloop.c, to recognize the loop count register. I supplied a patch for this in April.
2) Use a more precise way to calculate the values of the two kinds of cycles, or just ignore this benefit assertion.
3) The counted-loop register 'ar.lc' of IA-64 cannot be updated directly. Another temporary register is needed to evaluate the value of the actual loop count after the SMS schedule, and assign its value to 'ar.lc'.

Mostafa Hagog <[EMAIL PROTECTED]>:
> Steven Bosscher <[EMAIL PROTECTED]> wrote on 22/04/2005 09:39:09:
> >
> > Thanks!
> > For the record, this refers to a patch I sent to Mostafa and Canqun to
> > do what Mostafa suggested last month to make SMS work for ia64, see
> > http://gcc.gnu.org/ml/gcc-patches/2005-03/msg02848.html.
>
> I have tested the patch on powerpc-apple-darwin and there are some tests
> that started failing. So I am going to debug it to see what causes the
> failures.
>
> Mostafa.
>
> > Gr.
> > Steven

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: SMS in gcc4.0
Steven Bosscher <[EMAIL PROTECTED]>:
> On Wednesday 01 June 2005 16:43, Canqun Yang wrote:
> > Hi, all
> >
> > I've taken a look on modulo-sched.c recently, and found
> > that both new_cycles and orig_cycles are imprecise. The
> > reason is that kernel_number_of_cycles does not take the
> > data dependences of insns into account as the DFA
> > scheduler does in haifa-sched.c.
>
> How does this affect the cycles computation?

An insn is ready for scheduling only when all the insns it depends on have already been scheduled. In haifa-sched.c, there is a queue holding the insns which are ready for scheduling.

To see how the data dependences affect the cycles computation, the simpler way is to compare the two versions of assembly code generated by GCC, one with '-fmodulo-sched' turned on, the other without. Without SMS, the code in the loop has many stops ';;' to separate instructions which have data dependences, while with SMS, though the kernel code of the loop has more instructions, it has fewer stops ';;'.

> > On IA-64, three improvements are needed to let SMS work.
> > 1) Modify doloop_register_get or the similar function
> > defined in doloop.c to recognize the loop count
> > register. I have supplied a patch about this in April.
>
> Mustafa and I have a patch that has a similar effect, see
> http://gcc.gnu.org/ml/gcc-patches/2005-06/msg00035.html.
>
> > 2) Use more precise way to calculate the values of the
> > two kind of cycles, or just ignore this benefit assertion.
>
> Probably need to be more precise :-/
>
> When I manually hacked modulo-sched.c to ignore this test, I
> did see loops getting scheduled, but I also ran into ICEs in
> cfglayout.

There are no ICEs for pi.f90, swim.f, and mgrid.f according to my test. But an internal compiler error of 'unrecognizable insn' is produced by 'gen_sub2_insn', which explicitly decrements 'ar.lc', when swim.f and mgrid.f are being compiled.

> > 3) The counted loop register 'ar.lc' of IA-64 can not be
> > updated directly. Another temporary register is needed
> > to evaluate the value of the actural loop count after
> > SMS schedule, and assign its value to 'ar.lc'.
>
> Actually, should SMS just not update the loop register in place?
> I never figured out why it tries to produce a sub insns (using
> gen_sub2_insn which is also wrong btw).

The current implementation of SMS does not use IA-64's epilog register (ar.ec). After SMS, the loop count is just used to control the execution times of the kernel code, and the kernel code will execute loop_count - (stage_count - 1) times. The sub insn generated by gen_sub2_insn is used to produce this value.

> Gr.
> Steven

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: SMS in gcc4.0
Canqun Yang <[EMAIL PROTECTED]>: > Steven Bosscher <[EMAIL PROTECTED]>: > > > On Wednesday 01 June 2005 16:43, Canqun Yang wrote: > > > Hi, all > > > > > > I've taken a look on modulo-sched.c recently, and > found > > > that both new_cycles and orig_cycles are > imprecise. The > > > reason is that kernel_number_of_cycles does not > take the > > > data dependences of insns into account as the DFA > > > scheduler does in haifa-sched.c. > > > > How does this affect the cycles computation? > > > > An insns is ready for schedule only when all the insns > it dependent on have already be scheduled. In haifa- > sched.c, there is a queue to hold the insns which are > ready for schedule. > > To find how the data dependence affect the cycles > computation, the more simple way is to compare the > two versions of assembly code generated by GCC > respectively, one is generated by turning on '- fmodulo- > sched', the other not. Without SMS, the code in loop > has many stops ';;' to seperate the instrcutions which > have data dependence, while with SMS, though the > kernel code of the loop has more instructions, but > less stops ';;'. > > > > On IA-64, three improvements are needed to let SMS > work. > > > 1) Modify doloop_register_get or the similar > function > > > defined in doloop.c to recognize the loop count > > > register. I have supplied a patch about this in > April. > > > > Mustafa and I have a patch that has a similar > effect, see > > http://gcc.gnu.org/ml/gcc-patches/2005- > 06/msg00035.html. > > > > > 2) Use more precise way to calculate the values of > the > > > two kind of cycles, or just ignore this benefit > assertion. > > > > Probably need to be more precise :-/ > > > > When I manually hacked modulo-sched.c to ignore this > test, I > > did see loops getting scheduled, but I also ran into > ICEs in > > cfglayout. > > There are no ICEs for pi.f90, swim.f, and mgrid.f > according to my test. 
But an internal compiler error, 'unrecognizable insn', is produced by
'gen_sub2_insn', which explicitly decrements 'ar.lc', when swim.f and
mgrid.f are being compiled.

> > 3) The counted loop register 'ar.lc' of IA-64 cannot be updated
> > directly. Another temporary register is needed to evaluate the
> > value of the actual loop count after SMS scheduling, and assign
> > its value to 'ar.lc'.
>
> Actually, should SMS just not update the loop register in place?
> I never figured out why it tries to produce a sub insn (using
> gen_sub2_insn which is also wrong btw).

The current implementation of SMS does not use IA-64's epilog register
(ar.ec). After SMS, the loop count is just used to control the number of
times the kernel code executes, and the kernel code will execute

  loop_count - (stage_count - 1)

times. The sub insn generated by gen_sub2_insn is used to produce this
value.

> Gr.
> Steven

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Function Inlining for FORTRAN
Hi, all

Function inlining for FORTRAN programs always fails. If no one is
working on it, I will give it a try. Would you please give me some
clues?

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: Function Inlining for FORTRAN
Paul Brook <[EMAIL PROTECTED]>:

> On Wednesday 20 July 2005 15:35, Canqun Yang wrote:
> > Hi, all
> >
> > Function inlining for FORTRAN programs always fails.
>
> Not entirely true. Inlining of contained procedures works fine (or it
> did last time I checked). This should include inlining of siblings
> within a module.
>
> > If no one engages in it, I will give it a try. Would you please give
> > me some clues?
>
> The problem is that each top level program unit (PU)[1] is compiled
> separately. Each PU has its own "external" decls for all function
> calls, even if the function happens to be in the same file. Thus each
> PU is an isolated, self-contained tree structure, and the callgraph
> doesn't know the definition and declaration are actually the same
> thing.
>
> Basically what you need to do is parse the whole file, then start
> generating code.
>
> Unfortunately this isn't simple (or it would have been fixed already!).
> Unlike C, Fortran doesn't have file-level scope. It makes absolutely no
> difference whether two procedures are in the same file, or in different
> files. You get all the problems that multifile IPA in C experiences
> within a single Fortran file.
>
> The biggest problem is type consistency and aliasing. Consider the
> following

I have several FORTRAN 77 programs. After inlining the small functions
in them by hand, I saw great performance improvements. So I need a trial
implementation of function inlining to verify its effectiveness.

Now, my question is: if we just take the FORTRAN 77 syntax into account
(no derived types, no complex aliasing), would it be simpler to
implement function inlining for FORTRAN 77?

> Paul

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
Re: IPA branch
Hi,

The patch from Michael Matz
(http://gcc.gnu.org/ml/fortran/2005-07/msg00331.html) may partly fix the
multiple-decls problem. I've tested and tuned this patch. It works:
small functions can be inlined after the DECL_INLINE flags
(build_function_decl in trans-decl.c) have been set for them. The only
regression is the FORTRAN 95 testcase function_modulo_1.f90, which
produces a wrong result.

Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
[patch] Improve loop array prefetch for IA-64
Hi, all

This patch results in a performance increase of 4% for SPECfp2000 and
13% for the NAS benchmark suite on an Itanium-2 system, respectively.
More performance gains can be hoped for by further tuning the parameters
and improving the prefetch algorithm at the tree level.

Details of the NAS benchmarks are listed below.

GCC options: -O3 -fprefetch-loop-arrays
Target: Itanium-2 1.6GHz; L2 Cache 256K, L3 Cache 6M

Execution times in seconds:

          -this patch  +this patch
bt.W        14.43        14.17
cg.A        13.76         6.86
ep.W         7.83         7.79
ft.A        18.73        20.15
is.B        11.85        10.94
lu.W        20.55        20.27
mg.A        15.09        11.86
sp.W        37.11        35.49
geomean     15.84        13.94
speedup     13.68%

2006-06-02  Canqun Yang  <[EMAIL PROTECTED]>

	* config/ia64/ia64.h (SIMULTANEOUS_PREFETCHES): Define to 18.
	(PREFETCH_BLOCK): Define to 128.
	(PREFETCH_LATENCY): Define to 400.

Index: ia64.h
===================================================================
--- ia64.h	(revision 114307)
+++ ia64.h	(working copy)
@@ -1985,13 +1985,18 @@
    ??? This number is bogus and needs to be replaced before the value is
    actually used in optimizations.  */
 
-#define SIMULTANEOUS_PREFETCHES 6
+#define SIMULTANEOUS_PREFETCHES 18
 
 /* If this architecture supports prefetch, define this to be the size of
    the cache line that is prefetched.  */
 
-#define PREFETCH_BLOCK 32
+#define PREFETCH_BLOCK 128
 
+/* A number that should roughly correspond to the number of instructions
+   executed before the prefetch is completed.  */
+
+#define PREFETCH_LATENCY 400
+
 #define HANDLE_SYSV_PRAGMA 1
 
 /* A C expression for the maximum number of instructions to execute via

Canqun Yang
Re: [patch] Improve loop array prefetch for IA-64
--- Andrey Belevantsev <[EMAIL PROTECTED]>:

> Canqun Yang wrote:
> > Hi, all
> >
> > This patch results in a performance increase of 4% for SPECfp2000 and
> > 13% for the NAS benchmark suite on an Itanium-2 system, respectively.
> > More performance gains can be hoped for by further tuning the
> > parameters and improving the prefetch algorithm at the tree level.
>
> Hi Canqun,
>
> It's great news that you continued to work on prefetching tuning for
> ia64! Do you plan to port your other changes for the old RTL
> prefetching to the tree level?

Yes. But I don't have much time to do it now; I am busy with other
things.

> > @@ -1985,13 +1985,18 @@
> >    ??? This number is bogus and needs to be replaced before the value is
> >    actually used in optimizations.  */
>
> I suggest to remove this comment as it has become outdated with your
> patch. Instead you might say how you chose this particular value
> (and PREFETCH_BLOCK too). Just my 2c.
>
> Andrey

Please refer to my previous mail and the attached paper.

Canqun Yang
RE: [patch] Improve loop array prefetch for IA-64
--- "Davis, Mark" <[EMAIL PROTECTED]>: > Canqun, > > Nice job getting this ready for the current version of gcc! > > Question: does gcc now know the difference between prefetching to cache L1 > via "lfetch", as > opposed to prefetching only to level L2 via "lfetch.nt1"? For floating point > data, the latter > is the only interesting case because float loads only access the L2. Thus > using "lfetch" for > floating point arrays will unnecessarily wipe out the contents of L1. (gcc > 3.2.3 only seems to > generate "lfetch", which is why I ask...) > Yes, GCC does. I have tried this on the old prefetch implementation at RTL level and the new one at TREE level, but no significant performance difference for SPECfp2000 and NAS benchmarks. Nevertheless, it worth taking more time to inspect it. Canqun Yang > Thanks, > Mark > > -Original Message- > From: Canqun Yang [mailto:[EMAIL PROTECTED] > Sent: Friday, June 02, 2006 5:14 AM > To: gcc@gcc.gnu.org; [EMAIL PROTECTED] > Subject: [patch] Improve loop array prefetch for IA-64 > > Hi, all > > This patch results a performance increase of 4% for SPECfp2000 and 13% for > NAS benchmark suite > on > Itanium-2 system, respectively. More performance increase is hopeful by > further tuning the > parameters and improving the prefetch algorithm at tree level. > > > Canqun Yang > > __ 赶快注册雅虎超大容量免费邮箱? http://cn.mail.yahoo.com
The execution times of each function call in call graph
Hi, all

Is there any way to get the (estimated) number of times each function
call executes during the IPA passes?

Currently in GCC, the loop information can only be formed after the
tree-ssa passes by calling loop_optimizer_init, so it is impossible to
estimate the execution count of a function call when the IPA
optimizations, like inlining, are run. Am I right?

Canqun
relocation truncated to fit
Hi, all

Can anyone help me to resolve this problem?

When I compile a program with a .bss segment larger than 2.0GB, I get
the following error message from the GNU linker (binutils-2.15).

(.text+0x305): In function `sta_':
: relocation truncated to fit: R_X86_64_32S plot_
..

I upgraded the assembler and the linker to binutils-2.17, then got the
message below.

STA.o: In function `sta_':
STA.F:(.text+0x305): relocation truncated to fit: R_X86_64_32S against symbol `plot_' defined in COMMON section in STA.o

So, I modified binutils-2.17/bfd/elf64-x86-64.c and rebuilt the linker
to ignore the relocation errors. Though the executable was generated, a
segmentation fault occurred during execution.

Here is the configuration of my computer:

CPU: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
OS: Linux mds 2.6.9-34.EL_lustre.1.4.6.1custom #3 SMP Fri Jul 13 15:27:27 CST 2007 x86_64 x86_64 x86_64 GNU/Linux
Compiler: Intel C++/Fortran compiler for Linux 10.0

I also wrote a program with large uninitialized data -- more than 2.0GB.
It passes after being linked with the modified linker. The source code
is appended.

#include <stdio.h>

#define N 0x05fff

double a[N][N];

int
main ()
{
  int i, j;
  double sum;

  for (i = 0; i < N; i += 5)
    for (j = 0; j < N; j += 5)
      a[i][j] = 2*i*j + i*i + j*j;

  sum = 0.0;
  for (i = 0; i < N; i += 5)
    for (j = 0; j < N; j += 5)
      sum += a[i][j];

  printf ("%f\n", sum);
  return 0;
}

Best regards,

Canqun Yang
Re: relocation truncated to fit
Hi, Guenther

It works. Thank you very much!

Canqun Yang

--- Richard Guenther <[EMAIL PROTECTED]>:

> On 7/26/07, Canqun Yang <[EMAIL PROTECTED]> wrote:
> > Hi, all
> >
> > Can anyone help me to resolve this problem?
> >
> > When I compile a program with a .bss segment larger than 2.0GB, I get
> > the following error message from the GNU linker (binutils-2.15).
> >
> > (.text+0x305): In function `sta_':
> > : relocation truncated to fit: R_X86_64_32S plot_
>
> Try using -mcmodel=medium
>
> Richard.