Why does gcc generate const local array on stack?
Hi, I came across the following issue:

int foo (int N)
{
  const int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
  return a[N];
}

Compiled for x86 at -O2:

foo:
.LFB0:
	.cfi_startproc
	movslq	%edi, %rdi
	movl	$0, -56(%rsp)
	movl	$1, -52(%rsp)
	movl	$2, -48(%rsp)
	movl	$3, -44(%rsp)
	movl	$4, -40(%rsp)
	movl	$5, -36(%rsp)
	movl	$6, -32(%rsp)
	movl	$7, -28(%rsp)
	movl	$8, -24(%rsp)
	movl	$9, -20(%rsp)
	movl	-56(%rsp,%rdi,4), %eax
	ret

The array is placed on the stack, so GCC has to generate a sequence of instructions to initialize it every time the function is called. LLVM, on the contrary, moves the array to global data and needs no initialization within the function. If I add static to the array, GCC behaves the same as LLVM, just as expected. Is there some subtle C standard issue, or some switch I didn't turn on? I understand that if this function were recursive and a pointer to the array were involved, GCC would have to maintain the array on the stack and hence the initialization. But this code is very simple. I don't understand the logic of the generated code, or maybe this is a missed optimization opportunity?

Thanks,
Bingfeng Mei
Re: Re: Why does gcc generate const local array on stack?
I agree with you on this example. But my original code, as Jonathan pointed out, is not recursive, and the address of "a" does not escape the function in any way. I believe the transformation is valid there. BTW, LLVM still compiles your example by moving the const array to rodata, which I think is wrong and will fail the test.

Cheers,
Bingfeng

On Thu, Apr 21, 2016 at 3:41 AM, lh_mouse wrote:
> See this example: http://coliru.stacked-crooked.com/a/048b4aa5046da11b
>
> In this example the function is called recursively.
> During each call a pointer to that local array is appended to a static array
> of pointers.
> Should a new instance of that local array of const int be created every time,
> abort() will never be called.
> Since calling a library function is observable behavior, clang's optimization
> has effectively changed that program's behavior. Hence I think it is wrong.
>
> [code]
> #include <stdlib.h>
>
> static const int *ptrs[2];
> static unsigned recur;
>
> void foo(){
>     const int a[] = {0,1,2,3,4,5,6,7,8,9};
>     ptrs[recur] = a;
>     if(recur == 0){
>         ++recur;
>         foo();
>     }
>     if(ptrs[0] == ptrs[1]){
>         abort();
>     }
> }
>
> int main(){
>     foo();
> }
> [/code]
>
> --
> Best regards,
> lh_mouse
> 2016-04-21
>
> -----
> From: Jonathan Wakely
> Date: 2016-04-21 01:51
> To: lh_mouse
> Cc: Bingfeng Mei, gcc
> Subject: Re: Why does gcc generate const local array on stack?
>
> On 20 April 2016 at 18:31, lh_mouse wrote:
>> I tend to say clang is wrong here.
>
> If you can't detect the difference then it is a valid transformation.
>
>> Your identifier 'a' has no linkage. Your object designated by 'a' does not
>> have a storage-class specifier.
>> So it has automatic storage duration and 6.2.4/7 applies: 'If the scope is
>> entered recursively, a new instance of the object is created each time.'
>
> How do you tell the difference between a const array that is recreated
> each time and one that isn't?
>
>> Interestingly enough, ISO C doesn't say whether distinct objects should have
>> distinct addresses.
>> It is worth noting that this is explicitly forbidden in ISO C++ because
>> distinct complete objects shall have distinct addresses:
>
> If the object's address doesn't escape from the function then I can't
> think of a way to tell the difference.
Re: Re: Why does gcc generate const local array on stack?
Richard, thanks for the explanation. I found an option, -fmerge-all-constants, which can help me work around this for now.

Bingfeng

On Thu, Apr 21, 2016 at 11:15 AM, Richard Biener wrote:
> On Thu, Apr 21, 2016 at 11:39 AM, Jonathan Wakely wrote:
>> On 21 April 2016 at 03:41, lh_mouse wrote:
>>> See this example: http://coliru.stacked-crooked.com/a/048b4aa5046da11b
>>>
>>> In this example the function is called recursively.
>>
>> See the original email you replied to:
>>
>> "I understand if this function is recursive and pointer of the array
>> is involved, GCC would have to maintain the array on stack and hence
>> the initialization."
>>
>> The question is about cases where that doesn't happen.
>
> The decision on whether to localize the array and inline the init is
> done at gimplification time.
> The plan is to delay this until SRA, which could then also apply the
> desired optimization of removing the local in case it is never written to.
>
> Richard.
Why is this not optimized?
Hi,
I am looking at some code for our target, which is not optimized as expected. For the following RTL, I expect the source of insn 17 to be propagated into insn 20, with insn 17 eliminated as a result. On our target, it would then become one predicated xor instruction instead of two. Initially, I thought the fwprop pass should do this.

(insn 17 16 18 3 (set (reg/v:HI 102 [ crc ])
        (xor:HI (reg/v:HI 108 [ crc ])
            (const_int 16386 [0x4002]))) coremark.c:1632 725 {xorhi3}
     (nil))
(insn 18 17 19 3 (set (reg:BI 113)
        (ne:BI (reg:QI 101 [ D.4446 ])
            (const_int 1 [0x1]))) 1397 {cmp_qimode}
     (nil))
(jump_insn 19 18 55 3 (set (pc)
        (if_then_else (ne (reg:BI 113)
                (const_int 0 [0]))
            (label_ref 23)
            (pc))) 1477 {cbranchbi4}
     (expr_list:REG_DEAD (reg:BI 113)
        (expr_list:REG_BR_PROB (const_int 7100 [0x1bbc])
            (expr_list:REG_PRED_WIDTH (const_int 1 [0x1])
                (nil)))))
 -> 23)
(note 55 19 20 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 20 55 23 4 (set (reg:HI 112 [ crc ])
        (reg/v:HI 102 [ crc ])) 502 {fp_movhi}
     (expr_list:REG_DEAD (reg/v:HI 102 [ crc ])
        (nil)))
(code_label 23 20 56 5 2 "" [1 uses])

But it can't. First, propagate_rtx_1 returns false because PR_CAN_APPEAR is not set, and the following code is executed:

  if (x == old_rtx)
    {
      *px = new_rtx;
      return can_appear;
    }

Even if I force PR_CAN_APPEAR to be set in flags, fwprop still won't go ahead in try_fwprop_subst, because old_cost is 0 (a REG-only rtx) and set_src_cost (SET_SRC (set), speed) is bigger than 0. So the change is deemed not profitable, which is not correct IMO.

If fwprop is not the place to do this optimization, where should it be done? I am working on up-to-date GCC 4.8.

Thanks,
Bingfeng Mei
RE: Why is this not optimized?
Thanks for the reply. I will look at the patch. As far as the cost is concerned, I don't think fwprop really needs to understand the pipeline model. As long as the rtx cost after the optimization is less than before it, I think that is good enough. Of course it won't be better in every case, but it should be better in general.

Cheers,
Bingfeng

-----Original Message-----
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: 15 May 2014 06:59
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Why is this not optimized?

On Wed, May 14, 2014 at 9:14 PM, Bingfeng Mei wrote:
> Hi,
> I am looking at some code of our target, which is not optimized as expected.
> For the following RTL, I expect source of insn 17 should be propagated into
> insn 20, and insn 17 is eliminated as a result. On our target, it will become
> a predicated xor instruction instead of two. Initially, I thought fwprop pass
> should do this.
>
> (insn 17 16 18 3 (set (reg/v:HI 102 [ crc ])
>         (xor:HI (reg/v:HI 108 [ crc ])
>             (const_int 16386 [0x4002]))) coremark.c:1632 725 {xorhi3}
>      (nil))
> (insn 18 17 19 3 (set (reg:BI 113)
>         (ne:BI (reg:QI 101 [ D.4446 ])
>             (const_int 1 [0x1]))) 1397 {cmp_qimode}
>      (nil))
> (jump_insn 19 18 55 3 (set (pc)
>         (if_then_else (ne (reg:BI 113)
>                 (const_int 0 [0]))
>             (label_ref 23)
>             (pc))) 1477 {cbranchbi4}
>      (expr_list:REG_DEAD (reg:BI 113)
>         (expr_list:REG_BR_PROB (const_int 7100 [0x1bbc])
>             (expr_list:REG_PRED_WIDTH (const_int 1 [0x1])
>                 (nil)))))
>  -> 23)
> (note 55 19 20 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 20 55 23 4 (set (reg:HI 112 [ crc ])
>         (reg/v:HI 102 [ crc ])) 502 {fp_movhi}
>      (expr_list:REG_DEAD (reg/v:HI 102 [ crc ])
>         (nil)))
> (code_label 23 20 56 5 2 "" [1 uses])
>
> But it can't. First propagate_rtx_1 will return false because PR_CAN_APPEAR
> is false and the following code is executed.
>
>   if (x == old_rtx)
>     {
>       *px = new_rtx;
>       return can_appear;
>     }
>
> Even if I force PR_CAN_APPEAR to be set in flags, fwprop still won't go ahead in
> try_fwprop_subst because old_cost is 0 (REG only rtx), and set_src_cost
> (SET_SRC (set), speed) is bigger than 0. So the change is deemed as not
> profitable, which is not correct IMO.

Pass fwprop is too conservative with respect to propagation
opportunities outside of memory references; it just gives up in many
places. Also, as in your case, it seems it doesn't take into
consideration that the original insn can be removed after propagation.
Wei Mi once sent a patch re-implementing the fwprop pass at
https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00617.html . I also did
some experiments and worked out a local patch doing similar work to
handle cases exactly like yours. The problem is that even though one
instruction can be saved (as in your case), it's not always good,
because it tends to generate more complex instructions, and such insns
are somehow more vulnerable to pipeline hazards. Unfortunately, it's
kind of impossible for fwprop to understand the pipeline risk.

Thanks,
bin

> If fwprop is not the place to do this optimization, where should it be done?
> I am working on up-to-date GCC 4.8.
>
> Thanks,
> Bingfeng Mei

--
Best Regards.
RE: Register Pressure guided Unroll and Jam in GCC !!
That is true. Early estimation of register pressure should be improved. Right now I am looking at an example where IVOPTS produces too many induction variables and causes a lot of register spilling. Though the ivopts pass calls the estimate_reg_pressure_cost function, the results are not even close to the real situation.

Bingfeng

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Vladimir Makarov
Sent: 16 June 2014 19:37
To: Ajit Kumar Agarwal; gcc@gcc.gnu.org
Cc: Michael Eager; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Register Pressure guided Unroll and Jam in GCC !!

On 2014-06-16, 10:14 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler, where Register Pressure Guided
> Unroll and Jam gave a good amount of performance improvement on the C and
> C++ SPEC benchmarks and also the Fortran benchmarks.
>
> Unroll and Jam increases the register pressure in the unrolled loop,
> leading to an increase in spills and fetches that degrades the performance
> of the unrolled loop. The cache-locality benefit achieved through Unroll
> and Jam is degraded by the presence of spill instructions due to increased
> register pressure. It's better to base the decision on the unroll factor
> of the loop on a performance model of the register pressure.
>
> Most loop optimizations like Unroll and Jam are implemented in the
> high-level IR. Register-pressure-based Unroll and Jam requires the
> calculation of register pressure in the high-level IR, similar to the
> register pressure we calculate during register allocation. This makes the
> implementation complex.
>
> To overcome this, the Open64 compiler makes the unrolling decision both in
> the high-level IR and at the code generation level, deferring some of the
> decisions to the end of code generation. The advantage of this approach in
> Open64 is that it can use the register pressure information calculated by
> the register allocator. This makes the implementation much simpler and
> less complex.
>
> Can we have this approach in GCC, with Unroll and Jam decisions in the
> high-level IR and some decisions deferred to the code generation level
> like Open64?
>
> Please let me know what you think.
>
Most loop optimizations are a good target for register-pressure-sensitive
algorithms, as loops are usually program hot spots and any pressure
increase there would be harmful, since the RA cannot undo such complex
transformations. So I guess your proposal could work. Right now we have
only pressure-sensitive modulo scheduling (SMS) and loop-invariant motion
(as I remember, switching loop-invariant motion from a very inaccurate
register-pressure evaluation to one based on the RA's pressure evaluation
gave a nice improvement of about 1% for SPECFP2000 on some targets).
regs_used estimation in IVOPTS seriously flawed
Hi,
I am looking at a performance regression in our code. A big loop produces and uses a lot of temporary variables inside the loop body. The problem appears to be that the IVOPTS pass creates even more induction variables (from the original 2 to 27). It causes a lot of register spilling later, and performance takes a severe hit. I looked into tree-ssa-loop-ivopts.c; it does call the estimate_reg_pressure_cost function to take the number of registers into consideration. The second parameter, passed as data->regs_used, is supposed to represent the old register usage before IVOPTS:

  return size + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
                                            data->body_includes_call);

In this case, it is a mere 2 by the following calculation. Essentially, it only counts loop-invariant registers, ignoring all registers produced/used inside the loop:

  n = 0;
  for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next (&psi))
    {
      phi = gsi_stmt (psi);
      op = PHI_RESULT (phi);

      if (virtual_operand_p (op))
        continue;

      if (get_iv (data, op))
        continue;

      n++;
    }

  EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, j, bi)
    {
      struct version_info *info = ver_info (data, j);

      if (info->inv_id && info->has_nonlin_use)
        n++;
    }

  data->regs_used = n;

I believe the way regs_used is calculated is seriously flawed, or estimate_reg_pressure_cost is problematic if n_old is only supposed to be the loop-invariant registers. Either way, it affects how IVOPTS makes decisions and can result in worse code. What do you think? Any idea on how to improve this?

Thanks,
Bingfeng
RE: regs_used estimation in IVOPTS seriously flawed
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: 18 June 2014 12:36
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
>
> On Tue, Jun 17, 2014 at 4:59 PM, Bingfeng Mei wrote:
> > Hi,
> > I am looking at a performance regression in our code. A big loop produces
> > and uses a lot of temporary variables inside the loop body. The problem
> > appears that IVOPTS pass creates even more induction variables (from original
> > 2 to 27). It causes a lot of register spilling later and performance
> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> > estimate_reg_pressure_cost function to take # of registers into
> > consideration. The second parameter passed as data->regs_used is supposed
> > to represent old register usage before IVOPTS.
> >
> >   return size + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
> >                                             data->body_includes_call);
> >
> > In this case, it is mere 2 by following calculation. Essentially, it only counts
> > all loop invariant registers, ignoring all registers produced/used inside the loop.
> >
> >   n = 0;
> >   for (psi = gsi_start_phis (loop->header); !gsi_end_p (psi); gsi_next (&psi))
> >     {
> >       phi = gsi_stmt (psi);
> >       op = PHI_RESULT (phi);
> >
> >       if (virtual_operand_p (op))
> >         continue;
> >
> >       if (get_iv (data, op))
> >         continue;
> >
> >       n++;
> >     }
> >
> >   EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, j, bi)
> >     {
> >       struct version_info *info = ver_info (data, j);
> >
> >       if (info->inv_id && info->has_nonlin_use)
> >         n++;
> >     }
> >
> >   data->regs_used = n;
> >
> > I believe how regs_used is calculated is seriously flawed,
> > or estimate_reg_pressure_cost is problematic if n_old is
> > only supposed to be loop invariant registers. Either way,
> > it affects how IVOPTS makes decision and could result in
> > worse code. What do you think? Any idea on how to improve
> > this?
>
> Well, it's certainly a lower bound on the number of registers
> live through the whole loop execution (thus over the backedge).
> So they have the same cost as an induction variable as far
> as register pressure is concerned.
>
> What it doesn't account for is the maximum number of live
> registers anywhere in the loop body - but that is hard to
> estimate at this point in the compilation. You could compute
> the maximum number of live SSA names, which could be
> an upper bound on the register pressure - but that needs
> liveness analysis, which is expensive; also that upper bound
> is probably way too high.
>
Yes, I agree it is hard and probably expensive at this stage of compilation to do an accurate analysis. But even a half-accurate estimation of register pressure could be quite useful for many tree-level loop optimizations, as also discussed in another thread a few days ago.

> So I think the current logic is sensible and simple. It's just
> not perfect.
>
> Maybe it's just the cost function of the IV set chosen that
> needs to be adjusted to account for the number of IVs
> in a non-linear way? That is, adjust ivopts_global_cost_for_size,
> which just adds size, to something that pessimizes more IVs even
> more, like size * (1 + size / (1 + data->regs_used)) or
> simply size ** (1. + eps) with a suitable eps < 2.
>
I am going to try a few cost functions as you suggested. Maybe also just count all SSA names and divide by a factor.

Thanks,
Bingfeng
RE: regs_used estimation in IVOPTS seriously flawed
> -----Original Message-----
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: 20 June 2014 06:25
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
>
> On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei wrote:
> > Hi,
> > I am looking at a performance regression in our code. A big loop produces
> > and uses a lot of temporary variables inside the loop body. The problem
> > appears that IVOPTS pass creates even more induction variables (from original
> > 2 to 27). It causes a lot of register spilling later and performance
> Do you have a simplified case which can be posted here? I guess it
> affects some other targets too.
>
> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> > estimate_reg_pressure_cost function to take # of registers into
> > consideration. The second parameter passed as data->regs_used is supposed
> > to represent old register usage before IVOPTS.
> >
> >   return size + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
> >                                             data->body_includes_call);
> >
> > In this case, it is mere 2 by following calculation. Essentially, it only counts
> > all loop invariant registers, ignoring all registers produced/used inside the loop.
> There are two kinds of registers produced/used inside the loop. One
> is induction variable irrelevant; it includes non-linear uses as
> mentioned by Richard. The other kind relates to induction variable
> rewrite, and one issue with this kind is that the expression generated
> during iv use rewriting does not reflect the estimated one in ivopt very
> well.
>
As a short-term solution, I tried some simple non-linear functions as Richard suggested to penalize using too many IVs. For example, the following cost in ivopts_global_cost_for_size fixed my regression and actually improves performance slightly over a set of benchmarks we usually use.

  return size * (1 + size * 0.2)
         + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
                                       data->body_includes_call);

The trouble is that the choice of this non-linear function could be highly target dependent (# of registers?). I don't have a setup to prove performance gains for other targets.

I also tried counting all SSA names and divide it by a factor. It does seem to work so well.

Long term, if we have infrastructure to analyze the maximal number of live variables in a loop at tree level, that would be great for many loop optimizations.

Thanks,
Bingfeng
RE: regs_used estimation in IVOPTS seriously flawed
Sorry, typo in previous mail. "I also tried counting all SSA names and divide it by a factor. It does NOT seem to work so well."

> -----Original Message-----
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: 20 June 2014 10:19
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: regs_used estimation in IVOPTS seriously flawed
>
> On Fri, Jun 20, 2014 at 5:01 PM, Bingfeng Mei wrote:
> >
> >> -----Original Message-----
> >> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> >> Sent: 20 June 2014 06:25
> >> To: Bingfeng Mei
> >> Cc: gcc@gcc.gnu.org
> >> Subject: Re: regs_used estimation in IVOPTS seriously flawed
> >>
> >> On Tue, Jun 17, 2014 at 10:59 PM, Bingfeng Mei wrote:
> >> > Hi,
> >> > I am looking at a performance regression in our code. A big loop produces
> >> > and uses a lot of temporary variables inside the loop body. The problem
> >> > appears that IVOPTS pass creates even more induction variables (from
> >> > original 2 to 27). It causes a lot of register spilling later and
> >> > performance
> >> Do you have a simplified case which can be posted here? I guess it
> >> affects some other targets too.
> >>
> >> > take a severe hit. I looked into tree-ssa-loop-ivopts.c, it does call
> >> > estimate_reg_pressure_cost function to take # of registers into
> >> > consideration. The second parameter passed as data->regs_used is
> >> > supposed to represent old register usage before IVOPTS.
> >> >
> >> >   return size + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
> >> >                                             data->body_includes_call);
> >> >
> >> > In this case, it is mere 2 by following calculation. Essentially, it
> >> > only counts all loop invariant registers, ignoring all registers
> >> > produced/used inside the loop.
> >> There are two kinds of registers produced/used inside the loop. One
> >> is induction variable irrelevant, it includes non-linear uses as
> >> mentioned by Richard. The other kind relates to induction variable
> >> rewrite, and one issue with this kind is expression generated during
> >> iv use rewriting is not reflecting the estimated one in ivopt very
> >> well.
> >>
> >
> > As a short term solution, I tried some simple non-linear functions as
> > Richard suggested
>
> Oh, I misread the non-linear way as non-linear iv uses.
>
> > to penalize using too many IVs. For example, the following cost in
> > ivopts_global_cost_for_size fixed my regression and actually improves
> > performance slightly over a set of benchmarks we usually use.
>
> Great, I will try to tweak it on ARM.
>
> >
> >   return size * (1 + size * 0.2)
> >          + estimate_reg_pressure_cost (size, data->regs_used, data->speed,
> >                                        data->body_includes_call);
> >
> > The trouble is choice of this non-linear function could be highly
> > target dependent (# of registers?). I don't have setup to prove
> > performance gain for other targets.
> >
> > I also tried counting all SSA names and divide it by a factor. It does
> > seem to work
>
> So the number currently computed is the lower bound, which is too
> small. Maybe it's possible to do some analysis with relatively low
> cost, increasing the number somehow, while on the other hand not
> bringing restrictions to IVOPT for loops with low register pressure.
>
> Thanks,
> bin
>
> > so well.
> >
> > Long term, if we have infrastructure to analyze maximal live variable
> > in a loop at tree-level, that would be great for many loop optimizations.
> >
> > Thanks,
> > Bingfeng
>
> --
> Best Regards.
RE: Comparison of GCC-4.9 and LLVM-3.4 performance on SPECInt2000 for x86-64 and ARM
Thanks for the nice benchmarks, Vladimir. Why is GCC's code size so much bigger than LLVM's? Does -Ofast do more unrolling on GCC? The increased code size doesn't seem to help performance (164.gzip & 197.parser). Are there comparisons for -O2? I guess that is more useful for typical mobile/embedded programmers.

Bingfeng

> -----Original Message-----
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Vladimir Makarov
> Sent: 24 June 2014 16:07
> To: Ramana Radhakrishnan; gcc@gcc.gnu.org
> Subject: Re: Comparison of GCC-4.9 and LLVM-3.4 performance on SPECInt2000 for x86-64 and ARM
>
> On 06/24/2014 10:57 AM, Ramana Radhakrishnan wrote:
> >
> > The ball-park number you have probably won't change much.
> >
> >> Unfortunately, that is the configuration I can use on my system because
> >> of lack of libraries for other configurations.
> >
> > Using --with-fpu={neon / neon-vfpv4} shouldn't cause you ABI issues
> > with libraries for any other configurations. neon / neon-vfpv4 enable
> > use of the neon unit in a manner that is ABI compatible with the rest
> > of the system.
> >
> > For more on command line options for AArch32 and how they map to
> > various CPUs, you might find this blog interesting.
> >
> > http://community.arm.com/groups/tools/blog/2013/04/15/arm-cortex-a-processors-and-gcc-command-lines
> >
> >> I don't think Neon can improve the score for SPECInt2000 significantly,
> >> but maybe I am wrong.
> >
> > It won't probably improve the overall score by a large amount but some
> > individual benchmarks will get some help.
> >
> There are some few benchmarks which benefit from autovectorization (eon
> particularly).
> >>> Did you add any other architecture specific options to your SPEC2k
> >>> runs ?
> >>>
> >> No. The only option I used is -Ofast.
> >>
> >> Could you recommend the best options you think I should use for this
> >> processor?
> >
> > I would personally use --with-cpu=cortex-a15 --with-fpu=neon-vfpv4
> > --with-float=hard on this processor as that maps to the processor
> > available on that particular piece of silicon.
> Thanks, Ramana. Next time, I'll try these options.
> >
> > Also, given it's a big.LITTLE system with probably kernel switching,
> > it may be better to also make sure that you are always running on the
> > big core.
> >
> The results are pretty stable. Also, this version of Fedora does not
> implement switching from big to little processors.
ivdep pragma not used in ddg.c?
Hi,
I noticed that recent GCC adds ivdep pragma support. We have had our own implementation of ivdep for a couple of years now. The GCC implementation is much cleaner, and we want to migrate to it. Ivdep is consumed in two places in our implementation: one is tree-vect-data-refs.c, used by the vectorizer; the other is ddg.c, used by the modulo scheduler. In the GCC implementation, the former is the same, but ddg.c doesn't consume ivdep information at all. I think it is important not to draw redundant cross-iteration dependences when ivdep is specified, in order to improve modulo scheduling performance.

Looking at the code, I wonder whether loop->safelen still keeps the correct information, and whether the loop structure remains correct, after so many tree/RTL passes. For example, in sms-schedule of modulo-sched.c:

  loop_optimizer_init (LOOPS_HAVE_PREHEADERS | LOOPS_HAVE_RECORDED_EXITS);

Does this mean the loop structure is reinitialized? I know there is a flag (PROP_loops) which is supposed to preserve the loop structure, but I am not sure what happens after all the loop transformations (unrolling, peeling, etc.). Is there a stage where the loop structure is rebuilt and we lose the safelen (ivdep) information, or is it still safe to use in the modulo scheduling pass?

Thanks,
Bingfeng
RE: Vector modes and the corresponding width integer mode
I don't think it is required. For example, the PowerPC port supports V8SImode, but I don't see OImode. It can just sometimes come in handy to have the equal-size scalar mode.

Cheers,
Bingfeng

> -----Original Message-----
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Matthew Fortune
> Sent: 11 December 2014 13:27
> To: gcc@gcc.gnu.org
> Subject: Vector modes and the corresponding width integer mode
>
> Hi,
>
> I'm working on MIPS SIMD support for MSA. Can anyone point me towards
> information about the need for an integer mode of equal size to any
> supported vector mode?
>
> I.e. if I support V4SImode, is there any core GCC requirement that
> TImode is also supported?
>
> Any guidance is appreciated. The MIPS port already has limited support
> for TImode for 64-bit targets, which makes it all the more difficult to
> figure out if there is a relationship between vector modes and integer
> modes.
>
> Thanks,
> Matthew
LLVM disagrees with GCC on bitfield handling
Hi,
Sorry if this question has been raised in the past. I am running the GCC testsuite for our LLVM port. There are several failures related to bit-field handling (pr32244-1.c, bitfld-3.c, bitfld-5.c, etc.) where LLVM disagrees with GCC. Taking pr32244-1.c as an example:

struct foo
{
  unsigned long long b:40;
} x;

extern void abort (void);

void test1(unsigned long long res)
{
  /* The shift is carried out in 40 bit precision.  */
  if (x.b<<32 != res)
    abort ();
}

int main()
{
  x.b = 0x0100;
  test1(0);
  return 0;
}

The target machine has a 32-bit int and a 64-bit long long. GCC expects the shift to be performed in 40-bit precision (see the comment above), whereas LLVM first casts x.b to 64-bit unsigned long long and does the shift/comparison afterwards. I checked the standard. It says a shift performs integer promotion first, which doesn't apply here because 40 bits > int, so GCC's approach seems to make sense. On the other hand, you can argue that when a bit-field is loaded, it is converted to its declared type first (unsigned long long here) before the arithmetic operation; the C standard doesn't define arithmetic on arbitrary data widths, so it would need to operate on the original declared type. I am confused about which approach conforms to the standard, or whether this is just a grey area not well defined by the standard. Any suggestion is greatly appreciated.

Cheers,
Bingfeng Mei
Re: LLVM disagrees with GCC on bitfield handling
Hi, Joseph,
Thanks for the detailed explanation.

Cheers,
Bingfeng

On Thu, Oct 26, 2017 at 5:11 PM, Joseph Myers wrote:
> There is a line of C90 DRs and associated textual history (compare the
> relevant text in C90 and C99, or see my comparison of it in WG14 reflector
> message 11100 (18 Apr 2006)) to the effect of bit-fields acting like they
> have a type with the given number of bits; that line is what's followed by
> GCC for C. The choice of type for a bit-field (possibly separate from
> declared type) was left explicitly implementation-defined after DR#315;
> that is, if an implementation allows implementation-defined declared types
> as permitted by C99 and later, whether the actual type of the bit-field in
> question is the declared type or has the specified number of bits is also
> implementation-defined. The point in DR#120 regarding assignment to
> bit-fields still applies in C11: nothing other than the semantics of
> conversion to a type with the given number of bits defines how the value
> to be stored in a bit-field is computed if the stored value is not in
> range.
>
> C++ chose a different route from those C90 DRs, of the width explicitly
> not being part of the type of the bit-field. I don't know what if
> anything in C++ explicitly resolves the C90 DR#120 issue and defines the
> results of storing not-exactly-representable values in a bit-field.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com
Vector permutation only deals with # of vector elements same as mask?
Hi,
I noticed that vector permutation gets more use in GCC 4.6, which is great. It is now used to handle negative steps by reversing vector elements. However, after reading the related code, I understood that it only works when the number of vector elements is the same as that of the mask vector, per the following code in perm_mask_for_reverse (tree-vect-stmts.c):

  ...
  mask_type = get_vectype_for_scalar_type (mask_element_type);
  nunits = TYPE_VECTOR_SUBPARTS (vectype);
  if (!mask_type
      || TYPE_VECTOR_SUBPARTS (vectype) != TYPE_VECTOR_SUBPARTS (mask_type))
    return NULL;
  ...

For PowerPC altivec, the mask_type is V16QI. That means the compiler can only permute the V16QI type. But given the capability of the altivec vperm instruction, it can permute any 128-bit type (V8HI, V4SI, etc.). We would just need to convert in and out of V16QI for the given types, plus a bit of extra work in producing the mask.

Do I understand correctly, or am I missing something here?

Thanks,
Bingfeng Mei
RE: Vector permutation only deals with # of vector elements same as mask?
Thanks. Another question. Is there any plan to vectorize the loops like the following ones? for (i=127; i>=0; i--) { x[i] = y[i] + z[i]; } I found that GCC trunk still cannot handle negative step for store. Even it can, it won't be efficient by introducing redundant permutations on load and store. Cheers, Bingfeng > -Original Message- > From: Ira Rosen [mailto:i...@il.ibm.com] > Sent: 10 February 2011 17:22 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: Vector permutation only deals with # of vector elements > same as mask? > > > Hi, > > "Bingfeng Mei" wrote on 10/02/2011 05:35:45 PM: > > > > Hi, > > I noticed that vector permutation gets more use in GCC > > 4.6, which is great. It is used to handle negative step > > by reversing vector elements now. > > > > However, after reading the related code, I understood > > that it only works when the # of vector elements is > > the same as that of mask vector in the following code. > > > > perm_mask_for_reverse (tree-vect-stmts.c) > > ... > > mask_type = get_vectype_for_scalar_type (mask_element_type); > > nunits = TYPE_VECTOR_SUBPARTS (vectype); > > if (!mask_type > > || TYPE_VECTOR_SUBPARTS (vectype) != TYPE_VECTOR_SUBPARTS > (mask_type)) > > return NULL; > > ... > > > > For PowerPC altivec, the mask_type is V16QI. It means that > > compiler can only permute V16QI type. But given the capability of > > altivec vperm instruction, it can permute any 128-bit type > > (V8HI, V4SI, etc). We just need convert in/out V16QI from > > given types and a bit more extra work in producing mask. > > > > Do I understand correctly or miss something here? > > Yes, you are right. The support of reverse access is somewhat limited. > Please see vect_transform_slp_perm_load() in tree-vect-slp.c for > example of > all type permutation support. > > But, anyway, reverse accesses are not supported for altivec's load > realignment scheme. > > Ira > > > > > Thanks, > > Bingfeng Mei > > > > > > > > >
Why does GCC convert short operation to short unsigned?
Hi, I noticed that GCC converts short arithmetic to unsigned short. short foo2 (short a, short b) { return a - b; } In .gimple file: foo2 (short int a, short int b) { short int D.3347; short unsigned int a.0; short unsigned int b.1; short unsigned int D.3350; a.0 = (short unsigned int) a; b.1 = (short unsigned int) b; D.3350 = a.0 - b.1; D.3347 = (short int) D.3350; return D.3347; } Is this for some C standard conformance, or optimization purpose? This doesn't happen with int type. Thanks, Bingfeng Mei
Is this correct behaviour?
Hi, I compiled the following code with arm gcc 4.6 (x86 is similar with a 4.7 snapshot). I noticed "a" is written to memory three times instead of being incremented by 3 and written once at the end. Doesn't restrict guarantee that "a" won't be aliased to "p", so the three "a++" operations can be combined? Thanks, Bingfeng Mei int a; int P[100]; void foo (int * restrict p) { P[0] = *p; a++; P[1] = *p; a++; P[2] = *p; a++; } ~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99 foo: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r3, .L2 ldr r1, [r3, #0] ldr ip, [r0, #0] ldr r2, .L2+4 str r4, [sp, #-4]! add r4, r1, #1 str r4, [r3, #0] str ip, [r2, #0] ldr ip, [r0, #0] add r4, r1, #2 str r4, [r3, #0] str ip, [r2, #4] ldr r0, [r0, #0] add r1, r1, #3 str r0, [r2, #8] str r1, [r3, #0] ldmfd sp!, {r4} bx lr
RE: Is this correct behaviour?
> -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: 06 September 2011 16:42 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: Is this correct behaviour? > > On Tue, Sep 6, 2011 at 5:30 PM, Bingfeng Mei wrote: > > Hi, > > I compile the following code with arm gcc 4.6 (x86 is the similar > with one of 4.7 snapshot). > > I noticed "a" is written to memory three times instead of being added > by 3 and written at the > > end. Doesn't restrict guarantee "a" won't be aliased to "p" so 3 > "a++" can be optimized? > > No it does not. Then how do I tell compiler that "a" is not aliased if I have to use global variable? > > > Thanks, > > Bingfeng Mei > > > > int a; > > int P[100]; > > void foo (int * restrict p) > > { > > P[0] = *p; > > a++; > > P[1] = *p; > > a++; > > P[2] = *p; > > a++; > > } > > > > ~/work/install-arm/bin/arm-elf-gcc tst.c -O2 -S -std=c99 > > > > foo: > > @ args = 0, pretend = 0, frame = 0 > > @ frame_needed = 0, uses_anonymous_args = 0 > > @ link register save eliminated. > > ldr r3, .L2 > > ldr r1, [r3, #0] > > ldr ip, [r0, #0] > > ldr r2, .L2+4 > > str r4, [sp, #-4]! > > add r4, r1, #1 > > str r4, [r3, #0] > > str ip, [r2, #0] > > ldr ip, [r0, #0] > > add r4, r1, #2 > > str r4, [r3, #0] > > str ip, [r2, #4] > > ldr r0, [r0, #0] > > add r1, r1, #3 > > str r0, [r2, #8] > > str r1, [r3, #0] > > ldmfd sp!, {r4} > > bx lr > > > >
Derive more alias information from named address space
Hi, I am trying to implement named address spaces for our target. In alias.c, I found the following piece of code several times. /* If we have MEMs referring to different address spaces (which can potentially overlap), we cannot easily tell from the addresses whether the references overlap. */ if (MEM_ADDR_SPACE (mem) != MEM_ADDR_SPACE (x)) return 1; I think we can do better with the existing target hook: - Target Hook: bool TARGET_ADDR_SPACE_SUBSET_P (addr_space_t superset, addr_space_t subset) If A is not a subset of B and B is not a subset of A, they are either disjoint or overlapping. According to the standard draft (section 3.1.3), "For any two address spaces, either the address spaces must be disjoint, they must be equivalent, or one must be a subset of the other. Other forms of overlapping are not permitted." Therefore, A & B can only be disjoint, i.e., not aliased to each other. We should be able to write: if (MEM_ADDR_SPACE (mem) != MEM_ADDR_SPACE (x)) { if (!targetm.addr_space.subset_p (MEM_ADDR_SPACE (mem), MEM_ADDR_SPACE (x)) && !targetm.addr_space.subset_p (MEM_ADDR_SPACE (x), MEM_ADDR_SPACE (mem))) return 0; else return 1; } Is this correct? Thanks, Bingfeng Mei
RE: Derive more alias information from named address space
Thanks. I will prepare a patch. Bingfeng > -Original Message- > From: Ulrich Weigand [mailto:uweig...@de.ibm.com] > Sent: 19 September 2011 12:56 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: Derive more alias information from named address space > > Bingfeng Mei wrote: > > > Therefore, A & B could only be disjoint, i.e., not aliased to each > other. > > We should be able to write: > > > > if (MEM_ADDR_SPACE (mem) != MEM_ADDR_SPACE (x)) > > { > > if (!targetm.addr_space.subset_p (MEM_ADDR_SPACE (mem), > MEM_ADDR_SPACE (x)) > >&& !targetm.addr_space.subset_p (MEM_ADDR_SPACE (x), > MEM_ADDR_SPACE (mem))) > > return 0; > > else > > return 1; > > } > > > > Is this correct? > > Yes, this looks correct to me ... > > Bye, > Ulrich > > -- > Dr. Ulrich Weigand > GNU Toolchain for Linux on System z and Cell BE > ulrich.weig...@de.ibm.com
Wrong documentation of TARGET_ADDR_SPACE_SUBSET_P
Hi, I noticed that the following description differs from how the spu & m32c ports use the hook. In the internals manual: [Target Hook] bool TARGET_ADDR_SPACE_SUBSET_P (addr_space_t superset, addr_space_t subset) Define this to return whether the subset named address space is contained within the superset named address space. Pointers to a named address space that is a subset of another named address space will be converted automatically without a cast if used together in arithmetic operations. Pointers to a superset address space can be converted to pointers to a subset address space via explicit casts. In the spu & m32c ports: m32c_addr_space_subset_p (addr_space_t subset, addr_space_t superset) spu_addr_space_subset_p (addr_space_t subset, addr_space_t superset) I believe the documentation is wrong: the first argument is the subset and the second one the superset. Should I submit a patch? Cheers, Bingfeng Mei
Not conform to c90?
Hello, According to http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Zero-Length.html#Zero-Length A zero-length array should have a length of 1 in c90. But I tried struct { char a[0]; } ZERO; void main() { int a[0]; printf ("size = %d\n", sizeof(ZERO)); } Compiled with gcc 4.7 ~/work/install-x86/bin/gcc test.c -O2 -std=c90 size = 0 I noticed the following statement in GCC document. "As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero." Does it mean GCC just does not conform to c90 in this respect? Thanks, Bingfeng Mei
RE: Not conform to c90?
Thank you very much. I misunderstood the document. Bingfeng > -Original Message- > From: Jonathan Wakely [mailto:jwakely@gmail.com] > Sent: 04 October 2011 12:48 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: Not conform to c90? > > On 4 October 2011 12:09, Bingfeng Mei wrote: > > Hello, > > According to http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Zero- > Length.html#Zero-Length > > A zero-length array should have a length of 1 in c90. > > I think you've misunderstood that page. You cannot have a zero-length > array in C90; what that page says is that in strict C90 you would have > to create an array of length 1 as a workaround. It's not saying > sizeof(char[0]) is 1. > > GNU C and C99 allow you to have a zero-length array. > > > But I tried > > > > struct > > { > > char a[0]; > > } ZERO; > > > > void main() > > { > > int a[0]; > > printf ("size = %d\n", sizeof(ZERO)); > > } > > > > Compiled with gcc 4.7 > > ~/work/install-x86/bin/gcc test.c -O2 -std=c90 > > > > size = 0 > > If you add -pedantic you'll discover that program isn't valid in C90. > > > I noticed the following statement in GCC document. > > "As a quirk of the original implementation of zero-length arrays, > > sizeof evaluates to zero." > > > > Does it mean GCC just does not conform to c90 in this respect? > > C90 doesn't allow zero length arrays, so you're trying to evaluate a > GNU extension in terms of a standard. I'm not sure what you expect to > happen.
RE: Porting 64-bit target on 32-bit host
I believe that a 64-bit target on a 32-bit host is not supported by GCC. You need a lot of hacking to do so. Check this thread. http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00908.html Bingfeng Mei > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of > Huang Ping > Sent: 10 October 2011 11:29 > To: gcc@gcc.gnu.org > Subject: Porting 64-bit target on 32-bit host > > Hi, all > > I'm porting a 64-bit target gcc on a 32-bit i386 host. I have set > need_64bit_hwint to yes in config.gcc. But it fails when building > libgcc. > Then I did a simple test, with a test case like this: > int test () > { > return 0; > } > > I use cc1 to compile it with -fdump-tree-all. The 003t.original dump file > shows: > { > return 1900544; > } > > I guess the compiler may take constant 0 as TImode, and read the > adjacent word in memory. But I'm not sure. Could someone give some > advice? > Thanks. > > Ping
RE: Porting 64-bit target on 32-bit host
Well, I just switched to 64-bit host and everything is fine. Bingfeng > -Original Message- > From: harder...@gmail.com [mailto:harder...@gmail.com] On Behalf Of > Huang Ping > Sent: 10 October 2011 16:55 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: Porting 64-bit target on 32-bit host > > 2011/10/10 Bingfeng Mei : > > I believe that 64-bit target on 32-bit host is not supported by GCC. > > You need a lot of hackings to do so. > > > > Check this thread. > > http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00908.html > > Then how did you solve your problem in this thread? > do many hackings on 32-bit host or change to 64-bit host?
Why doesn't GCC generate conditional move for COND_EXPR?
Hello, I noticed that COND_EXPR is not expanded to a conditional move as MIN_EXPR/MAX_EXPR are (assuming movmodecc is available). I wonder why not? I have a loop that fails tree vectorization but still contains a COND_EXPR from the tree ifcvt pass. In the end, the generated code is worse than if I hadn't turned -ftree-vectorize on. This is on our private port. Thanks, Bingfeng Mei
RE: Why doesn't GCC generate conditional move for COND_EXPR?
Thanks, Andrew. I also implemented a quick patch on our port (based on GCC 4.5). I noticed it produces better code now for our applications. Maybe eliminating control flow at an earlier stage helps other optimizing passes. Currently, the tree if-conversion pass is not turned on by default (only with tree vectorization or some other passes). Maybe it is worth making it the default at -O2 (for those processors that support conditional moves)? Cheers, Bingfeng > -Original Message- > From: Andrew Pinski [mailto:pins...@gmail.com] > Sent: 24 October 2011 17:20 > To: Richard Guenther > Cc: Bingfeng Mei; gcc@gcc.gnu.org > Subject: Re: Why doesn't GCC generate conditional move for COND_EXPR? > > On Mon, Oct 24, 2011 at 7:00 AM, Richard Guenther > wrote: > > On Mon, Oct 24, 2011 at 2:55 PM, Bingfeng Mei > wrote: > >> Hello, > >> I noticed that COND_EXPR is not expanded to conditional move > >> as MIN_EXPR/MAX_EXPR are (assuming movmodecc is available). > >> I wonder why not? > >> > >> I have some loop that fails tree vectorization, but still contains > >> COND_EXPR from tree ifcvt pass. In the end, the generated code > >> is worse than if I don't turned -ftree-vectorize on. This > >> is on our private port. > > > > Because nobody touched COND_EXPR expansion since ages. > > I have a patch which I will be submitting next week or so that does > this expansion correctly. In fact I have a few patches which improves > the generation of COND_EXPR in simple cases (in PHI-OPT). > > Thanks, > Andrew Pinski
SLP vectorizer on non-loop?
Hello, I have one example with two very similar loops. The cunrolli pass unrolls one loop completely but not the other, based on slightly different cost estimations. The not-unrolled loop gets SLP-vectorized, then unrolled by the "cunroll" pass, whereas the other, unrolled loop cannot be vectorized since it is not a loop any more. In the end, there is a big difference in performance between the two loops. My question is why SLP vectorization has to be performed on a loop (it is a sub-pass under pass_tree_loop). Conceptually, can't it be done on any basic block? Our port is still stuck at 4.5, but I checked 4.7 and it seems still the same. I also checked the functions in tree-vect-slp.c. They use a lot of loop_vinfo structures, but in some places they check whether loop_vinfo exists and use it or an alternative. I tried to add an extra SLP pass after pass_tree_loop, but it didn't work. I wonder how easy it would be to make SLP work on non-loops. Thanks, Bingfeng Mei Broadcom UK void foo (int *__restrict__ temp_hist_buffer, int * __restrict__ p_hist_buff, int *__restrict__ p_input) { int i; for(i=0;i<4;i++) temp_hist_buffer[i]=p_hist_buff[i]; for(i=0;i<4;i++) temp_hist_buffer[i+4]=p_input[i]; }
RE: SLP vectorizer on non-loop?
Ira, Thank you very much for quick answer. I will check 4.7 x86-64 to see difference from our port. Is there significant change between 4.5 & 4.7 regarding SLP? Cheers, Bingfeng > -Original Message- > From: Ira Rosen [mailto:i...@il.ibm.com] > Sent: 01 November 2011 11:13 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: SLP vectorizer on non-loop? > > > > gcc-ow...@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM: > > > Hello, > > I have one example with two very similar loops. cunrolli pass > > unrolls one loop completely > > but not the other based on slightly different cost estimations. The > > not-unrolled loop > > get SLP-vectorized, then unrolled by "cunroll" pass, whereas the > > other unrolled loop cannot > > be vectorized since it is not a loop any more. In the end, there is > > big difference of > > performance between two loops. > > > > Here what I see with the current trunk on x86_64 with -O3 (with the two > loops split into different functions): > > The first loop, the one that doesn't get unrolled by cunrolli, gets > loop > vectorized with -fno-vect-cost-model. With the cost model the > vectorization > fails because the number of iterations is not sufficient (the > vectorizer > tries to apply loop peeling in order to align the accesses), the loop > gets > later unrolled by cunroll and the basic block gets vectorized by SLP. > > The second loop, unrolled by cunrolli, also gets vectorized by SLP. > > The *.optimized dumps look similar: > > > : > vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)]; > MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48; > return; > > > : > vect_var_.7_57 = MEM[(int *)p_input_10(D)]; > MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57; > return; > > > > My question is why SLP vectorization has to be performed on loop (it > > is a sub-pass under > > pass_tree_loop). Conceptually, cannot it be done on any basic block? > > Our port are still > > stuck at 4.5. But I checked 4.7, it seems still the same. 
I also > > checked functions in > > tree-vect-slp.c. They use a lot of loop_vinfo structures. But in > > some places it checks > > whether loop_vinfo exists to use it or other alternative. I tried to > > add an extra SLP > > pass after pass_tree_loop, but it didn't work. I wonder how easy to > > make SLP works for > > non-loop. > > SLP vectorization works both on loops (in vectorize pass) and on basic > blocks (in slp-vectorize pass). > > Ira > > > > > Thanks, > > Bingfeng Mei > > > > Broadcom UK > > > > void foo (int *__restrict__ temp_hist_buffer, > > int * __restrict__ p_hist_buff, > > int *__restrict__ p_input) > > { > > int i; > > for(i=0;i<4;i++) > > temp_hist_buffer[i]=p_hist_buff[i]; > > > > for(i=0;i<4;i++) > > temp_hist_buffer[i+4]=p_input[i]; > > > > } > > > > >
Bug in Tree to RTL expansion?
Hi, I experienced a code generation bug with 4.5 (yes, our port is still stuck at 4.5.4). Since the concerned code is full of our target-specific code, it is not easy to demonstrate the error with x86 or ARM. Here is what happens in expanding process. The following is a piece of optimized tree code to be expanded to RTL. # ptr_h2_493 = PHI ... D.13598_218 = MEM[base: ptr_h2_493, offset: 8]; D.13599_219 = (long int) D.13598_218; ... ptr_h2_310 = ptr_h2_493 + 16; ... D.13634_331 = D.13599_219 * D.13538_179; cor3_332 = D.13635_339 + D.13634_331; ... When expanding to RTL, the coalescing algorithm will coalesce ptr_h2_310 & ptr_h2_493 to one register: ;; ptr_h2_310 = ptr_h2_493 + 16; (insn 364 363 0 (set (reg/v/f:SI 282 [ ptr_h2 ]) (plus:SI (reg/v/f:SI 282 [ ptr_h2 ]) (const_int 16 [0x10]))) -1 (nil)) GCC 4.5 (fp_gcc 2.3.x) doesn't expand statements one-by-one as GCC 4.4 (fp_gcc 2.2.x) does. So when GCC expands the following statement, cor3_332 = D.13635_339 + D.13634_331; it then in turn expands each operand by going back to expand previous relevant statements. D.13598_218 = MEM[base: ptr_h2_493, offset: 8]; D.13599_219 = (long int) D.13598_218; ... D.13634_331 = D.13599_219 * D.13538_179; The problem is that compiler doesn't take account into fact that ptr_h2_493|ptr_h2_310 has been modified. Still expand the above statement as it is. (insn 380 379 381 (set (reg:HI 558) (mem:HI (plus:SI (reg/v/f:SI 282 [ ptr_h2 ]) (const_int 8 [0x8])) [0 S2 A8])) -1 (nil)) ... (insn 382 381 383 (set (reg:SI 557) (mult:SI (sign_extend:SI (reg:HI 558)) (sign_extend:SI (reg:HI 559 -1 (nil)) This seems to me quite a basic issue. I cannot believe testsuites and other applications do not expose more errors. What I am not sure is whether the coalescing algorithm or the expanding procedure is wrong here. If ptr_h2_493 and ptr_h2_310 are not coalesced to use the same register, it should be correctly compiled. Or expanding procedure checks data flow, it should be also OK. 
Which one should I look at? Or is this a known issue and fixed in 4.6/4.7? Thanks, Bingfeng Mei
RE: Bug in Tree to RTL expansion?
Richard, Thanks. -fno-tree-ter does work around the problem. I did look at the info about coalescing, which doesn't give much info. I think I have to take a closer look at TER and coalescing algorithm. Regards, Bingfeng > -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: 08 December 2011 12:10 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org; Michael Matz > Subject: Re: Bug in Tree to RTL expansion? > > On Thu, Dec 8, 2011 at 12:34 PM, Bingfeng Mei wrote: > > Hi, > > I experienced a code generation bug with 4.5 (yes, our > > port is still stuck at 4.5.4). Since the concerned code > > is full of our target-specific code, it is not easy > > to demonstrate the error with x86 or ARM. > > > > Here is what happens in expanding process. The following is a > > piece of optimized tree code to be expanded to RTL. > > > > # ptr_h2_493 = PHI > > ... > > D.13598_218 = MEM[base: ptr_h2_493, offset: 8]; > > D.13599_219 = (long int) D.13598_218; > > ... > > ptr_h2_310 = ptr_h2_493 + 16; > > ... > > D.13634_331 = D.13599_219 * D.13538_179; > > cor3_332 = D.13635_339 + D.13634_331; > > ... > > > > When expanding to RTL, the coalescing algorithm will coalesce > > ptr_h2_310 & ptr_h2_493 to one register: > > > > ;; ptr_h2_310 = ptr_h2_493 + 16; > > (insn 364 363 0 (set (reg/v/f:SI 282 [ ptr_h2 ]) > > (plus:SI (reg/v/f:SI 282 [ ptr_h2 ]) > > (const_int 16 [0x10]))) -1 (nil)) > > > > GCC 4.5 (fp_gcc 2.3.x) doesn't expand statements one-by-one > > as GCC 4.4 (fp_gcc 2.2.x) does. So when GCC expands the > > following statement, > > > > cor3_332 = D.13635_339 + D.13634_331; > > > > it then in turn expands each operand by going back to > > expand previous relevant statements. > > > > D.13598_218 = MEM[base: ptr_h2_493, offset: 8]; > > D.13599_219 = (long int) D.13598_218; > > ... > > D.13634_331 = D.13599_219 * D.13538_179; > > > > The problem is that compiler doesn't take account into fact that > > ptr_h2_493|ptr_h2_310 has been modified. 
Still expand the above > > statement as it is. > > > > (insn 380 379 381 (set (reg:HI 558) > > (mem:HI (plus:SI (reg/v/f:SI 282 [ ptr_h2 ]) > > (const_int 8 [0x8])) [0 S2 A8])) -1 (nil)) > > ... > > (insn 382 381 383 (set (reg:SI 557) > > (mult:SI (sign_extend:SI (reg:HI 558)) > > (sign_extend:SI (reg:HI 559 -1 (nil)) > > > > This seems to me quite a basic issue. I cannot believe testsuites > > and other applications do not expose more errors. > > > > What I am not sure is whether the coalescing algorithm or the > expanding > > procedure is wrong here. If ptr_h2_493 and ptr_h2_310 are not > coalesced > > to use the same register, it should be correctly compiled. Or > expanding > > procedure checks data flow, it should be also OK. Which one should I > > I look at? Or is this a known issue and fixed in 4.6/4.7? > > TER should not happen for D.13598_218 = MEM[base: ptr_h2_493, offset: > 8]; because it conflicts with the coalesce. Thus, -fno-tree-ter > should > fix your issue. You may look at the -fdump-rtl-expand-details dump > to learn about the coalescing decisions. > > I'm not sure we fixed a bug that looks like the above. With 4.5 > the 'MEM' is a TARGET_MEM_REF tree. > > Micha should be most familiar with evolutions in this code. > > Richard. > > > Thanks, > > Bingfeng Mei > >
RE: Bug in Tree to RTL expansion?
Michael, Thanks for your help. I struggled to understand tree-ssa-ter.c. Please see questions below. I also tried the tree-ssa-ter.c from the trunk. Same results. Bingfeng > -Original Message- > From: Michael Matz [mailto:m...@suse.de] > Sent: 08 December 2011 13:50 > To: Richard Guenther > Cc: Bingfeng Mei; gcc@gcc.gnu.org > Subject: Re: Bug in Tree to RTL expansion? > > Hi, > > On Thu, 8 Dec 2011, Richard Guenther wrote: > > > > What I am not sure is whether the coalescing algorithm or the > > > expanding procedure is wrong here. > > The forwarding of _218 is wrong. TER shouldn't have marked it as being > able to forward (check the expand-detailed dump for the "Replacing > Expressions" section). Obviously it does think it can forward it, so > something is wrong in tree-ssa-ter.c. > > If you can't come up with a testcase that fails with some available > cross > compiler (though I wonder why, the tree-ssa parts of the compiler are > not > that target dependend, so maybe you can show similar forwarding with an > adjusted testcase for something publically available) you have to debug > it > yourself (I'm right now not aware of any known bug in 4.5 regarding > this > particular situation). > > There should be a call to kill_expr on the statement > ptr_h2_310 = ptr_h2_493 + 16; I tracked into how TER is executed. kill_expr is called but the kill_list are already all empty because mark_replaceable -> finished_with_expr clear all the kill_list. In addition, once replaceable_expressions is set by mark_replaceable. It doesn't seem it is ever cleared due to kill_expr or any other function. replaceable_expression is the only data structure passed to expand pass. > which should have killed the expression MEM[ptr_h2_493] (and hence _218) > from the available expressions. > > > Ciao, > Michael.
RE: Bug in Tree to RTL expansion?
OK, don't bother. I think I understand TER and my issue now. It is from a misfix of widening multiplication, which I found there is a new pass doing this from 4.6. I am going to back port that to my target. Thanks, Bingfeng > -Original Message- > From: Bingfeng Mei > Sent: 09 December 2011 14:34 > To: 'Michael Matz'; Richard Guenther > Cc: gcc@gcc.gnu.org > Subject: RE: Bug in Tree to RTL expansion? > > Michael, > Thanks for your help. I struggled to understand tree-ssa-ter.c. > Please see questions below. > > I also tried the tree-ssa-ter.c from the trunk. Same results. > > Bingfeng > > > -Original Message- > > From: Michael Matz [mailto:m...@suse.de] > > Sent: 08 December 2011 13:50 > > To: Richard Guenther > > Cc: Bingfeng Mei; gcc@gcc.gnu.org > > Subject: Re: Bug in Tree to RTL expansion? > > > > Hi, > > > > On Thu, 8 Dec 2011, Richard Guenther wrote: > > > > > > What I am not sure is whether the coalescing algorithm or the > > > > expanding procedure is wrong here. > > > > The forwarding of _218 is wrong. TER shouldn't have marked it as > being > > able to forward (check the expand-detailed dump for the "Replacing > > Expressions" section). Obviously it does think it can forward it, so > > something is wrong in tree-ssa-ter.c. > > > > If you can't come up with a testcase that fails with some available > > cross > > compiler (though I wonder why, the tree-ssa parts of the compiler are > > not > > that target dependend, so maybe you can show similar forwarding with > an > > adjusted testcase for something publically available) you have to > debug > > it > > yourself (I'm right now not aware of any known bug in 4.5 regarding > > this > > particular situation). > > > > There should be a call to kill_expr on the statement > > ptr_h2_310 = ptr_h2_493 + 16; > > I tracked into how TER is executed. > kill_expr is called but the kill_list are already all empty because > > mark_replaceable -> finished_with_expr clear all the kill_list. 
> > In addition, once replaceable_expressions is set by mark_replaceable. > It doesn't seem > it is ever cleared due to kill_expr or any other function. > replaceable_expression > is the only data structure passed to expand pass. > > > > which should have killed the expression MEM[ptr_h2_493] (and hence > _218) > > from the available expressions. > > > > > > Ciao, > > Michael.
libtool error in building GCC
Hello, I am experiencing the following error when building TRUNK version of our port. I am not familar with libtool. In 4.4, GCC produces its own libtools under libstdc++v3 directory and other similar directories. But I cannot track how the libtool is generated. Even I remove libtool under libstdc++-v3 directory and rerun make and it cannot regenerate libtool again. Examining config.log, config.status and Makefile doesn't help me either. So I really get lost what is going wrong in 4.5 trunk. Any help is greatly appreciated. Thanks, Bingfeng Mei /bin/sh ../libtool --tag CXX --tag disable-shared --mode=compile /projects/firepath/tools/work/bmei/gcc-head/build/./gcc/xgcc -shared-libgcc -B/projects/firepath/tools/work/bmei/gcc-head/build/./gcc -nostdinc++ -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/src -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/src/.libs -nostdinc -B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/newlib/ -isystem /projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/newlib/targ-include -isystem /projects/firepath/tools/work/bmei/gcc-head/src/newlib/libc/include -B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libgloss/firepath -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libgloss/libnosys -L/projects/firepath/tools/work/bmei/gcc-head/src/libgloss/firepath -B/home/bmei/work/gcc-head/install/firepath-elf/bin/ -B/home/bmei/work/gcc-head/install/firepath-elf/lib/ -isystem /home/bmei/work/gcc-head/install/firepath-elf/include -isystem /home/bmei/work/gcc-head/install/firepath-elf/sys-include -I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3/../gcc -I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/include/firepath-elf -I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/include -I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3/libsupc++ 
-fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -g -O2 -c -o array_type_info.lo ../../../../src/libstdc++-v3/libsupc++/array_type_info.cc /bin/sh: ../libtool: No such file or directory make[4]: *** [array_type_info.lo] Error 127 make[4]: Leaving directory `/projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf/libstdc++-v3/libsupc++ The following is my configuration command: CC="gcc -m32" CFLAGS="-g" ../src/configure --prefix=/home/bmei/work/gcc-head/install --enable-languages=c,c++ --disable-nls --target=firepath-elf --with-newlib --with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1 --with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0 --disable-libssp --with-headers --enable-checking --enable-multilib
RE: libtool error in building GCC
Just ignore my previous mail. I find the error is because we failed to import the new 4.5 directory libstdc++v3/python to our repository. > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Bingfeng Mei > Sent: 21 July 2009 12:41 > To: gcc@gcc.gnu.org > Subject: libtool error in building GCC > > Hello, > I am experiencing the following error when building TRUNK > version of our port. > I am not familar with libtool. In 4.4, GCC produces its own > libtools under > libstdc++v3 directory and other similar directories. But I > cannot track > how the libtool is generated. Even I remove libtool under > libstdc++-v3 directory > and rerun make and it cannot regenerate libtool again. > Examining config.log, > config.status and Makefile doesn't help me either. So I > really get lost what > is going wrong in 4.5 trunk. Any help is greatly appreciated. > > Thanks, > Bingfeng Mei > > > /bin/sh ../libtool --tag CXX --tag disable-shared > --mode=compile > /projects/firepath/tools/work/bmei/gcc-head/build/./gcc/xgcc > -shared-libgcc > -B/projects/firepath/tools/work/bmei/gcc-head/build/./gcc > -nostdinc++ > -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e > lf/libstdc++-v3/src > -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e > lf/libstdc++-v3/src/.libs -nostdinc > -B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e > lf/newlib/ -isystem > /projects/firepath/tools/work/bmei/gcc-head/build/firepath-elf > /newlib/targ-include -isystem > /projects/firepath/tools/work/bmei/gcc-head/src/newlib/libc/in > clude > -B/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e > lf/libgloss/firepath > -L/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e > lf/libgloss/libnosys > -L/projects/firepath/tools/work/bmei/gcc-head/src/libgloss/fir > epath -B/home/bmei/work/gcc-head/install/firepath-elf/bin/ > -B/home/bmei/work/gcc-head/install/firepath-elf/lib/ -isystem > 
/home/bmei/work/gcc-head/install/firepath-elf/include > -isystem > /home/bmei/work/gcc-head/install/firepath-elf/sys-include > -I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3 > /../gcc > -I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e > lf/libstdc++-v3/include/firepath-elf > -I/projects/firepath/tools/work/bmei/gcc-head/build/firepath-e > lf/libstdc++-v3/include > -I/projects/firepath/tools/work/bmei/gcc-head/src/libstdc++-v3 > /libsupc++ -fno-implicit-templates -Wall -Wextra > -Wwrite-strings -Wcast-qual -fdiagnostics-show-location=once > -ffunction-sections -fdata-sections -g -O2 -c -o > array_type_info.lo > ../../../../src/libstdc++-v3/libsupc++/array_type_info.cc > /bin/sh: ../libtool: No such file or directory > make[4]: *** [array_type_info.lo] Error 127 > make[4]: Leaving directory > `/projects/firepath/tools/work/bmei/gcc-head/build/firepath-el > f/libstdc++-v3/libsupc++ > > > The following is my configuration command: > > CC="gcc -m32" CFLAGS="-g" ../src/configure > --prefix=/home/bmei/work/gcc-head/install > --enable-languages=c,c++ --disable-nls --target=firepath-elf > --with-newlib > --with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2 > .4.1 > --with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3 > .0 --disable-libssp --with-headers --enable-checking > --enable-multilib > > >
RE: PRE_DEC, POST_INC
I asked a similar question regarding PRE_INC/POST_INC quite a while ago, and there was quite a lot of discussion. I haven't checked whether the situation has changed. http://gcc.gnu.org/ml/gcc/2007-11/msg00032.html Bingfeng Mei > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Ramana Radhakrishnan > Sent: 07 August 2009 14:11 > To: Florent Defay > Cc: gcc@gcc.gnu.org > Subject: Re: PRE_DEC, POST_INC > > On Fri, Aug 7, 2009 at 1:33 PM, Florent > Defay wrote: > > Hi, > > > > I am working on a new port. > > > > The target machine supports post-increment and pre-decrement > > addressing modes. These modes are twice as fast as indexed mode. > > It is important for us that GCC consider them well. > > > GCC does support generation of pre{post}-inc {dec}. GCC's auto-inc > detector works at a basic block level and attempts to generate > auto-inc operations within a basic block. Look at auto-inc-dec.c for > more information. It is an area which could do with some improvement > and work; however, no one's found the time to do it yet. > > > > > I wrote emails to gcc-help and I was told that GCC was not > so good at > > pre/post-dec/increment since few targets support these modes. > > > > I would like to know if there is a good way to make pre-dec and > > post-inc modes have more priority than indexed mode. > > Is there current work for dealing with this issue ? > > I am assuming you already have basic generation of auto-incs and you > have your definitions for legitimate{legitimize}_address all set up > correctly. > > In this case you could start by tweaking your address costs macros. > Getting that right can help you get going in the right direction with > the current state of the art. In a previous life while maintaining a > private port of GCC I've dabbled with a few patches posted by Joern > Reneccke with regards to auto-inc-dec that worked well for me in > improving code generation on some of the benchmarks. 
You should be > able to get them out using Google. > > There are a number of bugzilla entries in the database that cover a > number of cases for auto-inc generation and some ideas on what can be > done to improve this. You might be better off searching in that as > well. One of the problems up to 4.3 was that the ivopts and the loop > optimizers didn't care too much about auto-increment addressing and > thereby pessimized this in favour of using index addressing. There > have been a few patches that were being discussed in the recent > past by Bernd Schmidt and Zdenek attempting to address auto-inc > generation for loop ivopts but I'm not sure if these have gone into > trunk yet. > > Hope this helps. > > > cheers > Ramana > >
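As a concrete starting point for the "tweak your address costs" advice above, here is a hedged sketch of a TARGET_ADDRESS_COST implementation for a hypothetical port where auto-inc addressing is twice as fast as indexed addressing. The port name and the cost values are illustrative, not from the thread, and the hook's exact prototype varies between releases (later GCCs add mode and address-space arguments), so check target.def for your tree:

```c
/* Sketch only: lower cost makes an addressing mode more attractive
   to the RTL passes, steering selection toward auto-inc forms.  */
static int
hypo_address_cost (rtx addr, bool speed ATTRIBUTE_UNUSED)
{
  switch (GET_CODE (addr))
    {
    case POST_INC:
    case PRE_DEC:
      return 1;    /* cheap: encourage auto-inc selection */
    case PLUS:     /* base + index or base + offset */
      return 2;    /* twice the cost, matching the hardware */
    default:
      return 1;
    }
}

#undef TARGET_ADDRESS_COST
#define TARGET_ADDRESS_COST hypo_address_cost
```

With the cost asymmetry in place, auto-inc-dec.c and ivopts have a reason to prefer the post-increment forms within a basic block.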
RE: IRA undoing scheduling decisions
I can confirm this too in our private port, though in a slightly different form. r2 = 7 [r0] = r2 r0 = 4 [r1] = r0 Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Charles J. Tabony > Sent: 25 August 2009 00:56 > To: gcc@gcc.gnu.org > Subject: IRA undoing scheduling decisions > > Fellow GCC developers, > > I am seeing a performance regression on the port I maintain, > and I would appreciate some pointers. > > When I compile the following code > > void f(int *x, int *y){ > *x = 7; > *y = 4; > } > > with GCC 4.3.2, I get the desired sequence of instructions. > I'll call it sequence A: > > r0 = 7 > r1 = 4 > [x] = r0 > [y] = r1 > > When I compile the same code with GCC 4.4.0, I get a sequence > that is lower performance for my target machine. I'll call > it sequence B: > > r0 = 7 > [x] = r0 > r0 = 4 > [y] = r0 > > I see the same difference between GCC 4.3.2 and 4.4.0 when > compiling for PowerPC, MIPS, ARM, and FR-V. > > When I look at the RTL dumps, I see that the first scheduling > pass always produces sequence A, across all targets and GCC > versions I tried. In GCC 4.3.2, sequence A persists > throughout the remainder of compilation. In GCC 4.4.0, for > every target, the .ira dump shows that the sequence of > instructions has reverted back to sequence B. > > Are there any machine-dependent parameters that I can tune to > prevent IRA from transforming sequence A into sequence B? If > not, where can I add a hook to allow this decision to be > tuned per machine? > > Is there any more information you would like me to provide? > > Thank you, > Charles J. Tabony > > >
Restrict qualifier doesn't work in TRUNK.
Hello, I notice the restrict qualifier doesn't work properly in trunk any more. In the following example, the memory accesses of a, b, c don't have different alias set attached any more. Instead, the generic alias set of 2 is used for all accesses. I remember alias analysis part had some big changes since 4.5 branched out. Is it still in unstable phase, or is there some new hooks I should use for my port? Cheers, Bingfeng Mei void foo (int * __restrict__ a, int * __restrict__ b, int * __restrict__ c) { int i; for(i = 0; i < 100; i+=4) { a[i] = b[i] * c[i]; a[i+1] = b[i+1] * c[i+1]; a[i+2] = b[i+2] * c[i+2]; a[i+3] = b[i+3] * c[i+3]; } } Before expand: foo (int * restrict a, int * restrict b, int * restrict c) { long unsigned int D.3213; long unsigned int ivtmp.32; long unsigned int ivtmp.31; long unsigned int ivtmp.29; int D.3168; int D.3167; int D.3165; int D.3160; int D.3159; int D.3157; int D.3152; int D.3151; int D.3149; int D.3143; int D.3142; int D.3140; # BLOCK 2 freq:385 # PRED: ENTRY [100.0%] (fallthru,exec) ivtmp.29_77 = (long unsigned int) b_9(D); ivtmp.31_74 = (long unsigned int) c_14(D); ivtmp.32_83 = (long unsigned int) a_5(D); D.3213_85 = ivtmp.32_83 + 400; # SUCC: 3 [100.0%] (fallthru,exec) # BLOCK 3 freq:9615 # PRED: 3 [96.0%] (true,exec) 2 [100.0%] (fallthru,exec) # ivtmp.29_79 = PHI # ivtmp.31_76 = PHI # ivtmp.32_73 = PHI D.3140_11 = MEM[index: ivtmp.29_79]; D.3142_16 = MEM[index: ivtmp.31_76]; D.3143_17 = D.3142_16 * D.3140_11; MEM[index: ivtmp.32_73] = D.3143_17; D.3149_26 = MEM[index: ivtmp.29_79, offset: 4]; D.3151_31 = MEM[index: ivtmp.31_76, offset: 4]; D.3152_32 = D.3151_31 * D.3149_26; MEM[index: ivtmp.32_73, offset: 4] = D.3152_32; D.3157_41 = MEM[index: ivtmp.29_79, offset: 8]; D.3159_46 = MEM[index: ivtmp.31_76, offset: 8]; D.3160_47 = D.3159_46 * D.3157_41; MEM[index: ivtmp.32_73, offset: 8] = D.3160_47; D.3165_56 = MEM[index: ivtmp.29_79, offset: 12]; D.3167_61 = MEM[index: ivtmp.31_76, offset: 12]; D.3168_62 = D.3167_61 * D.3165_56; 
MEM[index: ivtmp.32_73, offset: 12] = D.3168_62; ivtmp.29_78 = ivtmp.29_79 + 16; ivtmp.31_75 = ivtmp.31_76 + 16; ivtmp.32_82 = ivtmp.32_73 + 16; if (ivtmp.32_82 != D.3213_85) goto ; else goto ; # SUCC: 3 [96.0%] (true,exec) 4 [4.0%] (false,exec) # BLOCK 4 freq:385 # PRED: 3 [4.0%] (false,exec) return; # SUCC: EXIT [100.0%] } Part of RTL ... insn 40 39 41 4 sms-6.c:11 (set (reg:SI 157) (mem:SI (reg:SI 151 [ ivtmp.31 ]) [2 S4 A32])) -1 (nil)) (insn 41 40 42 4 sms-6.c:11 (set (reg:SI 158) (mem:SI (reg:SI 152 [ ivtmp.29 ]) [2 S4 A32])) -1 (nil)) (insn 42 41 43 4 sms-6.c:11 (set (reg:SI 159) (mult:SI (reg:SI 157) (reg:SI 158))) -1 (nil)) (insn 43 42 44 4 sms-6.c:11 (set (mem:SI (reg:SI 150 [ ivtmp.32 ]) [2 S4 A32]) (reg:SI 159)) -1 (nil)) (insn 44 43 45 4 sms-6.c:12 (set (reg:SI 160) (mem:SI (plus:SI (reg:SI 151 [ ivtmp.31 ]) (const_int 4 [0x4])) [2 S4 A32])) -1 (nil)) (insn 45 44 46 4 sms-6.c:12 (set (reg:SI 161) (mem:SI (plus:SI (reg:SI 152 [ ivtmp.29 ]) (const_int 4 [0x4])) [2 S4 A32])) -1 (nil)) (insn 46 45 47 4 sms-6.c:12 (set (reg:SI 162) (mult:SI (reg:SI 160) (reg:SI 161))) -1 (nil)) (insn 47 46 48 4 sms-6.c:12 (set (mem:SI (plus:SI (reg:SI 150 [ ivtmp.32 ]) (const_int 4 [0x4])) [2 S4 A32]) (reg:SI 162)) -1 (nil)) (insn 48 47 49 4 sms-6.c:13 (set (reg:SI 163) (mem:SI (plus:SI (reg:SI 151 [ ivtmp.31 ]) (const_int 8 [0x8])) [2 S4 A32])) -1 (nil)) (insn 49 48 50 4 sms-6.c:13 (set (reg:SI 164) (mem:SI (plus:SI (reg:SI 152 [ ivtmp.29 ]) (const_int 8 [0x8])) [2 S4 A32])) -1 (nil)) ...
RE: help on - how to specify architecture information to gcc
You should check how to construct DFA for your target architecture. Look at "Specifying processor pipeline description" in GCC internal manual and checked out how other architectures do it. -Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of ddmetro > Sent: 21 September 2009 12:52 > To: gcc@gcc.gnu.org > Subject: help on - how to specify architecture information to gcc > > > Hi All, > Our project is to optimize instruction scheduling in gcc. It > requires us to specify architecture information > (basically number of cycles per instruction, stall and branch delays) > to gcc, to optimize structural hazard detection. > > Problem: Is there any specific format in which we can specify this > information to gcc? Is it possible to embed this additional > architecture specific detail, in .md files? > > Target language for which optimization is being done: C > Target machine architecture: i686 > GCC version: 4.4.1 > > If none of the above options work, we were planning to put > the information manually in a file and make gcc read it each time it > loads. Any suggestions/comments on this approach? > > Couldn't find a related thread. Hence a new one. > > Thanking All, > - Dhiraj. > -- > View this message in context: > http://www.nabble.com/help-on---how-to-specify-architecture-in > formation-to-gcc-tp25530300p25530300.html > Sent from the gcc - Dev mailing list archive at Nabble.com. > > >
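To make the pointer above concrete: the pipeline information goes directly into the target's .md file, so no separate input file is needed. A minimal, hypothetical description (the automaton, unit, and `type` attribute values are illustrative, not from any real port) looks like:

```lisp
;; One automaton with an issue unit and a memory unit.
(define_automaton "hypo_pipeline")
(define_cpu_unit "hypo_issue" "hypo_pipeline")
(define_cpu_unit "hypo_mem" "hypo_pipeline")

;; ALU ops: 1-cycle latency, occupy the issue unit for one cycle.
(define_insn_reservation "hypo_alu" 1
  (eq_attr "type" "arith")
  "hypo_issue")

;; Loads: 3-cycle latency; issue one cycle, then hold the memory
;; unit for two more.  The scheduler derives stall/hazard info
;; from these reservations automatically.
(define_insn_reservation "hypo_load" 3
  (eq_attr "type" "load")
  "hypo_issue, hypo_mem*2")
```

Each insn pattern then needs a matching `type` attribute; genautomata turns these reservations into the DFA the scheduler queries, so there is no need for a hand-maintained side file.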
Issues of the latest trunk with LTO merges
Hello, I ran into an issue with the LTO merges when updating to current trunk. The problem is that my target calls a few functions/uses some data structures in the gcc directory: c_language, pragma_lex, c_register_pragma, etc. gcc -m32 -g -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE -W -Wall -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Wold-style-definition -Wc++-compat -fno-common -DHAVE_CONFIG_H -o lto1 \ lto/lto-lang.o lto/lto.o lto/lto-elf.o attribs.o main.o tree-browser.o libbackend.a ../libcpp/libcpp.a ../libdecnumber/libdecnumber.a -L/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/lib -L/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/lib -lmpfr -lgmp -rdynamic -ldl -L../zlib -lz -L/projects/firepath/tools/work/bmei/packages/libelf/lib -lelf ../libcpp/libcpp.a ../libiberty/libiberty.a ../libdecnumber/libdecnumber.a -lelf When compiling lto1 in the above step, I consequently get many linking errors. I tried to add some extra object files like c-common.o, c-pragma.o, etc., into lto/Make-lang.in, but more linking errors are produced. One problem is that the lto code redefines some data that exist in the main code: flag_no_builtin, flag_isoc99, lang_hooks, etc., which prevents it from linking with object files in the main directory. What is the clean solution for this? Thanks in advance. Cheers, Bingfeng Mei
RE: Issues of the latest trunk with LTO merges
Richard, Doesn't REGISTER_TARGET_PRAGMAS need to call c_register_pragma, etc, if we want to specify target-specific pragma? It becomes part of libbackend.a, which is linked to lto1. One solution I see is to put them into a separate file so the linker won't produce undefined references when they are not actually used by lto1. Thanks, Bingfeng > -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: 12 October 2009 15:34 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: Issues of the latest trunk with LTO merges > > On Mon, Oct 12, 2009 at 4:31 PM, Bingfeng Mei > wrote: > > Hello, > > I ran into an issue with the LTO merges when updating to > current trunk. > > The problem is that my target calls a few functions/uses > some data structures > > in the gcc directory: c_language, paragma_lex, > c_register_pragma, etc. > > > > gcc -m32 -g -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE -W -Wall > -Wwrite-strings -Wcast-qual -Wstrict-prototypes > -Wmissing-prototypes -Wmissing-format-attribute -pedantic > -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings > -Wold-style-definition -Wc++-compat -fno-common > -DHAVE_CONFIG_H -o lto1 \ > > lto/lto-lang.o lto/lto.o lto/lto-elf.o > attribs.o main.o tree-browser.o libbackend.a > ../libcpp/libcpp.a ../libdecnumber/libdecnumber.a > -L/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/lib > -L/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/lib > -lmpfr -lgmp -rdynamic -ldl -L../zlib -lz > -L/projects/firepath/tools/work/bmei/packages/libelf/lib > -lelf ../libcpp/libcpp.a ../libiberty/libiberty.a > ../libdecnumber/libdecnumber.a -lelf > > > > When compiling for lto1 in above step, I have many linking > errors consequently. > > I tried to add some extra object files like c-common.o, > c-pragma.o, etc into > > lto/Make-lang.in. More linking errors are produced. 
One > problem is that lto > > code redefines some data exist in the main code: > flag_no_builtin, flag_isoc99 > > lang_hooks, etc, which prevent it from linking with object > files in main directory. > > > > What is the clean solution for this? Thanks in advance. > > You should not use C frontend specific stuff when not building > the C frontend. > > Richard. > > > Cheers, > > Bingfeng Mei > > > > > > > >
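One way to read Richard's advice in build terms: keep everything that calls C-front-end entry points (c_register_pragma, pragma_lex, ...) in a separate target file that is linked only into the C-family front ends, so that libbackend.a (and therefore lto1) never references them. A hedged sketch of that split; the file name, function names, and pragma name are illustrative, not from the actual port:

```c
/* myport-c.c -- listed in c_target_objs/cxx_target_objs in config.gcc,
   NOT part of libbackend.a, so lto1 never links it.  */
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "tm.h"
#include "c-pragma.h"

static void
myport_pragma_ghs_section (struct cpp_reader *reader ATTRIBUTE_UNUSED)
{
  /* ... parse the pragma with pragma_lex () ... */
}

/* Called from REGISTER_TARGET_PRAGMAS, which is only expanded while
   building the C-family front ends.  */
void
myport_register_pragmas (void)
{
  c_register_pragma (0, "ghs_section", myport_pragma_ghs_section);
}
```

The i386 port uses exactly this c_target_objs mechanism (i386-c.c) to keep front-end-specific code out of the backend library.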
LTO question
Hello, I just had my first taste of the latest LTO merge on our port. The compiler is configured with LTO enabled and built correctly. I tried the following example: a.c extern void foo(int); int main() { foo(20); return 1; } b.c #include <stdio.h> void foo(int c) { printf("Hello world: %d\n", c); } compiled with: firepath-elf-gcc -flto a.c b.c -save-temps -O2 I expected foo to be inlined into main, but it is not. Both main and foo are present in a.s, while b.s contains the LTO code. Did I miss something here? Are there new hooks I should specify or tune for LTO? I checked the up-to-date internal manual and found nothing. Thanks, Bingfeng Mei
RE: LTO question
Thanks. It works. I thought -fwhole-program was used with --combine and that both had been replaced by -flto. Now it seems that -flto is the equivalent of --combine, and -fwhole-program is still important. Bingfeng > -Original Message- > From: Diego Novillo [mailto:dnovi...@google.com] > Sent: 13 October 2009 14:30 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: LTO question > > On Tue, Oct 13, 2009 at 08:47, Bingfeng Mei wrote: > > > a.c > > extern void foo(int); > > int main() > > { foo(20); > > return 1; > > } > > > > b.c > > #include > > void foo(int c) > > { > > printf("Hello world: %d\n", c); > > } > > > > compiled with: > > firepath-elf-gcc -flto a.c b.c -save-temps -O2 > > > > I expected that foo should be inlined into main, but not. > Both functions > > of main and foo are present in a.s, while b.s contains the > LTO code. > > Try adding -fwhole-program. > > > Diego. > >
RE: LTO question
> -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: 13 October 2009 16:15 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: LTO question > > On Tue, Oct 13, 2009 at 2:47 PM, Bingfeng Mei > wrote: > > Hello, > > I just had the first taste with the latest LTO merge on our port. > > Compiler is configured with LTO enabled and built correctly. > > I tried the following example: > > > > a.c > > extern void foo(int); > > int main() > > { foo(20); > > return 1; > > } > > > > b.c > > #include > > void foo(int c) > > { > > printf("Hello world: %d\n", c); > > } > > > > compiled with: > > firepath-elf-gcc -flto a.c b.c -save-temps -O2 > > > > I expected that foo should be inlined into main, but not. > Both functions > > of main and foo are present in a.s, while b.s contains the > LTO code. > > > > Did I miss something here? Are there new hooks I should > specify or tune for > > LTO? I checked the up-to-date internal manual and found nothing. > > non-inline declared functions are inlined at -O2 only if doing so > does not increase program size. Use -O3 or -finline-functions. But the function is only called once here. It should always decrease size unless my cost function is terribly wrong. I will check how other targets such as x86 do here. > > Richard. > > > Thanks, > > Bingfeng Mei > > > > > > > >
RE: Turning off unrolling to certain loops
Hello, I faced a similar issue a while ago and filed a bug report (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712). In the end, I implemented a simple tree-level unrolling pass in our port which uses all the existing infrastructure. It works quite well for our purposes, but I hesitated to submit the patch because it contains our not-very-elegant #pragma unroll implementation. The following two functions do the unrolling. I insert the pass just after the loop_prefetch pass (4.4). Cheers, Bingfeng Mei

/* Perfect unrolling of a loop */
static void
tree_unroll_perfect_loop (struct loop *loop, unsigned factor, edge exit)
{
  sbitmap wont_exit;
  edge e;
  bool ok;
  unsigned i;
  VEC (edge, heap) *to_remove = NULL;

  /* Unroll the loop and remove the exits in all iterations except
     for the last one.  */
  wont_exit = sbitmap_alloc (factor);
  sbitmap_ones (wont_exit);
  RESET_BIT (wont_exit, factor - 1);

  ok = gimple_duplicate_loop_to_header_edge (loop, loop_latch_edge (loop),
                                             factor - 1, wont_exit,
                                             exit, &to_remove,
                                             DLTHE_FLAG_UPDATE_FREQ);
  free (wont_exit);
  gcc_assert (ok);

  for (i = 0; VEC_iterate (edge, to_remove, i, e); i++)
    {
      ok = remove_path (e);
      gcc_assert (ok);
    }
  VEC_free (edge, heap, to_remove);
  update_ssa (TODO_update_ssa);

#ifdef ENABLE_CHECKING
  verify_flow_info ();
  verify_dominators (CDI_DOMINATORS);
  verify_loop_structure ();
  verify_loop_closed_ssa ();
#endif
}

/* Go through all the loops:
   1. Determine unrolling factor
   2. Unroll loops in different conditions
      -- perfect loop: no extra copy of original loop
      -- other loops: the original version of loops to execute
         the remaining iterations  */
static unsigned int
rest_of_tree_unroll (void)
{
  loop_iterator li;
  struct loop *loop;
  unsigned ninsns, unroll_factor;
  HOST_WIDE_INT est_niter;
  struct tree_niter_desc desc;
  bool unrolled = false;

  initialize_original_copy_tables ();

  /* Scan the loops, inner ones first.
*/
  FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
    {
      est_niter = estimated_loop_iterations_int (loop, false);
      ninsns = tree_num_loop_insns (loop, &eni_size_weights);
      unroll_factor = determine_unroll_factor (loop, ninsns, &desc,
                                               est_niter);
      if (unroll_factor != 1)
        {
          tree niters = number_of_exit_cond_executions (loop);
          bool perfect_unrolling = false;

          if (niters != NULL_TREE && niters != chrec_dont_know
              && TREE_CODE (niters) == INTEGER_CST)
            {
              int num_iters = tree_low_cst (niters, 1);
              if ((num_iters % unroll_factor) == 0)
                perfect_unrolling = true;
            }

          /* If the number of iterations can be divided by the
             unrolling factor, we have perfect unrolling.  */
          if (perfect_unrolling)
            tree_unroll_perfect_loop (loop, unroll_factor,
                                      single_dom_exit (loop));
          else
            tree_unroll_loop (loop, unroll_factor,
                              single_dom_exit (loop), &desc);
          unrolled = true;
        }
    }

  free_original_copy_tables ();

  /* Need to rebuild the call graph if new function calls are created
     by loop unrolling.  FIXME: rebuilding the cgraph loses some
     information, like the reason for not inlining.  */
  if (unrolled)
    rebuild_cgraph_edges ();
  /* debug_cgraph (); */

  return 0;
}

> -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Jean Christophe Beyler > Sent: 14 October 2009 19:29 > To: Zdenek Dvorak > Cc: gcc@gcc.gnu.org > Subject: Re: Turning off unrolling to certain loops > > Ok, I've actually gone a different route. Instead of waiting for the > middle end to perform this, I've directly modified the parser stage to > unroll the loop directly there. > > Basically, I take the parser of the for and modify how it adds the > various statements. Telling it to, instead of doing in the > c_finish_loop : > > if (body) > add_stmt (body); > if (clab) > add_stmt (build1 (LABEL_EXPR, void_type_node, clab)); > if (incr) > add_stmt (incr); > ... > > I tell it to add multiple copies of body and incr and then at the end > add in the loop the rest of it. 
I've also added support to remove > further unrolling to these modified loops and will be handling the > "No-unroll" pragma. I then let the rest of the optimization passes, > fuse the incrementations together if possible, etc. > > The initial results are quite good and seem to work and > produce good code. > > Currently, there are two possibilities : > > - If the loop is not in the form we want, for example: > > for (;i { > ... > } > > Do we still unroll even though we have to trust the user that the > number of unrolling will not break the semantics ? > &g
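The duplication step described above has one subtlety worth recording: GENERIC/GIMPLE trees must not be shared, so each emitted copy of the body needs to be unshared. A hedged sketch of the idea (this is my illustration of the approach, not Jc's actual patch; unroll_factor stands for whatever the pragma supplied):

```c
/* Inside c_finish_loop, for a loop carrying an unroll request.
   unshare_expr deep-copies the tree so no node is shared between
   the emitted copies, which later passes would reject.  */
if (body && unroll_factor > 1)
  {
    unsigned i;
    for (i = 0; i < unroll_factor; i++)
      {
        add_stmt (unshare_expr (body));
        if (incr)
          add_stmt (unshare_expr (incr));
      }
  }
```

The trade-off of unrolling this early is exactly the one raised in the reply: the front end must trust the user that duplicating the body preserves the loop's semantics, since no iteration-count analysis has run yet.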
RE: Turning off unrolling to certain loops
Jc, How did you implement #pragma unroll? I checked other compilers; the pragma should govern the next immediate loop. It took me a while to find a not-so-elegant way to do that. I also implemented #pragma ivdep. This information is supposed to be passed through both the tree and RTL levels and survive all GCC optimizations. I still have a problem handling the combination of unroll and ivdep. Bingfeng > -Original Message- > From: fearyours...@gmail.com [mailto:fearyours...@gmail.com] > On Behalf Of Jean Christophe Beyler > Sent: 15 October 2009 16:34 > To: Zdenek Dvorak > Cc: Bingfeng Mei; gcc@gcc.gnu.org > Subject: Re: Turning off unrolling to certain loops > > You are entirely correct, I hadn't thought that through enough. > > So I backtracked and have just merged what Bingfeng Mei has done with > your code and have now a corrected version of the loop unrolling. > > What I did was directly modified tree_unroll_loop to handle the case > of a perfect unroll or not internally and then used something similar > to what you had done around that. I added what I think is needed to > stop unrolling of those loops in later passes. > > I'll be starting my tests but I can port it to 4.5 if you are > interested to see what I did. > Jc > > On Thu, Oct 15, 2009 at 6:00 AM, Zdenek Dvorak > wrote: > > Hi, > > > >> I faced a similar issue a while ago. I filed a bug report > >> (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712) In the end, > >> I implemented a simple tree-level unrolling pass in our port > >> which uses all the existing infrastructure. It works quite well for > >> our purpose, but I hesitated to submit the patch because > it contains > >> our not-very-elegannt #prgama unroll implementation. > > > > could you please do so anyway? Even if there are some > issues with the > > #prgama unroll implementation, it could serve as a basis of a usable > > patch. 
> > > >> /* Perfect unrolling of a loop */ > >> static void tree_unroll_perfect_loop (struct loop *loop, > unsigned factor, > >> edge exit) > >> { > >> ... > >> } > >> > >> > >> > >> /* Go through all the loops: > >> 1. Determine unrolling factor > >> 2. Unroll loops in different conditions > >> -- perfect loop: no extra copy of original loop > >> -- other loops: the original version of loops to > execute the remaining iterations > >> */ > >> static unsigned int rest_of_tree_unroll (void) > >> { > > ... > >> tree niters = number_of_exit_cond_executions(loop); > >> > >> bool perfect_unrolling = false; > >> if(niters != NULL_TREE && niters!= chrec_dont_know > && TREE_CODE(niters) == INTEGER_CST){ > >> int num_iters = tree_low_cst(niters, 1); > >> if((num_iters % unroll_factor) == 0) > >> perfect_unrolling = true; > >> } > >> > >> /* If no. of iterations can be divided by unrolling > factor, we have perfect unrolling */ > >> if(perfect_unrolling){ > >> tree_unroll_perfect_loop(loop, unroll_factor, > single_dom_exit(loop)); > >> } > >> else{ > >> tree_unroll_loop (loop, unroll_factor, > single_dom_exit (loop), &desc); > >> } > > > > It would be better to move this test to tree_unroll_loop, and not > > duplicate its code in tree_unroll_perfect_loop. > > > > Zdenek > > > >
RE: Turning off unrolling to certain loops
The basic idea is the same as what is described in this thread: http://gcc.gnu.org/ml/gcc/2008-05/msg00436.html But I made many changes to Alex's method. Pragmas are encoded into the names of the helper functions because those are not optimized out by the tree-level optimizer. These pseudo functions are either consumed by target-builtins.c if only used at the tree level (unroll) or replaced in target-builtins.c with a special rtl NOTE (ivdep). To ensure the right scope of these loop pragmas, I also modified c_parser_for_statement, c_parser_while_statement, etc., to check the loop level. I define that these pragmas only control the next innermost loop. Once the right scope of the pragma is determined, I actually generate two helper functions for each pragma; the second is to close the scope of the pragma. When the pragma is used, I just search backward for the preceding helper function (tree-level) or special note (rtl-level). Bingfeng > -Original Message- > From: fearyours...@gmail.com [mailto:fearyours...@gmail.com] > On Behalf Of Jean Christophe Beyler > Sent: 15 October 2009 17:27 > To: Bingfeng Mei > Cc: Zdenek Dvorak; gcc@gcc.gnu.org > Subject: Re: Turning off unrolling to certain loops > > I implemented it like this: > > - I modified c_parser_for_statement to include a pragma tree node in > the loop with the unrolling request as an argument > > - Then during my pass to handle unrolling, I parse the loop > to find the pragma. > - I retrieve the unrolling factor and use a merge of Zdenek's > code and yours to perform either a perfect unrolling or and register > it in the loop structure > > - During the following passes that handle loop unrolling, I > look at that variable in the loop structure to see if yes or no, we > should allow the normal execution of the unrolling > > - During the expand, I transform the pragma into a note that > will allow me to remove any unrolling at that point. > > That is what I did and it seems to work quite well. 
> > Of course, I have a few things I am currently considering: > - Is there really a position that is better for the pragma node in > the tree representation ? > - Can other passes actually move that node out of a given loop > before I register it in the loop ? > - Should I, instead of keeping that node through the tree > optimizations, actually remove it and later on add it just before > expansion ? > - Can an RTL pass remove notes or move them out of a loop ? > - Can the tree node or note change whether or not an optimization > takes place? > - It is important to note that after the registration, the pragma > node or note are actually just there to say "don't do anything", > therefore, the number of nodes or notes in the loop is not important > as long as they are not moved out. > > Those are my current concerns :-), I can give more > information if requested, > Jc > > PS: What exactly was your solution to this issue? > > > On Thu, Oct 15, 2009 at 12:11 PM, Bingfeng Mei > wrote: > > Jc, > > How did you implement #pragma unroll? I checked other > compilers. The > > pragma should govern the next immediate loop. It took me a while to > > find a not-so-elegant way to do that. I also implemented > #pragma ivdep. > > These information are supposed to be passed through both > tree and RTL > > levels and suvive all GCC optimization. I still have > problem in handling > > combination of unroll and ivdep. > > > > Bingfeng > > > >> -Original Message- > >> From: fearyours...@gmail.com [mailto:fearyours...@gmail.com] > >> On Behalf Of Jean Christophe Beyler > >> Sent: 15 October 2009 16:34 > >> To: Zdenek Dvorak > >> Cc: Bingfeng Mei; gcc@gcc.gnu.org > >> Subject: Re: Turning off unrolling to certain loops > >> > >> You are entirely correct, I hadn't thought that through enough. > >> > >> So I backtracked and have just merged what Bingfeng Mei > has done with > >> your code and have now a corrected version of the loop unrolling. 
> >> > >> What I did was directly modified tree_unroll_loop to > handle the case > >> of a perfect unroll or not internally and then used > something similar > >> to what you had done around that. I added what I think is needed to > >> stop unrolling of those loops in later passes. > >> > >> I'll be starting my tests but I can port it to 4.5 if you are > >> interested to see what I did. > >> Jc > >> > >> On Thu, Oct 15, 2009 at 6:00 AM, Zdenek Dvorak > >> wrote: > >
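Concretely, the helper-function trick discussed in this thread amounts to a pragma handler that stashes the factor, after which the loop parsers emit opaque begin/end marker calls (for example __hypo_unroll_begin_4 () / __hypo_unroll_end ()) around the next innermost loop; opaque calls survive the tree optimizers, so a later pass can search backward for them. This is a hedged illustration with invented names; the real implementation lives in the private ports described above:

```c
/* Runs when the preprocessor sees "#pragma unroll N".  The stashed
   factor is picked up by c_parser_for_statement and friends, which
   emit the marker calls whose names encode it.  */
static unsigned pending_unroll_factor;

static void
hypo_pragma_unroll (struct cpp_reader *reader ATTRIBUTE_UNUSED)
{
  tree x;
  if (pragma_lex (&x) != CPP_NUMBER)
    {
      warning (OPT_Wpragmas, "invalid %<#pragma unroll%>");
      return;
    }
  pending_unroll_factor = TREE_INT_CST_LOW (x);
}
```

The backend then recognizes the markers by name and deletes them (tree level) or turns them into NOTEs (RTL level), which is how the information survives every optimization pass in between.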
How to avoid a tree node being garbage collected after C frontend?
Hello, I need to pass a tree node (a section name from processing pragmas) from the C front end to the main GCC body (used in TARGET_INSERT_ATTRIBUTES). I store the node in a global pointer array declared in target.c. But the tree node is garbage collected at the end of the c-parser pass, which causes an ICE later on. I am not familiar with the GC part at all. How do I prevent this from happening? I checked other targets. It seems v850 uses almost the same approach to implement a section-name pragma; I am not sure if it has the same problem. The issue is also very sensitive to certain conditions. For example, with the -save-temps option the bug disappears. Thanks, Bingfeng Mei
RE: How to avoid a tree node being garbage collected after C frontend?
Ian, Thanks. I tried to follow the examples, but it still doesn't work. Here is the related code: in target-c.c: extern GTY(()) tree pragma_ghs_sections[GHS_SECTION_COUNT]; ... pragma_ghs_sections[sec_num] = copy_node (sec_name); in target.c: ... section_name = pragma_ghs_sections[sec_num]; if (section_name == NULL_TREE) return; DECL_SECTION_NAME(decl) = section_name; ... When I watch the memory pragma_ghs_sections[sec_num] points to, it is modified by #0 0x006cb3e7 in memset () from /lib/tls/libc.so.6 #1 0xc4f0 in ?? () #2 0x08120da4 in poison_pages () at ../../src/gcc/ggc-page.c:1854 #3 0x08120ee6 in ggc_collect () at ../../src/gcc/ggc-page.c:1945 #4 0x080f3692 in c_parser_translation_unit (parser=0xf7f92834) at ../../src/gcc/c-parser.c:978 #5 0x08103bd7 in c_parse_file () at ../../src/gcc/c-parser.c:8290 So target.c won't get the correct section_name. What is wrong here? I understood that the GTY marker tells the GC that this global pointer holds references to GC-allocated memory. Thanks for any input, Bingfeng > -Original Message- > From: Ian Lance Taylor [mailto:i...@google.com] > Sent: 09 November 2009 19:00 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: How to avoid a tree node being garbage collected > after C frontend? > > "Bingfeng Mei" writes: > > > I need to pass a tree node (section name from processing pragmas) > > from C frontend to main GCC body (used in > TARGET_INSERT_ATTRIBUTES). > > I store the node in a global pointer array delcared in target.c. > > But the tree node is garbage collected in the end of c-parser > > pass, and causes an ICE later on. I am not familiar with GC part > > at all. How to prevent this from hanppening? > > Mark the global variable with GTY(()). See many many existing > examples. > > Ian > >
RE: How to avoid a tree node being garbage collected after C frontend?
Thanks, it works. I should have read the internal manual more carefully :-) Cheers, Bingfeng > -Original Message- > From: Basile STARYNKEVITCH [mailto:bas...@starynkevitch.net] > Sent: 10 November 2009 12:20 > To: Bingfeng Mei > Cc: Ian Lance Taylor; gcc@gcc.gnu.org > Subject: Re: How to avoid a tree node being garbage collected > after C frontend? > > Bingfeng Mei wrote: > > Ian, > > Thanks. I tried to follow the examples, but it still doesn't work. > > Here is the related code: > > > > in target-c.c: > > extern GTY(()) tree pragma_ghs_sections[GHS_SECTION_COUNT]; > > > > Perhaps you need to make sure that target-c.c is processed by > gengtype, and that it does include the generated > gt-target-c.h file. > > Regards. > > > -- > Basile STARYNKEVITCH http://starynkevitch.net/Basile/ > email: basilestarynkevitchnet mobile: +33 6 8501 2359 > 8, rue de la Faiencerie, 92340 Bourg La Reine, France > *** opinions {are only mines, sont seulement les miennes} *** > >
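Putting Ian's and Basile's advice together, a minimal sketch of the pattern (not standalone-compilable -- it only builds inside a GCC target port, and the function and macro names here are illustrative, not from the original code):

```c
/* In the port's C-frontend file, e.g. target-c.c.  GTY(()) marks the
   array as a garbage-collection root, so the tree nodes it references
   survive ggc_collect().  */
static GTY(()) tree pragma_ghs_sections[GHS_SECTION_COUNT];

void
target_handle_section_pragma (unsigned sec_num, tree sec_name)
{
  pragma_ghs_sections[sec_num] = copy_node (sec_name);
}

/* The roots are only registered if gengtype processes this file and the
   generated header is included at the bottom; forgetting either step
   gives exactly the poison_pages() symptom in the backtrace above.  */
#include "gt-target-c.h"
```

The file also has to be listed for gengtype processing, typically via target_gtfiles in config.gcc for target-specific sources.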
Is this patch of vector shift in 4.5?
Hello, Andrew, I am wondering whether this patch you mentioned has made it into 4.5? http://gcc.gnu.org/ml/gcc/2009-02/msg00381.html We would like to support it in our port if the frontend has been adapted to support it. Thanks, Bingfeng
RE: help on - adding a new pass to gcc
Did you add your new object file to OBJS-common list in Makefile.in? Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of ddmetro > Sent: 10 November 2009 16:25 > To: gcc@gcc.gnu.org > Subject: help on - adding a new pass to gcc > > > Hi All, > We are adding a new pass for - structural hazard > optimization - in > gcc. > We have added a rtl_opt_pass variable(pass_sched3) > declaration in > tree-pass.h and defined the same in a new file - sched-by-category.c > We then added a target in gcc/Makefile.in, as follows: > sched-by-category.o : sched-by-category.c $(CONFIG_H) $(SYSTEM_H) > coretypes.h $(TM_H) \ >$(RTL_H) $(SCHED_INT_H) $(REGS_H) hard-reg-set.h > $(FLAGS_H) insn-config.h > \ >$(FUNCTION_H) $(INSN_ATTR_H) $(TOPLEV_H) $(RECOG_H) > except.h $(PARAMS_H) > \ >$(TM_P_H) $(TARGET_H) $(CFGLAYOUT_H) $(TIMEVAR_H) tree-pass.h \ >$(DBGCNT_H) > > We are getting an error in passes.c - undefined reference to > 'pass_sched3' > Kindly guide us as to what is wrong in our approach > of adding a new > file to gcc build. > > Thanking You, > Dhiraj. > -- > View this message in context: > http://old.nabble.com/help-on---adding-a-new-pass-to-gcc-tp262 > 86452p26286452.html > Sent from the gcc - Dev mailing list archive at Nabble.com. > > >
Loop pragmas dilemma
Hi, Due to pressing requirements of our target processor/application, I am implementing several popular loop pragmas in our private port. I've already implemented "unroll" and "ivdep", and am now working on "loop_count" to give GCC hints about the number of iterations. The problem I am now facing is that GCC has many loop optimizations at both the tree and RTL levels that change loop properties. For example, loop versioning by unrolling called by the predom pass, and loop fission by the graphite passes. This makes loop_count simply wrong for the transformed loop(s). What is the best strategy? Updating the loop-count pragma to track changed loops, or disabling loop optimizations altogether in the presence of a loop pragma? To a lesser extent, loop optimizations also affect other loop pragmas. For example, I have to disable the cunroll pass in the presence of #pragma unroll because it is confusing for the user. Does anyone know how other compilers, e.g. icc, handle such issues? Thanks for any input, Bingfeng Mei Broadcom UK
RE: Worth balancing the tree before scheduling?
Hello, It seems to me that tree balancing risks producing a wrong result due to overflow of a subexpression. Say a = INT_MIN, b = 10, c = 10, d = INT_MAX. If (((a + b) + c) + d) becomes ((a + b) + (c + d)), then c + d will overflow while the original won't. So the behaviour of the two is different. Though the architecture may manage to produce a correct result, it is undefined behaviour I think. Cheers, Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Ian Bolton > Sent: 20 November 2009 15:05 > To: gcc@gcc.gnu.org > Subject: Worth balancing the tree before scheduling? > > From some simple experiments (see below), it appears as > though GCC aims > to > create a lop-sided tree when there are constants involved > (func1 below), > but a balanced tree when there aren't (func2 below). > > Our assumption is that GCC likes having constants all near to > each other > to > aid with tree-based optimisations, but I'm fairly sure that, when it > comes > to scheduling, it would be better to have a balanced tree, so > sched has > more > choices about what to schedule next? > > The impact of limiting sched's options can be seen if we look at the > pseudo-assembly produced by GCC for our architecture: > > func1: > LSHIFT $c3,$c1,3 # tmp137, a, > ADD $c2,$c2,$c3 # tmp138, b, tmp137 > ADD $c1,$c2,$c1 #, tmp138, a > > We think it would be better to avoid using the temporary: > > func1: > ADD $c2,$c1,$c2 # tmp137, a, b > LSHIFT $c1,$c1,3 # tmp138, a, > ADD $c1,$c2,$c1 # , tmp137, tmp138 > > As it currently stands, sched doesn't have the option to do > this because > its input (shown in func.c.172r.asmcons below) is arranged > such that the > first add depends on the shift and the second add depends on the first > add. > > If the tree were balanced, sched would have the option to do the add > first. > And, providing the logic was there in sched, we could make it > choose to > schedule such that we limit the number of temporaries used.
> > Maybe one of the RTL passes prior to scheduling has the potential to > balance the tree/RTL, but just isn't on our architecture? > > == > func.c: > -- > int func1 (int a, int b) > { > /* the original expression */ > return a + b + (a << 3); > } > > > int func2 (int a, int b, int c) > { > /* the original expression */ > return a + b + (a << c); > } > > > == > > == > func.c.129t.supress_extend: > -- > ;; Function func1 (func1) > > func1 (int a, int b) > { > : > return (b + (a << 3)) + a; > } > > func2 (int a, int b, int c) > { > : > return (b + a) + (a << c); > } > > > == > > == > func.c.172r.asmcons: > -- > > ;; Function func1 (func1) > > ;; Pred edge ENTRY [100.0%] (fallthru) > (note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK) > > (insn 2 5 3 2 func.c:2 (set (reg/v:SI 134 [ a ]) > (reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD > (reg:SI 1 > $c1 [ a ]) > (nil))) > > (note 3 2 4 2 NOTE_INSN_DELETED) > > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) > > (insn 7 4 8 2 func.c:2 (set (reg:SI 137) > (ashift:SI (reg/v:SI 134 [ a ]) > (const_int 3 [0x3]))) 80 {ashlsi3} (nil)) > > (insn 8 7 9 2 func.c:2 (set (reg:SI 138) > (plus:SI (reg:SI 2 $c2 [ b ]) > (reg:SI 137))) 65 {*addsi3} (expr_list:REG_DEAD > (reg:SI 137) > (expr_list:REG_DEAD (reg:SI 2 $c2 [ b ]) > (nil > > > (note 9 8 14 2 NOTE_INSN_DELETED) > > > (insn 14 9 20 2 func.c:5 (set (reg/i:SI 1 $c1) > (plus:SI (reg:SI 138) > (reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD > (reg:SI 138) > (expr_list:REG_DEAD (reg/v:SI 134 [ a ]) > (nil > > > (insn 20 14 0 2 func.c:5 (use (reg/i:SI 1 $c1)) -1 (nil)) > > ;; Function func2 (func2) > > ;; Pred edge ENTRY [100.0%] (fallthru) > (note 6 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK) > > > (insn 2 6 3 2 func.c:8 (set (reg/v:SI 134 [ a ]) > (reg:SI 1 $c1 [ a ])) 45 {*movsi} (expr_list:REG_DEAD > (reg:SI 1 > $c1 [ a ]) > (nil))) > > > (note 3 2 4 2 NOTE_INSN_DELETED) > > > (note 4 3 5 2 NOTE_INSN_DELETED) > > > (note 5 4 8 2 NOTE_INSN_FUNCTION_BEG) > > > (insn 8 5 9 2 func.c:8 (set 
(reg:SI 138) > (plus:SI (reg:SI 2 $c2 [ b ]) > (reg/v:SI 134 [ a ]))) 65 {*addsi3} (expr_list:REG_DEAD > (reg:SI 2 $c2 [ b ]) > (nil))) > > > (insn 9 8 10 2 func.c:8 (set (reg:SI 139) > (ashift:SI (reg/v:SI 134 [ a ]) > (reg:SI 3 $c3 [ c ]))) 80 {ashlsi3} (expr_list:REG_DEAD > (reg/v:SI 134 [ a ]) > (expr_list:REG_DEAD (reg:SI 3 $c3 [
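The overflow concern can be made concrete without ever executing undefined behaviour, using GCC's checked-addition builtin (a hedged sketch; `__builtin_add_overflow` is a GCC/clang extension that performs the addition and reports overflow instead of invoking signed-overflow UB):

```c
#include <limits.h>
#include <stdbool.h>

/* Would a+b+c+d, evaluated strictly left to right, overflow at some
   intermediate step?  */
static bool overflows_left_to_right (int a, int b, int c, int d)
{
  int t;
  return __builtin_add_overflow (a, b, &t)
         || __builtin_add_overflow (t, c, &t)
         || __builtin_add_overflow (t, d, &t);
}

/* Would the balanced association (a+b) + (c+d) overflow?  */
static bool overflows_balanced (int a, int b, int c, int d)
{
  int l, r, t;
  return __builtin_add_overflow (a, b, &l)
         || __builtin_add_overflow (c, d, &r)
         || __builtin_add_overflow (l, r, &t);
}
```

With a = INT_MIN, b = 10, c = 10, d = INT_MAX, the left-to-right order never overflows (the final sum is 19), while the balanced order overflows in c + d. So reassociation is only safe when the compiler can prove overflow cannot happen or when the arithmetic wraps.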
RE: HELP: data dependence
Data dependence analysis is done in sched-deps.c. You can have a look at the build_intra_loop_deps function in ddg.c (which constructs the data-dependency graph for the modulo scheduler) to see how it is used. Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Jianzhang Peng > Sent: 03 December 2009 09:56 > To: gcc@gcc.gnu.org > Subject: HELP: data dependence > > I want to get data dependence information about an basic block, which > contains RTLs. > What functions or data structure should I use ? > > thanks > > -- > Jianzhang Peng > >
Unnecessary PRE optimization
Hello, I encountered an issue with PRE optimization, which created worse code than no optimization. This is the test function: void foo(int *data, int *m_v4w, int num) { int i; int m0; for( i=0; i
RE: Unnecessary PRE optimization
-O2 > -Original Message- > From: Steven Bosscher [mailto:stevenb@gmail.com] > Sent: 23 December 2009 12:01 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org; dber...@dberlin.org > Subject: Re: Unnecessary PRE optimization > > On Wed, Dec 23, 2009 at 12:49 PM, Bingfeng Mei > wrote: > > Hello, > > I encounter an issue with PRE optimization, which created worse > > Is this at -O2 or -O3? > > Ciao! > Steven > >
RE: Unnecessary PRE optimization
Do you mean that if TARGET_ADDRESS_COST (non-x86) is defined properly, this should be fixed? Or does it require an extra patch? Bingfeng > -Original Message- > From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On > Behalf Of Paolo Bonzini > Sent: 23 December 2009 13:28 > To: Steven Bosscher > Cc: Bingfeng Mei; gcc@gcc.gnu.org; dber...@dberlin.org > Subject: Re: Unnecessary PRE optimization > > On 12/23/2009 01:01 PM, Steven Bosscher wrote: > > On Wed, Dec 23, 2009 at 12:49 PM, Bingfeng > Mei wrote: > >> Hello, > >> I encounter an issue with PRE optimization, which created worse > > > > Is this at -O2 or -O3? > > I think this could be fixed if fwprop propagated addresses > into loops; > it doesn't because it made performance worse on x86. The > real reason is > "address_cost on x86 sucks and nobody knows how to fix it > exactly", but > the performance hit was bad enough that we (Steven Bosscher and I) > decided to put that hack into fwprop. > > Paolo > >
RE: Unnecessary PRE optimization
It seems that just commenting out this check in fwprop.c should work. /* Do not propagate loop invariant definitions inside the loop. */ /* if (DF_REF_BB (def)->loop_father != DF_REF_BB (use)->loop_father) return;*/ Bingfeng > -Original Message- > From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On > Behalf Of Paolo Bonzini > Sent: 23 December 2009 15:01 > To: Bingfeng Mei > Cc: Steven Bosscher; gcc@gcc.gnu.org; dber...@dberlin.org > Subject: Re: Unnecessary PRE optimization > > On 12/23/2009 03:27 PM, Bingfeng Mei wrote: > > Do you mean if TARGET_ADDRES_COST (non-x86) is defined properly, > > this should be fixed? Or it requires extra patch? > > No, if TARGET_ADDRESS_COST was fixed for x86 (and of course defined > properly for your target), we could fix this very easily. > > Paolo > >
RE: PowerPC : GCC2 optimises better than GCC4???
I can confirm that our target also generates good code for this case. Maybe this is an EABI or target-specific thing, where structs/unions are forced to memory. Bingfeng Broadcom UK > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Andrew Haley > Sent: 04 January 2010 16:08 > To: gcc@gcc.gnu.org > Subject: Re: PowerPC : GCC2 optimises better than GCC4??? > > On 01/04/2010 12:07 PM, Jakub Jelinek wrote: > > On Mon, Jan 04, 2010 at 12:18:50PM +0100, Steven Bosscher wrote: > >>On Mon, Jan 4, 2010 at 12:02 PM, Andrew Haley > wrote: > >>> This optimization is done by the first RTL cse pass. I > can't understand > >>> why it's not being done for your target. I guess this will need a > >>> powerpc expert. > >> > >> Known bug, see http://gcc.gnu.org/PR22141 > > > > That's unrelated. PR22141 is about (lack of) merging of > adjacent stores of > > constant values into memory, but there are no memory stores > involved here, > > everything is in registers, so PR22141 patch will make zero > difference here. > > > > IMHO we really should have some late tree pass that > converts adjacent > > bitfield operations into integral operations on > non-bitfields (likely with > > alias set of the whole containing aggregate), as at the RTL > level many cases > > are simply too many instructions for combine etc. to > optimize them properly, > > while at the tree level it could be simpler. > > Yabbut, how come RTL cse can handle it in x86_64, but PPC not? > > Andrew. > >
RE: GCC-How does the coding style affect the insv pattern recognization?
Your instruction is likely too specific to be picked up by GCC. You may use an intrinsic for it. Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of fanqifei > Sent: 12 January 2010 12:50 > To: gcc@gcc.gnu.org > Subject: GCC-How does the coding style affect the insv > pattern recognization? > > Hi, > I am working on a micro controller and trying to port > gcc(4.3.2) for it. > There is insv instruction in our micro controller and I have add > define_insn to machine description file. > However, the insv instruction can only be generated when the code > is written like below. If the code is written using logical shift and > or operators, the insv instruction will not be generated. > For the statement: x= (x&0xFF00) | ((i<<16)&0x00FF); > 6 RTL instructions are generated after combine pass and 8 > instructions are generated in the assembly file. > Paolo Bonzini said that insv instruction might be synthesized > later by combine. But combine only works on at most 3 instructions and > insv is not generated in such case. > So exactly when will the insv pattern be recognized and how does > the coding style affect it? > Is there any open bug report about this? > > struct test_foo { > unsigned int a:18; > unsigned int b:2; > unsigned int c:12; > }; > > struct test_foo x; > > unsigned int foo() > { > unsigned int a=x.b; > x.b=2; > return a; > } > > Thanks! > fanqifei > >
RE: GCC-How does the coding style affect the insv pattern recognization?
OOPs, I don't know that. Anyway, I won't count on GCC to reliably pick up these complex patterns. In our port, we implemented clz/ffs/etc as intrinsics though they are present as standard patterns. Bingfeng > -Original Message- > From: fanqifei [mailto:fanqi...@gmail.com] > Sent: 13 January 2010 10:26 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: GCC-How does the coding style affect the insv > pattern recognization? > > 2010/1/13 Bingfeng Mei : > > Your instruction is likely too specific to be picked up by GCC. > > You may use an intrinisc for it. > > > > Bingfeng > > but insv is a standard pattern name. > the semantics of expression x= (x&0xFF00) | ((i<<16)&0x00FF); > is exactly what insv can do. > I all tried mips gcc cross compiler, and ins is also not generated. > Intrinsic is a way to resolve this though. Maybe there is no > other better way. > > BTW, > There is a special case(the bit position is 0): > 235: f0 97 fc mvi a9 -0x4; #move immediate to reg > 238: ff e9 94 and a9 a14 a9; > 23b: f0 95 02 or a9 0x2; > The above three instructions can be replaced by mvi and insv. But the > fact is not in the combine pass. > > Qifei Fan > >
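For reference, the shift-and-mask form that one hopes the combiner turns into a single insv can be written as a generic helper (a hypothetical function, not from the thread):

```c
#include <stdint.h>

/* Insert the low `len` bits of `val` into bits [pos, pos+len) of x.
   This is the open-coded equivalent of a bitfield store, i.e. the
   read-modify-write that an insv instruction performs in one step.  */
static uint32_t insert_field (uint32_t x, uint32_t val, unsigned pos,
                              unsigned len)
{
  /* Guard len == 32: 1u << 32 would be undefined.  */
  uint32_t field = (len < 32) ? ((1u << len) - 1u) : 0xFFFFFFFFu;
  uint32_t mask = field << pos;
  return (x & ~mask) | ((val << pos) & mask);
}
```

On a target whose ABI allocates bitfields from the low end, the `x.b = 2` store in the example corresponds to `insert_field (bits, 2, 18, 2)`; bit allocation order is implementation-defined, which is one more reason ports often expose such instructions as intrinsics instead.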
A bug on 32-bit host?
Hello, I am tracking a bug and find the lshift_value function in expmed.c questionable (both HEAD and GCC 4.4). Suppose HOST_BITS_PER_WIDE_INT = 32, bitpos = 0 and bitsize = 33; then the following expression is wrong: high = (v >> (HOST_BITS_PER_WIDE_INT - bitpos)) & ((1 << (bitpos + bitsize - HOST_BITS_PER_WIDE_INT)) - 1); Shifting v right by 32 bits on a 32-bit machine is undefined. On i386, v >> 32 results in v, which is not the intention of the function. Cheers, Bingfeng Mei static rtx lshift_value (enum machine_mode mode, rtx value, int bitpos, int bitsize) { unsigned HOST_WIDE_INT v = INTVAL (value); HOST_WIDE_INT low, high; if (bitsize < HOST_BITS_PER_WIDE_INT) v &= ~((HOST_WIDE_INT) -1 << bitsize); if (bitpos < HOST_BITS_PER_WIDE_INT) { low = v << bitpos; /* Obtain value by shifting and set zeros for remaining part*/ if((bitpos + bitsize) > HOST_BITS_PER_WIDE_INT) high = (v >> (HOST_BITS_PER_WIDE_INT - bitpos)) & ((1 << (bitpos + bitsize - HOST_BITS_PER_WIDE_INT)) - 1); else high = 0; } else { low = 0; high = v << (bitpos - HOST_BITS_PER_WIDE_INT); } return immed_double_const (low, high, mode); }
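The defensive version of the computation can be modelled in plain C (32-bit "words" stand in for HOST_WIDE_INT; names are illustrative): every shift count is kept strictly below the word width, so the i386 mod-32 shift behaviour never comes into play.

```c
#include <stdint.h>

#define WORD_BITS 32u

/* Split (v << bitpos) across a low and a high 32-bit word without ever
   shifting a 32-bit value by 32 or more (undefined in C; on i386 the
   count is taken mod 32, so v >> 32 yields v rather than 0).  */
static void lshift_split (uint32_t v, unsigned bitpos,
                          uint32_t *low, uint32_t *high)
{
  if (bitpos < WORD_BITS)
    {
      *low = v << bitpos;
      /* v >> (32 - bitpos) would be v >> 32 when bitpos == 0.  */
      *high = bitpos ? (v >> (WORD_BITS - bitpos)) : 0;
    }
  else
    {
      *low = 0;
      *high = v << (bitpos - WORD_BITS);
    }
}
```

This only models the bitpos side of the problem; the full fix in lshift_value needs the same guard on the bitsize-dependent mask as well.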
RE: A bug on 32-bit host?
Oops, that is embarrassing. Usually any local changes are marked with #ifdef in our port. I should double-check next time when I report an issue. Thanks. Bingfeng > -Original Message- > From: Ian Lance Taylor [mailto:i...@google.com] > Sent: 22 January 2010 15:04 > To: Bingfeng Mei > Cc: gcc@gcc.gnu.org > Subject: Re: A bug on 32-bit host? > > "Bingfeng Mei" writes: > > > /* Obtain value by shifting and set zeros for remaining part*/ > > if((bitpos + bitsize) > HOST_BITS_PER_WIDE_INT) > > high = (v >> (HOST_BITS_PER_WIDE_INT - bitpos)) > > & ((1 << (bitpos + bitsize - > HOST_BITS_PER_WIDE_INT)) - 1); > > That is not what expmed.c looks like on mainline or on gcc 4.4 branch. > You must have a local change. > > Ian > >
RE: GCC calling GNU assembler
GCC just emits the string in your asm expression literally, together with the other assembly code generated by the compiler. Only in the next step is the assembler invoked by the GCC driver. Typically, hard register numbers are not used, so that GCC can do register allocation for inline assembly. Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Nikola Ikonic > Sent: 03 February 2010 09:27 > To: gcc@gcc.gnu.org > Subject: GCC calling GNU assembler > > Hello all, > > Could anybody please answer me on following question: > > where is GCC callin assembler where it recognizes assembler code in C > function? For example, let's say that there is this line in C code: > > asm("mov r1,r0"); > > So, the parser parses this as an assembler string. But where, in GCC > code, is assembler called to process this string? > Or maybe the question is where this "mov r1, r0" string is passed to > assembler. Anyway, I think you got my question. > > Thanks in advance! > > Best regards, >Nikola > >
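The register-allocation point can be seen with GCC extended asm: constraints such as "r" let GCC pick the hard register and substitute its name into the template string before the whole output is handed to the assembler (a hedged illustration; the empty template emits no instruction at all, so it runs on any target):

```c
/* "=r"(y) asks GCC to choose any general register for the output; the
   "0" constraint ties the input to the same register.  GCC substitutes
   the chosen register wherever %0/%1 appear in the template.  Here the
   template is empty, so nothing is emitted and y simply receives x's
   value through the shared register.  */
static int through_register (int x)
{
  int y;
  __asm__ ("" : "=r" (y) : "0" (x));
  return y;
}
```

With a hard register name written directly into the template (as in `mov r1,r0`), GCC passes the text through unchanged and cannot coordinate its own allocation with it.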
RE: Function versioning tests?
Hi, GCC 4.5 already contains such patch. http://gcc.gnu.org/ml/gcc-patches/2009-03/msg01186.html If you are working on 4.4 branch, you can just apply the patch without problem. Cheers, Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Ian Bolton > Sent: 19 February 2010 17:09 > To: gcc@gcc.gnu.org > Subject: Function versioning tests? > > Hi there, > > I've changed our private port of GCC to give versioned functions > better names (rather than T.0, T.1), and was wondering if there > are any existing tests that push function-versioning to the limit, > so I can test whether my naming scheme is sound. > > Failing that, I'd appreciate some pointers on how I might make > such a test. I know I need to be passing a constant in as a > parameter, but I don't know what other criteria are required to > make it trigger. > > Cheers, > Ian > >
Issue in combine pass.
Hello, I experienced an ICE for no-scevccp-outer-16.c in our port. It seems not in other ports so I couldn't file a bug report. Baiscally, the problem appears after the following transformations in expand_compound_operation (combine.c). Enter expand_compound_operation x: (zero_extend:SI (subreg:HI (plus:V4HI (reg:V4HI 143 [ vect_var_.65 ]) (reg:V4HI 142 [ vect_var_.65 ])) 0)) tem = gen_lowpart (mode, XEXP (x, 0)); tem = (subreg:SI (plus:V4HI (reg:V4HI 143 [ vect_var_.65 ]) (reg:V4HI 142 [ vect_var_.65 ])) 0) tem = simplify_shift_const (NULL_RTX, ASHIFT, mode, tem, modewidth - pos - len); tem = (subreg:SI (ashift:V4HI (plus:V4HI (reg:V4HI 143 [ vect_var_.65 ]) (reg:V4HI 142 [ vect_var_.65 ])) (const_int 16 [0x10])) 0) tem = simplify_shift_const (NULL_RTX, unsignedp ? LSHIFTRT : ASHIFTRT, mode, tem, modewidth - len); /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c: In function 'main': /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/vect/no-scevccp-outer-16.c:59:1: internal compiler error: in trunc_int_for_mode, at explow.c:56 Please submit a full bug report, #0 internal_error (gmsgid=0xe9aa77 "in %s, at %s:%d") at ../../src/gcc/diagnostic.c:707 #1 0x005acf23 in fancy_abort (file=0xea8453 "../../src/gcc/explow.c", line=56, function=0xea8440 "trunc_int_for_mode") at ../../src/gcc/diagnostic.c:763 #2 0x0060528b in trunc_int_for_mode (c=65535, mode=V4HImode) at ../../src/gcc/explow.c:56 #3 0x005edf24 in gen_int_mode (c=65535, mode=V4HImode) at ../../src/gcc/emit-rtl.c:459 #4 0x00cf22d9 in simplify_and_const_int (x=0x0, mode=V4HImode, varop=0x2a957a8420, constop=65535) at ../../src/gcc/combine.c:9038 #5 0x00cf462f in simplify_shift_const_1 (code=LSHIFTRT, result_mode=SImode, varop=0x2a957a0600, orig_count=16) at ../../src/gcc/combine.c:10073 #6 0x00cf47cf in simplify_shift_const (x=0x0, code=LSHIFTRT, result_mode=SImode, varop=0x2a957a8408, count=16) at ../../src/gcc/combine.c:10122 #7 0x00cebbf9 in 
expand_compound_operation (x=0x2a95789c20) at ../../src/gcc/combine.c:6517 #8 0x00ce8afe in combine_simplify_rtx (x=0x2a95789c20, op0_mode=HImode, in_dest=0) at ../../src/gcc/combine.c:5535 #9 0x00ce6da5 in subst (x=0x2a95789c20, from=0x2a95781ba0, to=0x2a957a83a8, in_dest=0, unique_copy=0) at ../../src/gcc/combine.c:4884 #10 0x00ce6b53 in subst (x=0x2a957a0660, from=0x2a95781ba0, to=0x2a957a83a8, in_dest=0, unique_copy=0) at ../../src/gcc/combine.c:4812 #11 0x00ce13ed in try_combine (i3=0x2a957a1678, i2=0x2a957a1630, i1=0x0, new_direct_jump_p=0x7fbfffeafc) at ../../src/gcc/combine.c:2963 ... It seems to me that both gen_lowpart and simplify_shift_const do the wrong thing in handling vector types. (zero_extend:SI (subreg:HI (V4HI))) is not equal to (subreg:SI (V4HI)), is it? simplify_shift_const produces (ashift:V4HI (V4HI ..) (16)), which is not right either. Does shifting a vector by a const value mean shifting every element of the vector, or treating the vector as an entity? The internals manual is not very clear about that. Thanks, Bingfeng Mei
Release novops attribute for external use?
Hello, One of our engineers requested a feature so that the compiler can avoid reloading variables after a function call if the callee is known not to write to memory. It should slash considerable code size in our applications. I found the existing "pure" and "const" attributes cannot meet his requirements, because the function is optimized out if it doesn't return a value. I almost started to implement a new attribute in our own port, only to find out the "novops" attribute is exactly what we want. Why is "novops" limited to internal use only? Does it have any other implication? Could we release this attribute for external use as well? Thanks, Bingfeng Mei
RE: Release novops attribute for external use?
Something like printf (though I read somewhere that a glibc extension of printf makes it non-pure). Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Andrew Haley > Sent: 12 April 2010 17:34 > To: gcc@gcc.gnu.org > Subject: Re: Release novops attribute for external use? > > On 04/12/2010 05:27 PM, Bingfeng Mei wrote: > > Hello, > > One of our engineers requested a feature so that > > compiler can avoid to re-load variables after a function > > call if it is known not to write to memory. It should > > slash considerable code size in our applications. I found > > the existing "pure" and "const" cannot meet his requirements > > because the function is optimized out if it doesn't return > > a value. > > If a function doesn't write to memory and it doesn't return a > value, what is the point of calling it? > > Andrew. > >
RE: Release novops attribute for external use?
> > Surely printf writes to global memory (it clobbers the stdout FILE*) > OK, the point is not about whether printf is pure or not. Instead, if the programmer knows a callee function such as printf contains no memory access that affects operations inside the caller, he would like a way to optimize the code. Our engineer gave the following example: void myfunc(MyStruct *myStruct) { int a,b; a = myStruct->a; printf("a=%d\n",a); b = 2*myStruct->a; // I would like to have the compiler acting as if I had written b = 2*a; ... } Providing such an attribute may be potentially dangerous. But it is just like the "restrict" qualifier and some other attributes, putting the responsibility for correctness on the programmer. "novops" seems to achieve that effect, though its semantics doesn't match exactly what I described. > As for the original question - novops is internal only because its > semantics is purely internal and changes with internal aliasing > changes. > > Now, we still lack a compelling example to see what exact semantics > you are requesting? I suppose it might be close to a pure but > volatile function? Which you could simulate by > > dummy = pure_fn (); > asm ("" : "g" (dummy)); > > or even > > volatile int dummy = pure_fn (); These two methods still generate extra code to reload variables. Bingfeng > > Richard. > > > Bingfeng > > > >> -Original Message- > >> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > >> Behalf Of Andrew Haley > >> Sent: 12 April 2010 17:34 > >> To: gcc@gcc.gnu.org > >> Subject: Re: Release novops attribute for external use? > >> > >> On 04/12/2010 05:27 PM, Bingfeng Mei wrote: > >> > Hello, > >> > One of our engineers requested a feature so that > >> > compiler can avoid to re-load variables after a function > >> > call if it is known not to write to memory. It should > >> > slash considerable code size in our applications.
I found > >> > the existing "pure" and "const" cannot meet his requirements > >> > because the function is optimized out if it doesn't return > >> > a value. > >> > >> If a function doesn't write to memory and it doesn't return a > >> value, what is the point of calling it? > >> > >> Andrew. > >> > >> > > > >
RE: Release novops attribute for external use?
Thanks! I forgot to declare the function as pure. The empty asm seems to be a clever trick to avoid function being optimized out. I shall tell our engineers to use this instead of implementing a new attribute. Bingfeng > -Original Message- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: 13 April 2010 11:25 > To: Bingfeng Mei > Cc: Andrew Haley; gcc@gcc.gnu.org > Subject: Re: Release novops attribute for external use? > > On Tue, Apr 13, 2010 at 12:23 PM, Richard Guenther > wrote: > > On Tue, Apr 13, 2010 at 12:15 PM, Bingfeng Mei > wrote: > >>> > >>> Surely printf writes to global memory (it clobbers the > stdout FILE*) > >>> > >> OK, the point is not about whether printf is pure or not. > Instead, if > >> programmer knows the callee function such as printf contains no > >> memory access that affects operations inside caller > function, and he > >> would like to have a way to optimize the code. Our > engineer gave following > >> example: > >> > >> void myfunc(MyStruct *myStruct) > >> { > >> int a,b; > >> a = myStruct->a; > >> printf("a=%d\n",a); > >> b = 2*mystruct->a; // I would like to have the > compiler acting as if I had written b = 2*a; > >> ... > >> } > >> Providing such attribute may be potentially dangerous. But > it is just > >> like "restrict" qualifier and some other attributes, > putting responsibilty > >> of correctness on the programmer. "novops" seems to > achieve that effect, > >> though its semantics doesn't match exactly what I described. > > > > Indeed. IPA pointer analysis will probably figure it out > > automagically - that *myStruct didn't escape the unit. > > Being able to annotate incoming pointers this way would > > maybe be useful. > > > >>> As for the original question - novops is internal only because its > >>> semantics is purely internal and changes with internal aliasing > >>> changes. > >>> > >>> Now, we still lack a compelling example to see what exact > semantics > >>> you are requesting? 
I suppose it might be close to a pure but > >>> volatile function? Which you could simulate by > >>> > >>> dummy = pure_fn (); > >>> asm ("" : "g" (dummy)); > >>> > >>> or even > >>> > >>> volatile int dummy = pure_fn (); > >> > >> These two methods still generate extra code to reload variables > > > > The latter works for me (ok, the store to dummy is retained): > > > > extern int myprintf(int) __attribute__((pure)); > > int myfunc (int *p) > > { > > int a; > > a = *p; > > volatile int dummy = myprintf(a); > > return a + *p; > > } > > > > myfunc: > > .LFB0: > > pushq %rbx > > .LCFI0: > > subq $16, %rsp > > .LCFI1: > > movl (%rdi), %ebx > > movl %ebx, %edi > > call myprintf > > movl %eax, 12(%rsp) > > leal (%rbx,%rbx), %eax > > addq $16, %rsp > > .LCFI2: > > popq %rbx > > .LCFI3: > > ret > > > > so we load from %rdi only once. > > And > > extern int myprintf(int) __attribute__((pure)); > int myfunc (int *p) > { > int a; > a = *p; > int dummy = myprintf(a); > asm ("" : : "g" (dummy)); > return a + *p; > } > > produces > > myfunc: > .LFB0: > pushq %rbx > .LCFI0: > movl(%rdi), %ebx > movl%ebx, %edi > callmyprintf > leal(%rbx,%rbx), %eax > popq%rbx > .LCFI1: > ret > > even better. > > Richard. > >
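A self-contained version of the trick the thread settles on (GCC-specific: `__attribute__((pure))` promises the function reads but never writes global memory, and the empty asm keeps the otherwise-unused result, and hence the call, alive; the function names are illustrative):

```c
static int shared_state = 5;   /* global that the "pure" callee reads */

/* Promised pure: may read globals and arguments, performs no stores
   visible to the caller.  Like `restrict', the promise is on the
   programmer; a lying pure function miscompiles silently.  */
__attribute__ ((pure))
static int report (int a)
{
  return a + shared_state;
}

static int myfunc (const int *p)
{
  int a = *p;
  int dummy = report (a);
  __asm__ ("" : : "g" (dummy));  /* keep the call from being DCE'd */
  return a + *p;                 /* the compiler may reuse a instead
                                    of reloading *p after the call */
}
```

Because the call is pure, the second read of *p can be satisfied from the register holding a, which is exactly the reload elimination being asked for.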
Which target has working modulo scheduling?
Hello, I tried to enable modulo scheduling for our target VLIW. It fails even for the simplest loop. I would like to have a look at how GCC produces schedule for other targets. I know that modulo scheduling relies on doloop_end pattern to identify a pipelineable loop. There are only a handful of targets supporting doloop_end. Which among them are known to work well with modulo scheduling? Thanks in advance. Cheers, Bingfeng Mei Broadcom UK
Is there any plan for "data propagation from Tree SSA to RTL" to be in GCC mainline?
Hello, I found current modulo pipelining very inefficient for many loops. One reason is the primitive cross-iteration memory dependency analysis. The add_inter_loop_mem_dep function in ddg.c just draws a true dependency between every write and read pair. This is quite inadequate, since many loops read from memory at the beginning of the loop and write to memory at the end. In the end, we obtain a schedule no better than list scheduling. I am aware of this work on propagating tree-level dependency info to RTL (http://sysrun.haifa.il.ibm.com/hrl/greps2007/papers/melnik-propagation-greps2007.pdf). It should help a lot in improving memory dependency analysis. Is there any plan for this work to make it into GCC mainline? Thanks in advance. Kind Regards, Bingfeng Mei Broadcom UK
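To make the loop shape concrete, here is the kind of kernel being described (an illustrative example, not from the original code): the only memory read happens at the top of the body and the only write at the bottom, to a different array, so there is no genuine cross-iteration dependence, yet add_inter_loop_mem_dep still draws a true-dependence arc from the store to the load.

```c
/* Each iteration reads a[i] first and writes b[i] last; with disjoint
   arrays the iterations are independent, but a blanket write->read arc
   between the two MEMs serializes the software-pipelined schedule.  */
static void scale_add (const int *a, int *b, int k, int n)
{
  for (int i = 0; i < n; i++)
    {
      int t = a[i];      /* load at the top of the body */
      t = t * k + 1;
      b[i] = t;          /* store at the bottom */
    }
}
```

At the source level the `restrict` qualifier is the standard way to assert the arrays do not alias; the thread is about recovering the same information at the RTL level, where it has been lost.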
RE: Is there any plan for "data propagation from Tree SSA to RTL" to be in GCC mainline?
I found the GSoC project and patch here (2007 only): http://code.google.com/soc/2007/gcc/appinfo.html?csaid=E0FEBB869A5F65A8 Is this patch only for propagating data dependency, or does it include propagating alias info as well? Bingfeng > -Original Message- > From: Andrey Belevantsev > [mailto:[EMAIL PROTECTED] On Behalf Of Andrey Belevantsev > Sent: 09 November 2008 20:31 > To: Diego Novillo > Cc: Steven Bosscher; Bingfeng Mei; gcc@gcc.gnu.org; > [EMAIL PROTECTED]; Daniel Berlin > Subject: Re: Is there any plan for "data propagation from > Tree SSA to RTL" to be in GCC mainline? > > Diego Novillo wrote: > > On Sun, Nov 9, 2008 at 06:38, Steven Bosscher > <[EMAIL PROTECTED]> wrote: > > > >> Wasn't there a GSoC project for this last year? And this year? > >> > >> It'd be interesting to hear if anything came out of that... > > > > Nothing came of that, unfortunately. > There are two patches, actually. The patch of propagating data > dependences to RTL is ready and working, it wasn't (at that time) > committed just because it was initially completed during stage3. The > patch for propagating alias info wasn't finished within the scope of > this year's GSoC, unfortunately, and I take it more as my > fault than a > student's fault, as I failed to help him locally with > organizing his work. > > We are nevertheless trying to put some work into finishing > this patch. > As it is not completed yet, I don't have a subject to > discuss. I hope > that before the next stage1 we'll manage to finish the patches and to > unify them before submitting, as the mechanism they use for > mapping MEMs > to trees is the same. If we'd not finish the second patch, > we'll submit > the first one anyways. > > Sorry for not writing this earlier -- I've had a few busy > months (mostly > finishing and defending ph.d. thesis :) > > Andrey > >
RE: Is there any plan for "data propagation from Tree SSA to RTL" to be in GCC mainline?
I found it quite hard to merge the patch into the current trunk HEAD since many things have changed. Do you know which revision you used? I would like to run a test to see whether it is effective in solving the memory dependency issue in SMS. Thanks. Bingfeng

> -Original Message-
> From: Andrey Belevantsev [mailto:[EMAIL PROTECTED]
> Sent: 11 November 2008 13:53
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is there any plan for "data propagation from Tree SSA to RTL" to be in GCC mainline?
>
> Bingfeng Mei wrote:
> > I found the GSoC project and patch here (only 2007)
> > http://code.google.com/soc/2007/gcc/appinfo.html?csaid=E0FEBB869A5F65A8
> >
> > Is this patch only for propagating data dependency or does it include propagating alias info as well?
> The patch at http://gcc.gnu.org/ml/gcc/2007-12/msg00240.html (I presume
> this is the same patch, I'm just giving you the link to its submission
> to the GCC ML) only does propagating data dependency info.
>
> Andrey
RE: generate assembly mnemonic depending the resource allocation
You can use C statements to return a modified template string, such as:

(define_insn "addsi3"
  [(set (match_operand:SI 0 "general_register_operand" "=d")
        (plus:SI (match_operand:SI 1 "general_register_operand" "d")
                 (match_operand:SI 2 "general_register_operand" "d")))]
  ""
{
  switch (slot_used) {
    case 0: return "add-slot0 %0, %1, %2";
    case 1: return "add-slot1 %0, %1, %2";
    case 2: return "add-slot2 %0, %1, %2";
  }
}
  [(set_attr "type" "alu")
   (set_attr "mode" "SI")
   (set_attr "length" "1")])

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Alex Turjan
> Sent: 03 December 2008 10:34
> To: gcc@gcc.gnu.org
> Subject: generate assembly mnemonic depending the resource allocation
>
> Hi all,
> I'm building a gcc target for a VLIW machine that can execute
> the same instruction on different resources (slots) and,
> depending on which resources are allocated, the instruction
> must have a different mnemonic. Is it possible in gcc to have
> for the same define_insn constraints (depending on the
> allocated architecture resources) different assembly instructions?
>
> Here is an example:
> Consider the following addSI RTL pattern:
> (define_insn "addsi3"
>   [(set (match_operand:SI 0 "general_register_operand" "=d")
>         (plus:SI (match_operand:SI 1 "general_register_operand" "d")
>                  (match_operand:SI 2 "general_register_operand" "d")))]
>   ""
>   "add %0,%1,%2"
>   [(set_attr "type" "alu")
>    (set_attr "mode" "SI")
>    (set_attr "length" "1")])
>
> On my target machine "alu" is a reservation that occupies one
> of the following 3 slots: "slot1|slot2|slot3", and I need to
> generate assembly code with different mnemonics depending on
> which slot the instruction was scheduled:
>
> add-slot1 %0,%1,%2 // if scheduled on slot 1
> add-slot2 %0,%1,%2 // if scheduled on slot 2
> add-slot3 %0,%1,%2 // if scheduled on slot 3
>
> Alex
Bug in optimize_bitfield_assignment_op()?
Hello, My GCC port for our own VLIW processor tracks mainline weekly. Test 991118-1.c has failed since two weeks ago. The following is a simplified version of 991118-1.c. After some investigation, I found that the following statement is expanded to RTL wrongly.

;; tmp2.field = () () ((long long int) tmp2.field ^ 0x8765412345678);

(insn 9 8 10 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
   (set (reg/f:SI 88)
        (symbol_ref:SI ("tmp2") [flags 0x2] )) -1 (nil))

(insn 10 9 11 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
   (set (reg:DI 89)
        (mem/s/j/c:DI (reg/f:SI 88) [0+0 S8 A64])) -1 (nil))

(insn 11 10 12 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
   (set:DI (reg:DI 90)
        (const_int 284280 [0x45678])) -1 (nil))        <-- wrong constant

(insn 12 11 13 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
   (set (reg:DI 91)
        (xor:DI (reg:DI 89) (reg:DI 90))) -1 (nil))

(insn 13 12 0 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/991118-1.c:23
   (set (mem/s/j/c:DI (reg/f:SI 88) [0+0 S8 A64])
        (reg:DI 91)) -1 (nil))

Insn 11 only preserves the lower 20 bits of the 52-bit constant. Further investigation shows the problem arises in the optimize_bitfield_assignment_op function (expr.c):

    ...
    case BIT_XOR_EXPR:
      if (TREE_CODE (op1) != INTEGER_CST)
        break;
      value = expand_expr (op1, NULL_RTX, GET_MODE (str_rtx), EXPAND_NORMAL);
      value = convert_modes (GET_MODE (str_rtx),
                             TYPE_MODE (TREE_TYPE (op1)), value,
                             TYPE_UNSIGNED (TREE_TYPE (op1)));

      /* We may be accessing data outside the field, which means
         we can alias adjacent data.  */
      if (MEM_P (str_rtx))
        {
          str_rtx = shallow_copy_rtx (str_rtx);
          set_mem_alias_set (str_rtx, 0);
          set_mem_expr (str_rtx, 0);
        }

      binop = TREE_CODE (src) == BIT_IOR_EXPR ? ior_optab : xor_optab;
      if (bitpos + bitsize != GET_MODE_BITSIZE (GET_MODE (str_rtx)))
        {
          rtx mask = GEN_INT (((unsigned HOST_WIDE_INT) 1 << bitsize)
                              - 1);                    <-- suspected bug
          value = expand_and (GET_MODE (str_rtx), value, mask, NULL_RTX);
        }
      value = expand_shift (LSHIFT_EXPR, GET_MODE (str_rtx), value,
                            build_int_cst (NULL_TREE, bitpos), NULL_RTX, 1);
      result = expand_binop (GET_MODE (str_rtx), binop, str_rtx, value,
                             str_rtx, 1, OPTAB_WIDEN);

Here bitpos = 0 and bitsize = 52. HOST_WIDE_INT is 32 bits for our processor, though the 64-bit long long type is supported. The marked statement produces a mask of 0xf, thus causing the upper 32 bits to be removed later. Is this a potential bug, or did I miss something? I also tried an older version (> 2 weeks ago). This function is not called at all there, so correct code is produced. Cheers, Bingfeng Broadcom UK
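For what it's worth, the truncation is reproducible outside GCC. The sketch below is my own illustration (the function name is made up, not GCC code): it computes the same mask expression but in a type guaranteed to be 64 bits wide. Shifting a 32-bit type by bitsize = 52 is undefined behaviour in ISO C (on x86 the shift count is typically taken modulo 32, yielding a 20-bit mask), which matches the truncation seen in insn 11.

```c
#include <stdint.h>

/* Hypothetical stand-in for the GEN_INT mask expression above,
   (((unsigned HOST_WIDE_INT) 1 << bitsize) - 1), computed in a
   64-bit type so that bitsize = 52 is a valid shift count. */
static uint64_t
bitfield_mask (unsigned int bitsize)
{
  if (bitsize >= 64)
    return UINT64_MAX;
  return ((uint64_t) 1 << bitsize) - 1;
}
```

bitfield_mask (52) yields 0xfffffffffffff, the 52 one-bits the field needs; a 32-bit HOST_WIDE_INT cannot represent that mask at all.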
[IRA] New register allocator question
Hello, I recently ported our GCC to the new IRA by following mainline development. The only interface I added is IRA_COVER_CLASSES. Our architecture has a predicate register file. When a predicate register has to be spilled, the new IRA produces inferior code to the old register allocator. The old allocator first tries to spill to the general register file, which is far cheaper on our architecture than spilling to memory. The IRA always spills the predicate register directly to memory.

#define IRA_COVER_CLASSES \
{ \
  GR_REGS, PR_REGS, M_REGS, BXBC_REGS, LIM_REG_CLASSES \
}

Apart from the above macro, what other interfaces/parameters can I tune to change this behaviour in the new IRA? Thanks in advance. Happy New Year, Bingfeng Mei Broadcom UK.
RE: [IRA] New register allocator question
I found that if I define a new register class that covers both GR_REGS and PR_REGS, the issue is solved. The new IRA spills the predicate register to a general register first instead of memory. Is this the right approach?

#define IRA_COVER_CLASSES \
{ \
  GRPR_REGS, M_REGS, BXBC_REGS, LIM_REG_CLASSES \
}

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Bingfeng Mei
> Sent: 02 January 2009 11:50
> To: gcc@gcc.gnu.org
> Cc: Vladimir Makarov
> Subject: [IRA] New register allocator question
>
> Hello,
> I recently ported our GCC to new IRA by following mainline
> development. The only interface I added is IRA_COVER_CLASSES.
> Our architecture has predicate register file. When predicate
> register has to be spilled, the new IRA produces inferior code
> to the old register allocator. The old allocator first tries to
> spill to general register file, which is far cheaper on our
> architecture than spilling to memory. The IRA always spills the
> predicate register to memory directly.
>
> #define IRA_COVER_CLASSES \
> { \
>   GR_REGS, PR_REGS, M_REGS, BXBC_REGS, LIM_REG_CLASSES \
> }
>
> Apart from above macro, what other interfaces/parameters I
> can tune to change this behaviour in new IRA? Thanks in advance.
>
> Happy New Year,
> Bingfeng Mei
>
> Broadcom UK.
Document error on TARGET_ASM_NAMED_SECTION ?
Hello, According to the current GCC internals manual (http://gcc.gnu.org/onlinedocs/gccint/File-Framework.html#index-TARGET_005fASM_005fNAMED_005fSECTION-4335):

- Target Hook: void TARGET_ASM_NAMED_SECTION (const char *name, unsigned int flags, unsigned int align)
  Output assembly directives to switch to section name. The section should have attributes as specified by flags, which is a bit mask of the SECTION_* flags defined in output.h. If align is nonzero, it contains an alignment in bytes to be used for the section, otherwise some target default should be used. Only targets that must specify an alignment within the section directive need pay attention to align -- we will still use ASM_OUTPUT_ALIGN.

But actually the third argument should be "tree decl" instead of "unsigned int align". The following is the default hook:

default_elf_asm_named_section (const char *name, unsigned int flags, tree decl ATTRIBUTE_UNUSED)

Is it an error, or am I missing something? Cheers, Bingfeng Mei
Solve transitive closure issue in modulo scheduling
Hello, I am trying to make modulo scheduling work more efficiently for our VLIW target. I found that one serious issue preventing the current SMS algorithm from achieving high IPC is the so-called "transitive closure" problem, where the scheduling window is calculated using only direct predecessors and successors. Because SMS is not an iterative algorithm, this may cause failures in finding a valid schedule. Without splitting rows, some simple loops just cannot be scheduled no matter how big the II is. With splitting rows, a schedule can be found, but only at a bigger II. The GCC wiki (http://gcc.gnu.org/wiki/SwingModuloScheduling) lists this as a TODO. Is there any work going on about this issue (the last wiki update was one year ago)? If no one is working on it, I plan to do it. My idea is to use the MinDist algorithm described in B. Rau's classic paper "Iterative Modulo Scheduling" (http://www.hpl.hp.com/techreports/94/HPL-94-115.html). The same algorithm can also be used to compute a better RecMII. The biggest concern is the complexity of computing the MinDist matrix, which is O(N^3), where N is the number of nodes in the loop. I remember somewhere the GCC coding guide says "never write a quadratic algorithm" :-) Is this an absolute requirement? If yes, I will keep it as target-specific code (we are less concerned about compilation time). Otherwise, I will try to make it more generic to see if it can make it into mainline in 4.5. Any comments? Cheers, Bingfeng Mei Broadcom UK
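For reference, the core of the MinDist computation from Rau's paper can be sketched as a Floyd-Warshall-style maximization (this is my own sketch of the algorithm, not GCC code; the function and array names are made up). Each DDG edge e = (i, j) seeds dist[i][j] with delay(e) - II * distance(e); the relaxation then takes the maximum over all paths, giving the minimum schedule-time separation between every pair of nodes. This triple loop is the O(N^3) step mentioned above, and a strictly positive dist[i][i] afterwards means the chosen II is infeasible, which is how the same matrix yields a tighter RecMII.

```c
#include <limits.h>

#define NONE INT_MIN  /* marks "no path between these nodes" */

/* Sketch of the MinDist matrix relaxation.  On entry, dist[i][j] is
   delay(e) - II * distance(e) for each DDG edge e = (i, j), and NONE
   elsewhere.  On exit, dist[i][j] is the minimum number of cycles
   node j must be scheduled after node i, over all dependence paths. */
static void
compute_mindist (int n, int dist[n][n])
{
  for (int k = 0; k < n; k++)
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        if (dist[i][k] != NONE && dist[k][j] != NONE
            && dist[i][k] + dist[k][j] > dist[i][j])
          dist[i][j] = dist[i][k] + dist[k][j];
}
```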
Difference between vec_shl_<mode> and ashl<mode>3
Hello, Could anyone explain to me what the difference is between the vec_shl_<mode> and ashl<mode>3 patterns? It seems to me that both shift a vector operand 1 by scalar operand 2. I tried to understand some targets' implementation, e.g., ia64 as follows, and cannot grasp their difference. Does the "whole vector shift" of vec_shl mean treating a vector as one long scalar? Thanks in advance.

(define_insn "lshr<mode>3"
  [(set (match_operand:VECINT24 0 "gr_register_operand" "=r")
        (lshiftrt:VECINT24
          (match_operand:VECINT24 1 "gr_register_operand" "r")
          (match_operand:DI 2 "gr_reg_or_5bit_operand" "rn")))]
  ""
  "pshr.u %0 = %1, %2"
  [(set_attr "itanium_class" "mmshf")])

(define_expand "vec_shr_<mode>"
  [(set (match_operand:VECINT 0 "gr_register_operand" "")
        (lshiftrt:DI (match_operand:VECINT 1 "gr_register_operand" "")
                     (match_operand:DI 2 "gr_reg_or_6bit_operand" "")))]
  ""
{
  operands[0] = gen_lowpart (DImode, operands[0]);
  operands[1] = gen_lowpart (DImode, operands[1]);
})

Cheers, Bingfeng Mei Broadcom UK
RE: Difference between vec_shl_<mode> and ashl<mode>3
Ian, Thanks for the prompt reply. Just out of curiosity: isn't this naming convention for shift instructions inconsistent with other patterns? For example, we can define add<mode>3 and GCC will automatically use it for vectorization or in a plus expression of two vector types. Why do shifts need special names? Bingfeng

> -Original Message-
> From: Ian Lance Taylor [mailto:i...@google.com]
> Sent: 10 February 2009 14:31
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Difference between vec_shl_<mode> and ashl<mode>3
>
> "Bingfeng Mei" writes:
>
> > Could anyone explain to me what is difference between
> > vec_shl_<mode> and ashl<mode>3 patterns? It seems to me
> > that both shift a vector operand 1 with scalar operand 2.
>
> The difference is that with a vector mode gcc will look for the standard
> name vec_shl_MODE, and with a non-vector mode gcc will look for the
> standard name lshlMODE or ashlMODE.
>
> > I tried to understand some targets' implementation, e.g., ia64 as
> > follows, and cannot grasp their difference.
>
> The name which matters is vec_shr_<mode>. The fact that the ia64 names
> the real insn <mode>3 does not imply that that insn name is actually
> used by anything. vec_shr_<mode> is a define_expand which expands
> into a pattern which is recognized by the <mode>3 insn. The name of
> the <mode>3 insn could change or be removed and everything would
> work.
>
> Ian
Native support for vector shift
Hello, For targets that support vectors, we can write the following code:

typedef short V4H __attribute__ ((vector_size (8)));

V4H tst(V4H a, V4H b){
  return a + b;
}

Other operators such as -, *, |, &, ^, etc. are also supported. However, vector shift is not supported by the frontend, with either a scalar or a vector second operand:

V4H tst(V4H a, V4H b){
  return a << 3;
}

V4H tst(V4H a, V4H b){
  return a << b;
}

Currently, we have to use intrinsics to support such shifts. Isn't the syntax of vector shift intuitive enough to be supported natively? Someone may argue it breaks the C language, but vectors are a GCC extension anyway, and support for vector add/sub/etc. already extends C syntax. Any thoughts? Sorry if this issue has been raised in the past. Greetings, Bingfeng Mei Broadcom UK
RE: Instrument gcc
Did you compile with -O0? A function may be inlined and a symbol may be optimized away with -O1 and above. Bingfeng > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On > Behalf Of Vincent R. > Sent: 24 February 2009 15:38 > To: gcc@gcc.gnu.org > Subject: Instrument gcc > > Hi, > > even if I am simple mortal I would like to understand or at > least follow > what is going on with gcc. > Generally when I run gdb and try to breakpoint inside a > function I get a > undefined symbol or something like that. > I suppose this is because gcc is not a simple static exe but > depends on > other binaries (g++, cpp, ...). > So my question is how can I debug step by step gcc ? > Let's say for instance I want to breakpoint the function > init_exception_processing located in gcc/gcc/cp > and related to c++ exceptions > > This GDB was configured as "i486-linux-gnu"... > (gdb) b init_exception_processing > Function "init_exception_processing" not defined. > Make breakpoint pending on future shared library load? (y or [n]) > > What is the magical trick to be able to follow what is going on. > > Thanks > > >
RE: Native support for vector shift
Yes, at least the first case (scalar operand 2) is supported by valarray. http://www.reading.ac.uk/SerDepts/su/Topic/Pgram/PgSWC+FP01/Workshop/stdlib/stdref/val_6244.htm#Non-member%20Binary%20Operators

Additionally, if we follow the valarray guideline, GCC should also support code like:

V4H a, c;
short b;
c = a + b;

instead of

c = a + (V4H){b, b, b, b};

This can be useful.

> -Original Message-
> From: Joseph Myers [mailto:jos...@codesourcery.com]
> Sent: 24 February 2009 18:52
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Native support for vector shift
>
> On Tue, 24 Feb 2009, Bingfeng Mei wrote:
>
> > Currently, we have to use intrinsics to support such shift. Isn't syntax
> > of vector shift intuitive enough to be supported natively? Someone may
> > argue it breaks the C language. But vector is a GCC extension anyway.
> > Support for vector add/sub/etc already break C syntax. Any thought?
>
> The general guideline we've followed for C vector extensions is "like C++
> valarray". Does it support this? (This isn't an absolute rule in either
> direction, but a useful guide and a set of semantics that have been
> well-tested in practice.)
>
> --
> Joseph S. Myers
> jos...@codesourcery.com
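For concreteness, here is a small check (my own example, compiled as C++) of what std::valarray actually accepts: operator<< is provided both for (valarray, scalar) and elementwise for (valarray, valarray), and operator+ takes a scalar on either side, matching the c = a + b case above. The brace initialization assumes a C++11 compiler; the function name is illustrative, only the std::valarray operators are real.

```cpp
#include <valarray>

// Probe valarray's shift and mixed scalar/vector arithmetic support.
bool check_valarray()
{
  std::valarray<int> a = {1, 2, 3, 4};
  std::valarray<int> s = {0, 1, 2, 3};

  std::valarray<int> x = a << 2;   // vector << scalar
  std::valarray<int> y = a << s;   // elementwise vector << vector
  std::valarray<int> z = a + 10;   // vector + scalar (splatted)

  return x[3] == 16 && y[3] == 32 && z[0] == 11;
}
```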
RE: Native support for vector shift
Yes, I am aware of both types of vector shift. Our target VLIW actually supports both, and I have implemented all the related patterns in our port. But it would still be nice to allow programmers to use vector shifts explicitly, preferably both types. Bingfeng

> -Original Message-
> From: Michael Meissner [mailto:meiss...@linux.vnet.ibm.com]
> Sent: 24 February 2009 21:07
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Native support for vector shift
>
> On Tue, Feb 24, 2009 at 06:15:37AM -0800, Bingfeng Mei wrote:
> > Hello,
> > For the targets that support vectors, we can write the following code:
> >
> > typedef short V4H __attribute__ ((vector_size (8)));
> >
> > V4H tst(V4H a, V4H b){
> >   return a + b;
> > }
> >
> > Other operators such as -, *, |, &, ^ etc are also supported. However,
> > vector shift is not supported by frontend, including both scalar and
> > vector second operands.
> >
> > V4H tst(V4H a, V4H b){
> >   return a << 3;
> > }
> >
> > V4H tst(V4H a, V4H b){
> >   return a << b;
> > }
> >
> > Currently, we have to use intrinsics to support such shift. Isn't syntax
> > of vector shift intuitive enough to be supported natively? Someone may
> > argue it breaks the C language. But vector is a GCC extension anyway.
> > Support for vector add/sub/etc already break C syntax. Any thought?
> > Sorry if this issue had been raised in past.
>
> Note, internally there are two different types of vector shift. Some machines
> support a vector shift by a scalar, some machines support a vector shift by a
> vector. One future machine (x86_64 with -msse5) can support both types of
> vector shifts.
>
> The auto vectorizer now can deal with both types:
>
>   for (i = 0; i < n; i++)
>     a[i] = b[i] << c
>
> will generate a vector shift by a scalar on machines with that support, and
> splat the scalar into a vector for the second set of machines.
>
> If the machine only has vector shift by a scalar, the auto vectorizer will not
> generate a vector shift for:
>
>   for (i = 0; i < n; i++)
>     a[i] = b[i] << c[i]
>
> Internally, the compiler uses the standard shift names for vector shift by a
> scalar (i.e. ashl, ashr, lshl), and a v prefix for the vector
> by vector shifts (i.e. vashl, vashr, vlshl).
>
> The rotate patterns are also similar.
>
> --
> Michael Meissner, IBM
> 4 Technology Place Drive, MS 2203A, Westford, MA, 01886, USA
> meiss...@linux.vnet.ibm.com
Why are these two functions compiled differently?
Hello, I came across the following example and their .final_cleanup files. To me, both functions should produce the same code, but tst1 actually requires two extra sign-extend instructions compared with tst2. Is this a C semantics thing, or does GCC mis-compile (over-conservatively) the first case? Cheers, Bingfeng Mei Broadcom UK

#define A 255

int tst1(short a, short b){
  if(a > (b - A))
    return 0;
  else
    return 1;
}

int tst2(short a, short b){
  short c = b - A;
  if(a > c)
    return 0;
  else
    return 1;
}

.final_cleanup:

;; Function tst1 (tst1)

tst1 (short int a, short int b)
{
<bb 2>:
  return (int) b + -254 > (int) a;
}

;; Function tst2 (tst2)

tst2 (short int a, short int b)
{
<bb 2>:
  return (short int) ((short unsigned int) b + 65281) >= a;
}
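For what it's worth, the two functions above are not equivalent under ISO C, so a compiler may not fold them the same way. The block below restates them in runnable form together with the reasoning; note that the wrap-around of the short conversion is implementation-defined, so the concrete values assume GCC's usual modulo-2^16 behaviour.

```c
#define A 255

/* In tst1, (b - A) is computed in int after the usual promotions and
   the comparison is also done in int.  In tst2, the intermediate
   result is first converted back to short; for out-of-range values
   that conversion is implementation-defined (GCC wraps modulo 2^16).
   With b = -32768: b - 255 is -33023 in tst1, but c becomes 32513 in
   tst2, so for a = 0 the two comparisons go opposite ways. */
int tst1 (short a, short b)
{
  if (a > (b - A))
    return 0;
  else
    return 1;
}

int tst2 (short a, short b)
{
  short c = b - A;
  if (a > c)
    return 0;
  else
    return 1;
}
```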
RE: Why are these two functions compiled differently?
Should I file a bug report? If it is not a C semantics thing, GCC certainly produces unnecessarily big code.

	.file	"tst.c"
	.text
	.p2align 4,,15
.globl tst1
	.type	tst1, @function
tst1:
.LFB0:
	.cfi_startproc
	movswl	%si, %esi
	movswl	%di, %edi
	xorl	%eax, %eax
	subl	$254, %esi
	cmpl	%edi, %esi
	setg	%al
	ret
	.cfi_endproc
.LFE0:
	.size	tst1, .-tst1
	.p2align 4,,15
.globl tst2
	.type	tst2, @function
tst2:
.LFB1:
	.cfi_startproc
	subw	$255, %si
	xorl	%eax, %eax
	cmpw	%di, %si
	setge	%al
	ret
	.cfi_endproc
.LFE1:
	.size	tst2, .-tst2
	.ident	"GCC: (GNU) 4.4.0 20090218 (experimental) [trunk revision 143368]"
	.section	.note.GNU-stack,"",@progbits

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 03 March 2009 15:16
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org; John Redford
> Subject: Re: Why are these two functions compiled differently?
>
> On Tue, Mar 3, 2009 at 4:06 PM, Bingfeng Mei wrote:
> > Hello,
> > I came across the following example and their .final_cleanup files.
> > To me, both functions should produce the same code. But tst1 function
> > actually requires two extra sign_extend instructions compared with
> > tst2. Is this a C semantics thing, or GCC mis-compile
> > (over-conservatively) in the first case.
>
> Both transformations are already done by the fronted (or fold), likely
> shorten_compare is quilty for tst1 and fold_unary for tst2 (which
> folds (short)((int)b - (int)A)).
>
> Richard.
>
> > Cheers,
> > Bingfeng Mei
> > Broadcom UK
> >
> > #define A 255
> >
> > int tst1(short a, short b){
> >   if(a > (b - A))
> >     return 0;
> >   else
> >     return 1;
> > }
> >
> > int tst2(short a, short b){
> >   short c = b - A;
> >   if(a > c)
> >     return 0;
> >   else
> >     return 1;
> > }
> >
> > .final_cleanup
> > ;; Function tst1 (tst1)
> >
> > tst1 (short int a, short int b)
> > {
> > <bb 2>:
> >   return (int) b + -254 > (int) a;
> > }
> >
> > ;; Function tst2 (tst2)
> >
> > tst2 (short int a, short int b)
> > {
> > <bb 2>:
> >   return (short int) ((short unsigned int) b + 65281) >= a;
> > }
Is const_int zero extended or sign-extended?
Hello, I am confused by one very basic concept :). In the following rtx expression, if const_int is 32-bit and DImode is 64-bit, will the const_int be sign-extended or zero-extended? In other words, is the content of reg:DI 95 0xfffffffffffffff9 or 0x00000000fffffff9 after this instruction?

(set:DI (reg:DI 95) (const_int -7 [0xfffffff9]))

Thanks, Bingfeng Mei
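In plain C terms, the question is whether the 32-bit pattern for -7 is widened with sign or zero extension. The sketch below is my own illustration of the two readings, not GCC internals; per the reply quoted later in the thread, RTL const_ints behave like the sign-extending case (the value is a signed number in the destination mode).

```c
#include <stdint.h>

/* Two possible readings of putting a 32-bit -7 (0xfffffff9) into a
   64-bit register: sign extension yields 0xfffffffffffffff9, zero
   extension yields 0x00000000fffffff9. */
uint64_t sign_extend32 (uint32_t v)
{
  return (uint64_t) (int64_t) (int32_t) v;
}

uint64_t zero_extend32 (uint32_t v)
{
  return (uint64_t) v;
}
```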
Understand BLKmode and returning structure in register.
Hello, I came across an issue regarding BLKmode and returning a structure in a register. For the following code, I try to return the structure in a register instead of memory.

extern void abort();

typedef struct {
  short x;
  short y;
} COMPLEX;

COMPLEX foo (void) __attribute__ ((noinline));
COMPLEX foo (void)
{
  COMPLEX x;

  x.x = -7;
  x.y = -7;

  return x;
}

int main(){
  COMPLEX x = foo();
  if(x.y != -7)
    abort();
}

In the foo function, compute_record_mode will set the mode for struct COMPLEX to BLKmode, partly because STRICT_ALIGNMENT is 1 on my target. In the TARGET_RETURN_IN_MEMORY hook, I return 1 for BLKmode types and 0 otherwise for small sizes (< 8 bytes), like MIPS. Thus, this structure is still returned through memory, which is not very efficient. More importantly, the ABI is NOT FIXED in such a situation. If an assembly programmer writes a function returning a structure, how does he know whether the structure will be treated as BLKmode or not? So he doesn't know whether to pass the result through memory or a register. Do I understand correctly?

On the other hand, if I return 0 based only on the struct type's size, regardless of BLKmode, GCC produces very inefficient code. For example, stack setup code in foo is still generated even though it is totally unnecessary.

Only when I set STRICT_ALIGNMENT to 0 can the structure be passed through a register in an efficient way. Unfortunately, our machine is strictly aligned and I cannot really do that. Any suggestion? Thanks, Bingfeng Mei Broadcom UK
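As a back-of-the-envelope check that a register return is feasible here (my own illustration, not an ABI definition): the whole COMPLEX struct is only 4 bytes, so it fits in a single 32-bit register, and a register-returning ABI is essentially doing the packing below.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
  short x;
  short y;
} COMPLEX;

/* Pack/unpack COMPLEX through a 32-bit integer: the portable C picture
   of returning the struct in one SImode register.  Assumes
   sizeof (COMPLEX) == 4, i.e. 16-bit shorts and no padding. */
uint32_t pack_complex (COMPLEX c)
{
  uint32_t r;
  memcpy (&r, &c, sizeof r);
  return r;
}

COMPLEX unpack_complex (uint32_t r)
{
  COMPLEX c;
  memcpy (&c, &r, sizeof c);
  return c;
}
```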
RE: Understand BLKmode and returning structure in register.
I found that compiling for MIPS with -mabi=n32 produces such inefficient code. With -mabi=n32, mips_return_in_memory returns 0 if the size is small, regardless of BLKmode.

	.type	foo, @function
foo:
	.frame	$sp,16,$31		# vars= 16, regs= 0/0, args= 0, gp= 0
	addiu	$sp,$sp,-16
	li	$2,-7			# 0xfff9
	sh	$2,0($sp)
	sh	$2,2($sp)
	ld	$3,0($sp)
	addiu	$sp,$sp,16
	dsrl	$4,$3,32
	andi	$4,$4,0xffff
	dsrl	$3,$3,48
	dsll	$4,$4,32
	dsll	$2,$3,48
	j	$31
	or	$2,$2,$4

	.ent	main
	.type	main, @function
main:
	addiu	$sp,$sp,-48
	sd	$31,40($sp)
	jal	foo
	nop
	dsra	$3,$2,32
	dsrl	$2,$2,48
	sh	$3,18($sp)
	sh	$2,16($sp)
	lw	$2,16($sp)
	sll	$3,$2,16
	sw	$2,0($sp)
	sra	$3,$3,16
	li	$2,-7			# 0xfff9
	bne	$3,$2,$L8
	ld	$31,40($sp)
	j	$31
	addiu	$sp,$sp,48
$L8:
	jal	abort
	nop

With the old ABI, the produced code is much simpler, but the structure is returned through memory. mips_return_in_memory returns 1 because the structure type is BLKmode.

foo:
	li	$3,-7			# 0xfff9
	move	$2,$4
	sh	$3,0($4)
	j	$31
	sh	$3,2($4)

	.ent	main
	.type	main, @function
main:
	.frame	$sp,32,$31		# vars= 8, regs= 1/0, args= 16, gp= 0
	addiu	$sp,$sp,-32
	sw	$31,28($sp)
	jal	foo
	addiu	$4,$sp,16
	lh	$3,18($sp)
	li	$2,-7			# 0xfff9
	bne	$3,$2,$L8
	nop
	lw	$31,28($sp)
	nop
	j	$31
	addiu	$sp,$sp,32
$L8:
	jal	abort
	nop

> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Bingfeng Mei
> Sent: 13 March 2009 16:35
> To: gcc@gcc.gnu.org
> Cc: Adrian Ashley
> Subject: Understand BLKmode and returning structure in register.
>
> Hello,
> I came across an issue regarding BLKmode and returning structure in
> register. For following code, I try to return the structure in
> register instead of memory.
>
> extern void abort();
> typedef struct {
>   short x;
>   short y;
> } COMPLEX;
>
> COMPLEX foo (void) __attribute__ ((noinline));
> COMPLEX foo (void)
> {
>   COMPLEX x;
>
>   x.x = -7;
>   x.y = -7;
>
>   return x;
> }
>
> int main(){
>   COMPLEX x = foo();
>   if(x.y != -7)
>     abort();
> }
>
> In foo function, compute_record_mode function will set the mode for
> struct COMPLEX as BLKmode partly because STRICT_ALIGNMENT is 1 on my
> target. In TARGET_RETURN_IN_MEMORY hook, I return 1 for BLKmode type
> and 0 otherwise for small size (<8) (like MIPS). Thus, this structure
> is still returned through memory, which is not very efficient. More
> importantly, ABI is NOT FIXED under such situation. If an assembly
> code programmer writes a function returning a structure. How does he
> know the structure will be treated as BLKmode or otherwise? So he
> doesn't know whether to pass result through memory or register. Do I
> understand correctly?
>
> On the other hand, if I return 0 only according to struct type's size
> regardless BLKmode or not, GCC will produces very inefficient code.
> For example, stack setup code in foo is still generated even it is
> totally unnecessary.
>
> Only when I set STRICT_ALIGNMENT to 0, the structure can be passed
> through register in an efficient way. Unfortunately, our machine is
> strictly aligned and I cannot really do that.
>
> Any suggestion?
>
> Thanks,
> Bingfeng Mei
> Broadcom UK
Typo or intended?
Hello, I just updated our port to include the last 2-3 weeks of GCC development. I noticed a large number of test failures at -O1 that use a user-defined data type (based on a special register file of our processor). All variables of such a type are now spilled to memory, which we don't allow at -O1 because it is too expensive. After investigation, I found that the following new code causes the trouble. I don't quite understand the function of the new code, but I don't see what's special about -O1 in terms of register allocation in comparison with higher optimization levels. If I change it to (optimize < 1), everything is fine as before. I am starting to wonder whether (optimize <= 1) is a typo or intended. Thanks in advance. Cheers, Bingfeng Mei Broadcom UK

  if ((! flag_caller_saves && ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
      /* For debugging purposes don't put user defined variables in
         callee-clobbered registers.  */
      || (optimize <= 1                      <-- why include -O1?
          && (attrs = REG_ATTRS (regno_reg_rtx[ALLOCNO_REGNO (a)])) != NULL
          && (decl = attrs->decl) != NULL
          && VAR_OR_FUNCTION_DECL_P (decl)
          && ! DECL_ARTIFICIAL (decl)))
    {
      IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
                        call_used_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
                        call_used_reg_set);
    }
  else if (ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
    {
      IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
                        no_caller_save_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
                        temp_hard_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
                        no_caller_save_reg_set);
      IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
                        temp_hard_reg_set);
    }
RE: Understand BLKmode and returning structure in register.
Thanks for the reply. There should be more opportunities for strictly aligned machines. In my example, the structure is a local variable allocated on the stack. I don't see why it is marked as BLKmode: the compiler has full freedom to make it aligned and use DImode instead. Bingfeng

> -Original Message-
> From: Richard Sandiford [mailto:rdsandif...@googlemail.com]
> Sent: 16 March 2009 22:14
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org; Adrian Ashley
> Subject: Re: Understand BLKmode and returning structure in register.
>
> "Bingfeng Mei" writes:
> > In foo function, compute_record_mode function will set the mode for
> > struct COMPLEX as BLKmode partly because STRICT_ALIGNMENT is 1 on my
> > target. In TARGET_RETURN_IN_MEMORY hook, I return 1 for BLKmode type
> > and 0 otherwise for small size (<8) (like MIPS). Thus, this structure
> > is still returned through memory, which is not very efficient. More
> > importantly, ABI is NOT FIXED under such situation. If an assembly
> > code programmer writes a function returning a structure. How does he
> > know the structure will be treated as BLKmode or otherwise? So he
> > doesn't know whether to pass result through memory or register. Do I
> > understand correctly?
>
> Yes. I think having TARGET_RETURN_IN_MEMORY depend on internal details
> like the RTL mode is often seen as an historical mistake. As you say,
> the ABI should be defined directly by the type instead.
>
> Unfortunately, once you start using a mode, it's difficult to stop
> using a mode without breaking compatibility. So one of the main reasons
> the MIPS port still uses the mode is because no-one dares touch it.
>
> Likewise, it's now difficult to change the mode attached to a structure
> (which could potentially make structure accesses more efficient) without
> accidentally breaking someone's ABI.
>
> > On the other hand, if I return 0 only according to struct type's size
> > regardless BLKmode or not, GCC will produces very inefficient
> > code. For example, stack setup code in foo is still generated even it
> > is totally unnecessary.
>
> Yeah, there's definitely room for improvement here. And as you say,
> it's already a problem for MIPS. I think it's just one of those things
> that doesn't occur often enough in critical code for anyone to have
> spent time optimising it.
>
> Richard
RE: Is const_int zero extended or sign-extended?
I am tracking a bug; I'm not sure whether it is a generic GCC bug or my port gone wrong. To access the structure below:

typedef struct {
  long int p_x, p_y;
} Point;
...
p1.p_x = -1;
...

it is expanded to the following RTL:

;; p1.p_x = -1;

(insn 19 18 20 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/2808-1.c:14
   (set (reg:DI 98)
        (ior:DI (reg/v:DI 87 [ p1 ])
                (const_int -1 [0xffffffffffffffff]))) -1 (nil))

(insn 20 19 0 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.c-torture/execute/2808-1.c:14
   (set (reg/v:DI 87 [ p1 ])
        (reg:DI 98)) -1 (nil))

According to your explanation, (reg:DI 98) will get -1 (0xffffffffffffffff) after insn 19, which is wrong. Am I right? Thanks, Bingfeng

> -Original Message-
> From: Dave Korn [mailto:dave.korn.cyg...@googlemail.com]
> Sent: 12 March 2009 17:53
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Is const_int zero extended or sign-extended?
>
> Bingfeng Mei wrote:
> > Hello, I am confused by one very basic concept :). In the following rtx
> > expression, if const_int is 32-bit and DImode is 64-bit, will the const_int
> > be sign-extended or zero-extended. In other words, is the content of reg:DI 95
> > 0xfffffffffffffff9 or 0x00000000fffffff9 after this instruction?
> >
> > (set:DI (reg:DI 95) (const_int -7 [0xfffffff9]))
> >
> > Thanks, Bingfeng Mei
>
> IIUC in the absence of any explicit extension operation, a const_int is
> taken to be whatever size the object it is assigned to, with the value given
> by the signed decimal interpretation. That RTL sets reg 95 to a DImode -7.
>
> Is this part of a larger problem?
>
> cheers,
> DaveK
RE: Typo or intended?
That's fine. It seems that other targets don't have such an issue. Our target is too special and it is still a private port. I can just use optimize < 1 here.

Thanks,
Bingfeng

> -----Original Message-----
> From: Vladimir Makarov [mailto:vmaka...@redhat.com]
> Sent: 23 March 2009 19:40
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: Typo or intended?
>
> Bingfeng Mei wrote:
> > Hello,
> > I just updated our port to include the last 2-3 weeks of GCC
> > development. I noticed a large number of test failures at -O1 that
> > use a user-defined data type (based on a special register file of our
> > processor). All variables of such a type are now spilled to memory,
> > which we don't allow at -O1 because it is too expensive. After
> > investigation, I found that the following new code causes the
> > trouble. I don't quite understand the function of the new code, but I
> > don't see what's special about -O1 in terms of register allocation in
> > comparison with higher optimization levels. If I change it to
> > (optimize < 1), everything is fine as before. I started to wonder
> > whether (optimize <= 1) is a typo or intended. Thanks in advance.
> >
> Sorry for the delay with the answer. I was on vacation last week.
>
> As Andrew Haley guessed, it was intended. I thought that improving
> debugging for -O1 is also important (more important than optimization).
> Although the GCC manual says
>
>     With `-O', the compiler tries to reduce code size and execution
>     time, without performing any optimizations that take a great deal
>     of compilation time.
>
> it also says
>
>     `-O' also turns on `-fomit-frame-pointer' on machines where doing
>     so does not interfere with debugging.
>
> Therefore I've decided to do an analogous thing for the patch. Maybe I
> am wrong. We could do this only for -O0 if people really want it, which
> I am not sure about.
>
> > Cheers,
> > Bingfeng Mei
> > Broadcom UK
> >
> >       if ((! flag_caller_saves && ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
> >           /* For debugging purposes don't put user defined variables in
> >              callee-clobbered registers.  */
> >           || (optimize <= 1                    <- why include -O1?
> >               && (attrs = REG_ATTRS (regno_reg_rtx[ALLOCNO_REGNO (a)])) != NULL
> >               && (decl = attrs->decl) != NULL
> >               && VAR_OR_FUNCTION_DECL_P (decl)
> >               && ! DECL_ARTIFICIAL (decl)))
> >         {
> >           IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
> >                             call_used_reg_set);
> >           IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
> >                             call_used_reg_set);
> >         }
> >       else if (ALLOCNO_CALLS_CROSSED_NUM (a) != 0)
> >         {
> >           IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
> >                             no_caller_save_reg_set);
> >           IOR_HARD_REG_SET (ALLOCNO_TOTAL_CONFLICT_HARD_REGS (a),
> >                             temp_hard_reg_set);
> >           IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
> >                             no_caller_save_reg_set);
> >           IOR_HARD_REG_SET (ALLOCNO_CONFLICT_HARD_REGS (a),
> >                             temp_hard_reg_set);
> >         }
gcc99 inlining rules
Hello,
I found the following code doesn't compile with gcc 4.4 and -std=c99. Does this behaviour conform to the standard?

inline int foo(){
    return 10;
}

int main(int argc, char **argv){
    return foo();
}

I googled the C99 inlining rules as follows. They don't seem to say such code cannot be compiled.

C99 inline rules

The specification for "inline" is section 6.7.4 of the C99 standard (ISO/IEC 9899:1999). This isn't freely available, but you can buy a PDF of it from ISO relatively cheaply.

* A function where all the declarations (including the definition) mention "inline" and never "extern". There must be a definition in the same translation unit. No stand-alone object code is emitted. You can (must?) have a separate (not inline) definition in another translation unit, and the compiler might choose either that or the inline definition. Such functions may not contain modifiable static variables, and may not refer to static variables or functions elsewhere in the source file where they are declared.

* A function where at least one declaration mentions "inline", but where some declaration doesn't mention "inline" or does mention "extern". There must be a definition in the same translation unit. Stand-alone object code is emitted (just like a normal function) and can be called from other translation units in your program. The same constraint about statics above applies here, too.

* A function defined "static inline". A local definition may be emitted if required. You can have multiple definitions in your program, in different translation units, and it will still work. This is the same as the GNU C rules.

Cheers,
Bingfeng Mei
RE: gcc99 inlining rules
Link error:

/tmp/ccqpP1D1.o: In function `main':
tst.c:(.text+0x15): undefined reference to `foo'
collect2: ld returned 1 exit status

As Joseph said, I found the original text in the C99 standard, section 6.7.4:

"EXAMPLE The declaration of an inline function with external linkage can result in either an external definition, or a definition available for use only within the translation unit. A file scope declaration with extern creates an external definition. The following example shows an entire translation unit.

inline double fahr(double t)
{
    return (9.0 * t) / 5.0 + 32.0;
}

inline double cels(double t)
{
    return (5.0 * (t - 32.0)) / 9.0;
}

extern double fahr(double);   // creates an external definition

double convert(int is_fahr, double temp)
{
    /* A translator may perform inline substitutions */
    return is_fahr ? cels(temp) : fahr(temp);
}

Note that the definition of fahr is an external definition because fahr is also declared with extern, but the definition of cels is an inline definition. Because cels has external linkage and is referenced, an external definition has to appear in another translation unit (see 6.9); the inline definition and the external definition are distinct and either may be used for the call."

I understand now that the GCC implementation conforms to C99, but I don't see the rationale behind it :-). Anyway, this is not a gcc dev question any more.

Cheers,
Bingfeng

> -----Original Message-----
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: 31 March 2009 15:32
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: gcc99 inlining rules
>
> On Tue, Mar 31, 2009 at 4:24 PM, Bingfeng Mei wrote:
> > Hello,
> > I found the following code doesn't compile with gcc 4.4 and
> > -std=c99. Does this behaviour conform to the standard?
> >
> > inline int foo(){
> >     return 10;
> > }
> >
> > int main(int argc, char **argv){
> >     return foo();
> > }
>
> It works for me. What is your error?
>
> Richard.