[VTA merge] Some dwarf problems
Hi Alexandre, I was having some trouble with DWARF sections in the picochip port. I am not a DWARF expert, but when I looked at the changes in r151312, file dwarf2out.c, function dwarf2out_var_location on line 17965, we have sprintf (loclabel, "%s-1", last_label); ... What is last_label-1 supposed to point to? Thanks for your help. Hari
Re: [VTA merge] Some dwarf problems
Thanks for the pointer, Jakub. Cheers Hari Jakub Jelinek wrote: On Mon, Sep 21, 2009 at 05:04:27PM +0100, Hariharan wrote: Hi Alexandre, I was having some trouble with DWARF sections in the picochip port. I am not a DWARF expert, but when I looked at the changes in r151312, file dwarf2out.c, function dwarf2out_var_location on line 17965, we have sprintf (loclabel, "%s-1", last_label); ... What is last_label-1 supposed to point to? See http://gcc.gnu.org/ml/gcc-patches/2009-06/msg01317.html for details. 1 byte before the last_label label (which is usually right after a call insn). The intent is to have something in the middle of a call insn. Jakub
Re: fbranch-probabilities bug
Hi Seongbae, Does that mean that someone can't use the profile just to annotate branches (and get better code from that), without having to take on the additional baggage of "unroll-loops", "peel-loops" etc.? In my case, I am interested in not bloating the code size, but getting any performance that is to be had from profiling. Is that possible? Note: my profile-generate phase was also just -fprofile-arcs, since I am not interested in other kinds of profile. Cheers Hari Seongbae Park wrote: This is the intended behavior, though now I see that the documentation isn't very clear. You need to use -fprofile-use - the typical usage scenario is to compile with -fprofile-generate to build an executable to do profile collection, and then compile with -fprofile-use to build optimized code using the profile data. Seongbae On Thu, Jan 8, 2009 at 6:30 AM, Hariharan Sandanagobalane wrote: Hi Seongbae, I was doing some work on profiling for picochip when I noticed what looks to me like a bug. It looks to me that using -fbranch-probabilities on the command line (after a round of -fprofile-generate or -fprofile-arcs) would just not work on any target. The reason: coverage.c:1011 has if (flag_profile_use) read_counts_file (); Should this not be if (flag_profile_use || flag_branch_probabilities) /* maybe more flags */ read_counts_file (); ? Of course, I hit the problem later on: since the counts were not read, it just assumed that the .gcda file was not available, when it actually was. Thanks Hari
Re: fbranch-probabilities bug
Seongbae Park wrote: On Thu, Jan 8, 2009 at 10:11 AM, Hariharan wrote: Hi Seongbae, Does that mean that someone can't use the profile just to annotate branches (and get better code from that), without having to take on the additional baggage of "unroll-loops", "peel-loops" etc.? You can do that by selectively turning optimizations off (e.g. -fprofile-use -fno-unroll-loops -fno-peel-loops). In my case, I am interested in not bloating the code size, but getting any performance that is to be had from profiling. Is that possible? Note: my profile-generate phase was also just -fprofile-arcs, since I am not interested in other kinds of profile. Have you measured the impact on the performance and the code size from using full -fprofile-generate/-fprofile-use? Well, no... I cannot. I have just about managed to get -fprofile-arcs and -fbranch-probabilities to work with picochip. The code runs under a simulator, and I have had to hack both GCC code and libgcov code to get the simulator to output the profile in the format that would be acceptable to GCC in the second run. Doing the same with the additional profiles is going to be a hard task. It is a target that has MEM processor versions, which have 6KB (yes, KB, not MB or GB) of instruction memory, at best. So, you can understand why code size is very important to us. Anyway, it looks to me that we might get very little performance benefit without bloating the code with PBO, so that makes it very unattractive for me to do anything along this line. By the way, your changes to smooth the profile information in GCC 4.4 helped a lot. Where I had 13 profiling tests (drawn from the GCC dejagnu testsuite) failing in GCC 4.3.2 with "corrupted profile info" messages, it got down to just one failure in GCC 4.4. Thanks for that. Cheers Hari If yes, and you have seen any performance degradation or unnecessary code bloat from other optimization, please file a bug.
If not, then I'd say you probably want to try measuring it - in particular, value profiling has been becoming more and more useful. And in my experience, majority of the code size increase as well as the performance benefit with -fprofile-use comes from extra inlining (which -fprofile-arcs then -fbranch-probabilities also enable). Seongbae
GCC Profile base optimizations using simulator profile
Hi, I just wanted to see if there are others out there who get profile information from a simulator and feed that information back for GCC's PBO, in the .gcda format. I had tried this on picoChip, by changing the instrumentation code in GCC for fprofile-arcs and got edge profile working quite well (but GCC 4.4 would not accept just edge profile). I have never attempted to try others (indirect call, value profile etc), but would like to know the results of anyone who might have tried. Thanks. Hari
Inline limits
Hi, I ran into some code-size/stack-size bloat using -Os for a piece of code. This seemed to happen only when certain single-call-site functions are defined "static", and not otherwise. On investigating further, I see that inline_functions_called_once seems to rely only on cgraph_check_inline_limits, whereas other inlining code goes through a more rigorous cost-benefit analysis to decide on inlining (especially with INLINE_SIZE). I have been looking at resetting some of the parameters used in cgraph_check_inline_limits for inlining for picochip. I could not understand the way PARAM_LARGE_FUNCTION_GROWTH and PARAM_STACK_FRAME_GROWTH are used in this function. Both of these parameters are used as a fraction of the bigger (or "to") function. I want to be able to say: if the inlining would increase the code size or stack frame size, don't inline; otherwise, go ahead and inline. Of course, I am compiling this code at -Os, so this condition is probably obvious. Can you advise me on how to use these parameters to do that? A side question: are 'static' single-call-site functions always inlined? I would hope not (under -Os), but just checking. Thanks Hari PS: If this were to be considered a "bug", I will file a report with a testcase.
scheduler dependency bug in the presence of var_location and unspec_volatile
Hello, I saw a bug in sched1 where it reorders two unspec_volatile instructions. These instructions do port communications (on the same port), and executing them in the wrong order is unacceptable. I dug a bit deeper to see what is happening. Going into sched1, the relevant bit of the basic block is:

(debug_insn 184 183 185 12 autogenerated_UlSymbolRateCtrlDummy.c:58 (var_location:SI converter$rawValue (unspec_volatile:SI [
            (const_int 3 [0x3])
        ] 8)) -1 (nil))
(insn 185 184 186 12 /home/gccuser/systems/products/lib/umtsfdd/rel8_200903/uplink/UlSymbolRate/src/UlSymbolRateCtrlDummy.c:58 (set (subreg:SI (reg/v:DI 299 [ trchHeader ]) 0)
        (unspec_volatile:SI [
                (const_int 3 [0x3])
            ] 8)) 80 {commsGet} (nil))
(note 186 185 188 12 NOTE_INSN_DELETED)
(note 188 186 189 12 NOTE_INSN_DELETED)
(insn 189 188 190 12 /home/gccuser/systems/products/lib/umtsfdd/rel8_200903/uplink/UlSymbolRate/src/UlSymbolRateCtrlDummy.c:58 (set (reg:HI 280 [ trchHeader$D1530$channelCodingEnum ])
        (lshiftrt:HI (subreg:HI (reg/v:DI 299 [ trchHeader ]) 0)
            (const_int 14 [0xe]))) 64 {lshrhi3} (nil))
(debug_insn 190 189 191 12 (var_location:QI trchHeader$D1530$channelCodingEnum (subreg:QI (reg:HI 280 [ trchHeader$D1530$channelCodingEnum ]) 0)) -1 (nil))
(debug_insn 191 190 192 12 (var_location:QI trchHeader$D1530$channelCodingEnum (subreg:QI (reg:HI 280 [ trchHeader$D1530$channelCodingEnum ]) 0)) -1 (nil))
(note 192 191 193 12 NOTE_INSN_DELETED)
(debug_insn 193 192 194 12 autogenerated_UlSymbolRateCtrlDummy.c:58 (var_location:SI converter$rawValue (unspec_volatile:SI [
            (const_int 3 [0x3])
        ] 8)) -1 (nil))
(insn 194 193 195 12 /home/gccuser/systems/products/lib/umtsfdd/rel8_200903/uplink/UlSymbolRate/src/UlSymbolRateCtrlDummy.c:59 (set (subreg:SI (reg/v:DI 299 [ trchHeader ]) 4)
        (unspec_volatile:SI [
                (const_int 3 [0x3])
            ] 8)) 80 {commsGet} (nil))

Note that 185 and 194 are the actual port communication instructions here.
If I look at the scheduler forward dependency dump for this basic block (at sched1), it looks like this:

;; ==
;; -- basic block 12 from 185 to 212 -- before reload
;; ==
;; --- forward dependences:
;; --- Region Dependences --- b 12 bb 0
;;   insn  code  bb  dep  prio  cost  reservation
;;   ----  ----  --  ---  ----  ----  -----------
;;   185    80   12   0    2     1    slot1        : 212 193 191 190 189
;;   189    64   12   1    1     1    slot0|slot1  : 212 193 191 190
;;   190    -1   12   2    0     0    nothing      : 193 191
;;   191    -1   12   3    0     0    nothing      : 193
;;   193    -1   12   4    0     0    nothing      : 199 194
;;   194    80   12   0    5     1    slot1        : 212 206 205 204 203 202 201 200 199 198
;;   198    64   12   1    4     1    slot0|slot1  : 212 206 202 200 199
;;   199    -1   12   3    0     0    nothing      : 206 200
;;   200    -1   12   3    0     0    nothing      : 206 201
;;   201    -1   12   2    0     0    nothing      : 206 202
;;   202    -1   12   3    0     0    nothing      : 206 203
;;   203    -1   12   2    0     0    nothing      : 206 204
;;   204    -1   12   2    0     0    nothing      : 206 205
;;   205    -1   12   2    0     0    nothing      : 207 206
;;   206    82   12   2    3     1    slot1        : 212 210 209 208 207
;;   207    -1   12   2    0     0    nothing      : 210 208
;;   208    -1   12   2    0     0    nothing      : 210 209
;;   209    -1   12   2    0     0    nothing      : 211 210
;;   210    82   12   1    2     1    slot1        : 212 211
;;   211    -1   12   2    0     0    nothing      :
;;   212     7   12   6    1     1    (slot0+slot1+slot2) :

;; dependencies resolved: insn 185
;; tick updated: insn 185 into ready
;; dependencies resolved: insn 194
;; tick updated: insn 194 into ready
;; Advanced a state.
;; Ready list after queue_to_ready:    194:87  185:82
;; Ready list after ready_sort:    185:82  194:87
;; Clock 0
;; Ready list (t = 0):    185:82  194:87
;; Chosen insn : 194
;; 0--> 194 r299#4=unspec/v[0x3] 8:slot1
;; resetting: debug insn 193

Note that there is a dependency 185->193->194. Insn 193 is a debug_insn for var_location. When we actually get to scheduling, we seem to ignore this dependency and put both 185 and 194 into the ready state; 194 gets picked, causing my test to go wrong. I do not have much experience working
Machine description question
Hello all, Picochip has communication instructions that allow one array element to pass data to another. There are 3 such instructions: PUT/GET/TSTPORT. Currently, all three of these use UNSPEC_VOLATILE side-effect expressions to make sure they don't get reordered. But I wonder if it is overkill to use UNSPEC_VOLATILE for this purpose and whether I should use UNSPEC instead. The only thing we care about here is that they aren't reordered with respect to each other. It is okay for other instructions to move around the communication instructions (as long as normal scheduler dependencies are taken care of). There are two things I could do: 1. Introduce an implicit dependency between all communication instructions by adding a use/clobber of an imaginary register. 2. Introduce an explicit dependency between them by using some target hook to add dependency links. I have not found an appropriate target hook to do this. Can you tell me which one I should try? Has anyone tried doing anything similar? Any pointers/suggestions on this will be greatly appreciated. Thanks Hari
delay branch bug?
Hello all, I found something a little odd with delay slot scheduling. Suppose I had the following bit of code (note that the "get" builtin functions in picochip stand for port communication):

int mytest ()
{
  int a[5];
  int i;
  for (i = 0; i < 5; i++)
    {
      a[i] = (int) getctrlIn();
    }
  switch (a[3])
    {
    case 0:
      return 4;
    default:
      return 13;
    }
}

The relevant bit of assembly, compiled at -Os, is:

_L2:
        GET 0,R[5:4]     // R[5:4] := PORT(0)
_picoMark_LBE5=
_picoMark_LBE4=
        .loc 1 13 0
        STW R4,(R3)0     // Mem((R3)0{byte}) := R4
        ADD.0 R3,2,R3    // R3 := R3 + 2 (HI)
        .loc 1 11 0
        SUB.0 R3,R2,r15  // CC := (R3!=R2)
        BNE _L2
=->     LDW (FP)3,R5     // R5 = Mem((FP)6{byte})
        .loc 1 22 0

=-> is the delay slot marker. Note that the LDW instruction has been moved into the delay slot. This corresponds to the load in the "switch (a[3])" statement above. The first three times around this loop, LDW would be loading uninitialised memory. The loaded value is ignored until we come out of the loop, so the code is functionally correct, but I am not sure it is good for the compiler to introduce an uninitialised memory access when there was none in the source. I browsed around the delay branch code in reorg.c, but couldn't find anything that checks for this. Is this the intended behaviour? Can anyone familiar with the delay branch code help? Thanks Hari
Re: [Bug rtl-optimization/44013] VTA produces wrong code
Hi Jakub, I have not had any response from Alexandre on this yet, and I haven't had much luck on the mailing list either (http://gcc.gnu.org/ml/gcc/2010-04/msg00917.html). Is there anyone else familiar with VTA who could help? Thanks Hari jakub at gcc dot gnu dot org wrote:
Re: New picoChip port and maintainers
Thanks to the GCC SC for accepting the picochip port. Regards Hari David Edelsohn wrote: I am pleased to announce that the GCC Steering Committee has accepted the picoChip port for inclusion in GCC and appointed Hariharan Sandanagobalane and Daniel Towner as port maintainers. The initial patch needs approval from a GCC GWP maintainer before it may be committed. Please join me in congratulating Hari and Daniel on their new role. Please update your listing in the MAINTAINERS file. Happy hacking! David
Re: New picoChip port and maintainers
Hi David/SC, Thanks again for accepting the picochip port in GCC. Although the picochip port has been accepted by the Steering Committee, we have had trouble getting a GWP maintainer to review the port. All the GWP maintainers seem to be extremely busy. I have emailed all of them, but haven't been successful in getting a review. In light of this, would it be possible for the SC to allow the port to be reviewed by other port maintainers? Regards Hari David Edelsohn wrote: I am pleased to announce that the GCC Steering Committee has accepted the picoChip port for inclusion in GCC and appointed Hariharan Sandanagobalane and Daniel Towner as port maintainers. The initial patch needs approval from a GCC GWP maintainer before it may be committed. Please join me in congratulating Hari and Daniel on their new role. Please update your listing in the MAINTAINERS file. Happy hacking! David
Re: Optimising for size
Hi Joel, I ran into a similar problem moving from 4.2.2 to 4.3.0. I looked into it a bit and found that the 4.3 compiler inlines more aggressively than the 4.2.x compiler. The reason was that the following two lines were removed from opts.c: set_param_value ("max-inline-insns-single", 5); set_param_value ("max-inline-insns-auto", 5); Of course, other changes were made to make sure code size didn't increase with this change. But those changes depend on PARAM_INLINE_CALL_COST. The default of 16 was too high for our target (picochip). You might want to try reducing this value and see if your code-size woes go away. Regards Hari Joe Buck wrote: On Mon, Jul 14, 2008 at 10:04:08AM +1000, [EMAIL PROTECTED] wrote: I have a piece of C code. The code, compiled to an ARM THUMB target using gcc 4.0.2 with -Os, results in 230 instructions. The exact same code, using the exact same switches, compiles to 437 instructions with gcc 4.3.1. Considering that the compiler optimises for size and the much newer compiler emits almost twice as much code as the old one, I think it is an issue. Agreed. I think it's a regression. Using -Os and getting much larger code would qualify. So the question is, how should I report it? Open a PR with the complete test case, and the command line options you used with 4.0.2 and 4.3.1. Please cc me on the PR. I would like to track this one, and if you provide a preprocessed test case I can quickly check the size on 3.2.3, 4.1.1, 4.2.4, 4.3.1 and the trunk. Use joel AT gcc DOT gnu.org Thanks. -- Joel Sherrill, Ph.D. Director of Research & Development [EMAIL PROTECTED] On-Line Applications Research Ask me about RTEMS: a free RTOS Huntsville AL 35805 Support Available (256) 722-9985
size of array 'a' is too large
Hello, I see that in x86 GCC you can define a structure with struct trial { long a[10000]; }; whereas on a 16-bit target (picochip) you cannot define struct trial { long a[10000]; }; In that case, I get a "size of array 'a' is too large" error. The thing that took me by surprise was: if I split the structure into struct trial { long a[5000]; long b[5000]; }; it works fine. I looked around the mailing list a bit. This issue seems to have been raised a few times before, but I couldn't find any definitive answer. Is this a bug in GCC? Do I file a report? Cheers Hari
unsigned comparison warning
Hello, I found something rather strange with the unsigned comparison warnings in GCC. If I had unsigned char a; int foo () { if (a >= 0) return 0; else return 1; } and I did gcc -O2 -c trial.c, then I get a warning: trial.c:6: warning: comparison is always true due to limited range of data type It works the same way if I use an unsigned short. But if I use unsigned int/long, I don't get this warning. This is on x86. Is there an explanation for this? Cheers Hari
Re: unsigned comparison warning
Thanks Ian. I will raise this on the gcc-help mailing list. Cheers Hari Ian Lance Taylor wrote: Hariharan <[EMAIL PROTECTED]> writes: I found something rather strange with the unsigned comparison warnings in GCC. This is the wrong mailing list. The mailing list gcc@gcc.gnu.org is for gcc developers. The mailing list [EMAIL PROTECTED] is for questions about using gcc. Please take any followups to [EMAIL PROTECTED] Thanks. and I did gcc -O2 -c trial.c, then I get a warning trial.c:6: warning: comparison is always true due to limited range of data type It works the same way if I used an unsigned short. But, if I use unsigned int/long, I don't get this warning. This is on x86. Is there an explanation for this? You neglected to mention the version of gcc. In current gcc, I don't see any warning when using "gcc -O2 -c trial.c". I see a warning for both "unsigned char" and "unsigned int" when I add the -Wextra option. Ian
fbranch-probabilities bug
Hi Seongbae, I was doing some work on profiling for picochip when I noticed what looks to me like a bug. It looks to me that using -fbranch-probabilities on the command line (after a round of -fprofile-generate or -fprofile-arcs) would just not work on any target. The reason: coverage.c:1011 has if (flag_profile_use) read_counts_file (); Should this not be if (flag_profile_use || flag_branch_probabilities) /* maybe more flags */ read_counts_file (); ? Of course, I hit the problem later on: since the counts were not read, it just assumed that the .gcda file was not available, when it actually was. Thanks Hari
pr39339 - invalid testcase or SRA bug?
Hi, Since r144598, pr39339.c has been failing on picochip. On investigation, it looks to me that the testcase is invalid. Relevant source code:

struct C
{
  unsigned int c;
  struct D
  {
    unsigned int columns : 4;
    unsigned int fore : 9;
    unsigned int back : 9;
    unsigned int fragment : 1;
    unsigned int standout : 1;
    unsigned int underline : 1;
    unsigned int strikethrough : 1;
    unsigned int reverse : 1;
    unsigned int blink : 1;
    unsigned int half : 1;
    unsigned int bold : 1;
    unsigned int invisible : 1;
    unsigned int pad : 1;
  } attr;
};

struct A
{
  struct C *data;
  unsigned int len;
};

struct B
{
  struct A *cells;
  unsigned char soft_wrapped : 1;
};

struct E
{
  long row, col;
  struct C defaults;
};

__attribute__ ((noinline)) void
foo (struct E *screen, unsigned int c, int columns, struct B *row)
{
  struct D attr;
  long col;
  int i;
  col = screen->col;
  attr = screen->defaults.attr;
  attr.columns = columns;
  row->cells->data[col].c = c;
  row->cells->data[col].attr = attr;
  col++;
  attr.fragment = 1;
  for (i = 1; i < columns; i++)
    {
      row->cells->data[col].c = c;
      row->cells->data[col].attr = attr;
      col++;
    }
}

int
main (void)
{
  struct E e = {.row = 5,.col = 0,.defaults =
                  {6, {-1, -1, -1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0}} };
  struct C c[4];
  struct A a = { c, 4 };
  struct B b = { &a, 1 };
  struct D d;
  __builtin_memset (&c, 0, sizeof c);
  foo (&e, 65, 2, &b);
  d = e.defaults.attr;
  d.columns = 2;
  if (__builtin_memcmp (&d, &c[0].attr, sizeof d))
    __builtin_abort ();
  d.fragment = 1;
  if (__builtin_memcmp (&d, &c[1].attr, sizeof d))
    __builtin_abort ();
  return 0;
}

In picochip, PCC_BITFIELD_TYPE_MATTERS is set and int is 16 bits, so structure D becomes 6 bytes, with 3 bits of padding between fore and back.
At SRA the code becomes:

;; Function foo (foo)

foo (struct E * screen, unsigned int c, int columns, struct B * row)
{
  unsigned int attr$B32F16;
  attr$B26F6;
  attr$back;
  attr$fore;
  attr$fragment;
  int i;
  long int col;
  struct C * D.1267;
  unsigned int D.1266;
  unsigned int D.1265;
  struct C * D.1264;
  struct A * D.1263;
  D.1262;
  unsigned char D.1261;

:
  col_4 = screen_3(D)->col;
  attr$B32F16_36 = BIT_FIELD_REF defaults.attr, 16, 32>;
  attr$B26F6_37 = BIT_FIELD_REF defaults.attr, 6, 26>;
  attr$back_38 = screen_3(D)->defaults.attr.back;
  attr$fore_39 = screen_3(D)->defaults.attr.fore;
  attr$fragment_40 = screen_3(D)->defaults.attr.fragment;
  D.1261_6 = (unsigned char) columns_5(D);
  D.1262_7 = () D.1261_6;
  D.1263_9 = row_8(D)->cells;
  D.1264_10 = D.1263_9->data;
  D.1265_11 = (unsigned int) col_4;
  D.1266_12 = D.1265_11 * 8;
  D.1267_13 = D.1264_10 + D.1266_12;
  D.1267_13->c = c_14(D);
  BIT_FIELD_REF attr, 16, 32> = attr$B32F16_36;
  BIT_FIELD_REF attr, 6, 26> = attr$B26F6_37;
  D.1267_13->attr.back = attr$back_38;
  D.1267_13->attr.fore = attr$fore_39;
  D.1267_13->attr.fragment = attr$fragment_40;
  D.1267_13->attr.columns = D.1262_7;
  col_20 = col_4 + 1;
  if (columns_5(D) > 1)
    goto ;
  else
    goto ;

:
  # col_29 = PHI
  # i_30 = PHI
  D.1265_24 = (unsigned int) col_29;
  D.1266_25 = D.1265_24 * 8;
  D.1267_26 = D.1264_10 + D.1266_25;
  D.1267_26->c = c_14(D);
  BIT_FIELD_REF attr, 16, 32> = attr$B32F16_36;
  BIT_FIELD_REF attr, 6, 26> = attr$B26F6_37;
  D.1267_26->attr.back = attr$back_38;
  D.1267_26->attr.fore = attr$fore_39;
  D.1267_26->attr.fragment = 1;
  D.1267_26->attr.columns = D.1262_7;
  col_32 = col_29 + 1;
  i_33 = i_30 + 1;
  if (columns_5(D) > i_33)
    goto ;
  else
    goto ;

:
  return;
}

;; Function main (main)

main ()
{
  struct D d;
  struct B b;
  struct A a;
  struct C c[4];
  struct E e;
  int D.1279;
  int D.1276;

:
  e.row = 5;
  e.col = 0;
  e.defaults.c = 6;
  e.defaults.attr.columns = 15;
  e.defaults.attr.fore = 511;
  e.defaults.attr.back = 511;
  e.defaults.attr.fragment = 1;
  e.defaults.attr.standout = 0;
  e.defaults.attr.underline = 1;
  e.defaults.attr.strikethrough = 0;
  e.defaults.attr.reverse = 1;
  e.defaults.attr.blink = 0;
  e.defaults.attr.half = 1;
  e.defaults.attr.bold = 0;
  e.defaults.attr.invisible = 1;
  e.defaults.attr.pad = 0;
  a.data = &c;
  a.len = 4;
  b.cells = &a;
  b.soft_wrapped = 1;
  __builtin_memset (&c, 0, 32);
  foo (&e, 65, 2, &b);
  d = e.defaults.attr;
  d.columns = 2;
  D.1276_1 = __builtin_memcmp (&d, &c[0].attr, 6);
  if (D.1276_1 != 0)
    goto ;
  else
    goto ;

:
  __builtin_abort ();

:
  d.fragment = 1;
  D.1279_2 = __builtin_memcmp (&d, &c[1].attr, 6);
  if (D.1279_2 != 0)
    goto ;
  else
    goto ;

:
  __builtin_abort ();

:
  return 0;
}

Note that the padding bits (13..16) are not copied over in bb_2 in function foo. main then does a memcmp, which fails because the padding bits are different. From the C99 standard (p328): "265) The contents of 'holes' used as padding for purposes of alignment within structure objects are indeterminate."
Re: pr39339 - invalid testcase or SRA bug?
Yes, if I change the structure to bring the three 1-bit members forward, to avoid padding, the testcase does pass. Thanks to both of you for your help. Cheers Hari Jakub Jelinek wrote: On Tue, Mar 10, 2009 at 01:44:11PM +, Hariharan Sandanagobalane wrote: Since r144598, pr39339.c has been failing on picochip. On investigation, it looks to me that the testcase is invalid. Relevant source code: struct C { unsigned int c; struct D { unsigned int columns : 4; unsigned int fore : 9; unsigned int back : 9; As the testcase fails with buggy (pre-r144598) gcc and succeeds after even with: unsigned int fore : 12; unsigned int back : 6; instead of :9, :9, I think we could change it (does it succeed on picochip then?). Or move it to gcc.dg/torture/ and run it only on int32plus targets. Or add if (sizeof (int) != 4 || sizeof (struct D) != 4) return 0; to the beginning of main. Jakub
Re: fbranch-probabilities bug
Seongbae Park wrote: This is the intended behavior, though now I see that the documentation isn't very clear. Can you fix the documentation? As it stands now, it is easy for a user to be misled into thinking the -fprofile-arcs and -fbranch-probabilities combination would work. Just out of curiosity, what is the downside to letting people use -fbranch-probabilities without -fprofile-use? Cheers Hari You need to use -fprofile-use - the typical usage scenario is to compile with -fprofile-generate to build an executable to do profile collection, and then compile with -fprofile-use to build optimized code using the profile data. Seongbae On Thu, Jan 8, 2009 at 6:30 AM, Hariharan Sandanagobalane wrote: Hi Seongbae, I was doing some work on profiling for picochip when I noticed what looks to me like a bug. It looks to me that using -fbranch-probabilities on the command line (after a round of -fprofile-generate or -fprofile-arcs) would just not work on any target. The reason: coverage.c:1011 has if (flag_profile_use) read_counts_file (); Should this not be if (flag_profile_use || flag_branch_probabilities) /* maybe more flags */ read_counts_file (); ? Of course, I hit the problem later on: since the counts were not read, it just assumed that the .gcda file was not available, when it actually was. Thanks Hari
Re: Machine description question
Thanks for your help, Bingfeng. I gave this a go and ended up with worse code (and worse memory usage) than before. I started this experiment because of the compiler's "all virtual registers are assumed to be used and clobbered by unspec_volatile" rule. The get/put instructions read/write registers, and the virtual register assigned to them interferes with all the virtual registers in the function, so they were highly likely to be spilled to the stack. I wanted to avoid this by introducing unspecs and using imaginary registers. But the virtual registers involved in unspec patterns with these imaginary registers still seem to be marked as interfering with all the virtual registers. Is that to be expected? Am I missing something obvious here? Regards Hari Bingfeng Mei wrote: Our architecture has a similar resource, and we use the first approach, creating an imaginary register and a dependency between these instructions, i.e., every such instruction reads and writes the special register to create an artificial dependency. You may need to add an (unspec:..) as an independent expression in your pattern to prevent some wrong optimizations. Cheers, Bingfeng -Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Hariharan Sent: 12 May 2010 11:18 To: gcc@gcc.gnu.org Subject: Machine description question Hello all, Picochip has communication instructions that allow one array element to pass data to another. There are 3 such instructions PUT/GET/TSTPORT. Currently, all three of these use UNSPEC_VOLATILE side-effect expressions to make sure they don't get reordered.
It is okay for other instructions to move around the communication instructions (as long as normal scheduler dependencies are taken care of). There are possibly one of two things i can do. 1. Introduce an implicit dependency between all communication instructions by adding a use/clobber of an imaginary register. 2. Introduce explicit dependency between them by using some target hook to add dependency links. I have not found any appropriate target hook to do this. Can you tell me which one i should try? Has anyone tried doing anything similar? Any pointers/suggestions on this will be greatly appreciated. Thanks Hari
Re: Machine description question
The patterns for PUT/GET were:

; Scalar Put instruction.
(define_insn "commsPut"
  [(unspec_volatile [(match_operand:HI 0 "const_int_operand" "")
                     (match_operand:SI 1 "register_operand" "r")]
                    UNSPEC_PUT)]
  ""
  "PUT %R1,%0\t// PORT(%0) := %R1"
  [(set_attr "type" "comms")
   (set_attr "length" "2")])

(define_insn "commsGet"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec_volatile:SI [(match_operand:HI 1 "immediate_operand" "n")]
                            UNSPEC_GET))]
  ""
  "GET %1,%R0\t// %R0 := PORT(%1)"
  [(set_attr "type" "comms")
   (set_attr "length" "2")])

I changed them to:

; Scalar Put instruction.
(define_insn "commsPut"
  [(unspec [(match_operand:HI 0 "const_int_operand" "")
            (match_operand:SI 1 "register_operand" "r")]
           UNSPEC_PUT)
   (use (reg:HI DUMMY_COMMN_REGNUM))
   (clobber (reg:HI DUMMY_COMMN_REGNUM))]
  ""
  "PUT %R1,%0\t// PORT(%0) := %R1"
  [(set_attr "type" "comms")
   (set_attr "length" "2")])

; Simple scalar get.
(define_insn "commsGet"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(match_operand:HI 1 "immediate_operand" "n")]
                   UNSPEC_GET))
   (use (reg:HI DUMMY_COMMN_REGNUM))
   (clobber (reg:HI DUMMY_COMMN_REGNUM))]
  ""
  "GET %1,%R0\t// %R0 := PORT(%1)"
  [(set_attr "type" "comms")
   (set_attr "length" "2")])

As for DUMMY_COMMN_REGNUM, I just defined it as a fixed register and bumped up FIRST_PSEUDO_REGISTER. Actually, there is one more problem I faced (other than performance): the code generated using unspecs was just plain wrong. The unspec pattern I was using for GET, which was inside a loop, was being hoisted out of the loop by the loop optimizer. I guess I should have seen this coming, since unspec is just a "machine-specific" operation, and the optimizer probably rightly assumes that multiple executions of it with the same operands would produce the same value. This obviously is not the case for these communication instructions. Do you have your code to do this using unspec in GCC mainline? Can you point me to that, please?
Thanks Hari Bingfeng Mei wrote: How do you define your imaginary register in target.h? Can you post one example of your instruction pattern? Bingfeng -Original Message- From: Hariharan Sandanagobalane [mailto:harihar...@picochip.com] Sent: 12 May 2010 16:40 To: Bingfeng Mei Cc: gcc@gcc.gnu.org Subject: Re: Machine description question Thanks for your help BingFeng. I gave this a go and ended up with worse code (and worse memory usage) than before. I started with this experiment because of the compilers "All virtual registers are assumed to be used and clobbered by unspec_volatile" rule. The get/put instructions read/write to registers and the virtual register assigned for them interferes with all the virtual registers in the function. So, they were highly likely to be spilled and use stack instead. I wanted to try to avoid this by the introduction of unspec's and use of imaginary registers. But, the virtual registers that are involved in unspec patterns with these imaginary registers still seem to be marked to interfere with all the virtual registers. Is that to be expected? Am i missing something obvious here? Regards Hari Bingfeng Mei wrote: Our architecture has the similar resource, and we use the first approach by creating an imaginary register and dependency between these instructions, i.e., every such instruction reads and write to the special register to create artificial dependency. You may need to add a (unspec:..) as an independent expression in your pattern to prevent some wrong optimizations. Cheers, Bingfeng -Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Hariharan Sent: 12 May 2010 11:18 To: gcc@gcc.gnu.org Subject: Machine description question Hello all, Picochip has communication instructions that allow one array element to pass data to another. There are 3 such instructions PUT/GET/TSTPORT. Currently, all three of these use UNSPEC_VOLATILE side-effect expressions to make sure they don't get reordered. 
But I wonder if it is overkill to use UNSPEC_VOLATILE for this purpose, and whether I should use UNSPEC instead. The only thing we care about here is that they don't
Re: Machine description question
Hi Bingfeng, Changing my instruction patterns to be similar to the ones you sent does get over the correctness issue. Setting the imaginary register explicitly this way and adding those extra unspec expressions does seem to work. But performance-wise it still doesn't give me anything. Did you decide to use these patterns (instead of the simpler unspec_volatile ones) for performance reasons? Does using them gain you anything? Cheers Hari

Bingfeng Mei wrote:
Hari, Here are some patterns similar to yours.

(define_insn "putbx"
  [(set (reg:BXBC R_BX)
        (unspec:BXBC [(match_operand:QI 0 "firepath_register" "vr")] UNSPEC_BXM))
   (unspec:BXBC [(reg:BXBC R_BX)] UNSPEC_BX)]   <--- Important to avoid some wrong optimization (maybe DCE, I can't remember clearly)

(define_insn "getbx"
  [(set (reg:BXBC R_BX)
        (unspec:BXBC [(reg:BXBC R_BX)] UNSPEC_BX))   <--- Artificial dependency
   (set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(reg:BXBC R_BX)] UNSPEC_BXM))
   (unspec:BXBC [(reg:BXBC R_BX)] UNSPEC_BX)]   <--- Important to avoid some optimization.

Our port is still private and not in mainline. Cheers, Bingfeng

-Original Message- From: Hariharan Sandanagobalane [mailto:harihar...@picochip.com] Sent: 13 May 2010 10:17 To: Bingfeng Mei Cc: gcc@gcc.gnu.org Subject: Re: Machine description question

The patterns for PUT/GET were:

; Scalar Put instruction.
(define_insn "commsPut"
  [(unspec_volatile [(match_operand:HI 0 "const_int_operand" "")
                     (match_operand:SI 1 "register_operand" "r")] UNSPEC_PUT)]
  ""
  "PUT %R1,%0\t// PORT(%0) := %R1"
  [(set_attr "type" "comms") (set_attr "length" "2")])

(define_insn "commsGet"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec_volatile:SI [(match_operand:HI 1 "immediate_operand" "n")] UNSPEC_GET))]
  ""
  "GET %1,%R0\t// %R0 := PORT(%1)"
  [(set_attr "type" "comms") (set_attr "length" "2")])

I changed them to:

; Scalar Put instruction.
(define_insn "commsPut"
  [(unspec [(match_operand:HI 0 "const_int_operand" "")
            (match_operand:SI 1 "register_operand" "r")] UNSPEC_PUT)
   (use (reg:HI DUMMY_COMMN_REGNUM))
   (clobber (reg:HI DUMMY_COMMN_REGNUM))]
  ""
  "PUT %R1,%0\t// PORT(%0) := %R1"
  [(set_attr "type" "comms") (set_attr "length" "2")])

; Simple scalar get.
(define_insn "commsGet"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(match_operand:HI 1 "immediate_operand" "n")] UNSPEC_GET))
   (use (reg:HI DUMMY_COMMN_REGNUM))
   (clobber (reg:HI DUMMY_COMMN_REGNUM))]
  ""
  "GET %1,%R0\t// %R0 := PORT(%1)"
  [(set_attr "type" "comms") (set_attr "length" "2")])

As for DUMMY_COMMN_REGNUM, I just defined it as a fixed register and bumped up FIRST_PSEUDO_REGISTER. Actually, there is one more problem I faced (other than performance): the code generated using unspecs was just plain wrong. The unspec pattern I was using for GET, which was inside a loop, was being hoisted out of the loop by the loop optimizer. I guess I should have seen this coming, since unspec is just a "machine-specific" operation, and the optimizer probably rightly assumes that multiple executions of it with the same operands produce the same value. That is obviously not the case for these communication instructions. Do you have your code to do this using unspec in GCC mainline? Can you point me to that, please? Thanks Hari

Bingfeng Mei wrote:
How do you define your imaginary register in target.h? Can you post one example of your instruction pattern? Bingfeng

-Original Message- From: Hariharan Sandanagobalane [mailto:harihar...@picochip.com] Sent: 12 May 2010 16:40 To: Bingfeng Mei Cc: gcc@gcc.gnu.org Subject: Re: Machine description question

Thanks for your help, Bingfeng. I gave this a go and ended up with worse code (and worse memory usage) than before. I started this experiment because of the compiler's "all virtual registers are assumed to be used and clobbered by unspec_volatile" rule.
The get/put instructions read/write to registers and the virtual register assigned for them interferes with all the virtual registers in the function. So, they were highly likely to be spilled and use stack inst
Re: Machine description question
Ours is a VLIW processor too, but my focus was on register allocation. Unfortunately, the instruction with the unspec is still marked as interfering with all virtual registers and hence gets spilled. I was hoping the unspec version might do better there, but there was no change, so I end up with performance similar to the unspec_volatile version. Thanks for your help. Cheers Hari

Bingfeng Mei wrote:
Yes, we use this instead of unspec_volatile out of performance concerns. Our target is a VLIW processor, so there are more opportunities to move instructions around. Did you observe any instruction that should be moved but wasn't? Cheers, Bingfeng

-Original Message- From: Hariharan Sandanagobalane [mailto:harihar...@picochip.com] Sent: 14 May 2010 12:26 To: Bingfeng Mei Cc: gcc@gcc.gnu.org Subject: Re: Machine description question

Hi Bingfeng, Changing my instruction patterns to be similar to the ones you sent does get over the correctness issue. Setting the imaginary register explicitly this way and adding those extra unspec expressions does seem to work. But performance-wise it still doesn't give me anything. Did you decide to use these patterns (instead of the simpler unspec_volatile ones) for performance reasons? Does using them gain you anything? Cheers Hari

Bingfeng Mei wrote:
Hari, Here are some patterns similar to yours.

(define_insn "putbx"
  [(set (reg:BXBC R_BX)
        (unspec:BXBC [(match_operand:QI 0 "firepath_register" "vr")] UNSPEC_BXM))
   (unspec:BXBC [(reg:BXBC R_BX)] UNSPEC_BX)]   <--- Important to avoid some wrong optimization (maybe DCE, I can't remember clearly)

(define_insn "getbx"
  [(set (reg:BXBC R_BX)
        (unspec:BXBC [(reg:BXBC R_BX)] UNSPEC_BX))   <--- Artificial dependency
   (set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(reg:BXBC R_BX)] UNSPEC_BXM))
   (unspec:BXBC [(reg:BXBC R_BX)] UNSPEC_BX)]   <--- Important to avoid some optimization.

Our port is still private and not in mainline.
Cheers, Bingfeng

-Original Message- From: Hariharan Sandanagobalane [mailto:harihar...@picochip.com] Sent: 13 May 2010 10:17 To: Bingfeng Mei Cc: gcc@gcc.gnu.org Subject: Re: Machine description question

The patterns for PUT/GET were:

; Scalar Put instruction.
(define_insn "commsPut"
  [(unspec_volatile [(match_operand:HI 0 "const_int_operand" "")
                     (match_operand:SI 1 "register_operand" "r")] UNSPEC_PUT)]
  ""
  "PUT %R1,%0\t// PORT(%0) := %R1"
  [(set_attr "type" "comms") (set_attr "length" "2")])

(define_insn "commsGet"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec_volatile:SI [(match_operand:HI 1 "immediate_operand" "n")] UNSPEC_GET))]
  ""
  "GET %1,%R0\t// %R0 := PORT(%1)"
  [(set_attr "type" "comms") (set_attr "length" "2")])

I changed them to:

; Scalar Put instruction.
(define_insn "commsPut"
  [(unspec [(match_operand:HI 0 "const_int_operand" "")
            (match_operand:SI 1 "register_operand" "r")] UNSPEC_PUT)
   (use (reg:HI DUMMY_COMMN_REGNUM))
   (clobber (reg:HI DUMMY_COMMN_REGNUM))]
  ""
  "PUT %R1,%0\t// PORT(%0) := %R1"
  [(set_attr "type" "comms") (set_attr "length" "2")])

; Simple scalar get.
(define_insn "commsGet"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(match_operand:HI 1 "immediate_operand" "n")] UNSPEC_GET))
   (use (reg:HI DUMMY_COMMN_REGNUM))
   (clobber (reg:HI DUMMY_COMMN_REGNUM))]
  ""
  "GET %1,%R0\t// %R0 := PORT(%1)"
  [(set_attr "type" "comms") (set_attr "length" "2")])

As for DUMMY_COMMN_REGNUM, I just defined it as a fixed register and bumped up FIRST_PSEUDO_REGISTER. Actually, there is one more problem I faced (other than performance): the code generated using unspecs was just plain wrong. The unspec pattern I was using for GET, which was inside a loop, was being hoisted out of the loop by the loop optimizer. I guess I should have seen this coming, since unspec is just a "machine-specific" operation, and the optimizer probably rightly assumes that multiple executions of it with the same operands produce the same value.
This obviously is not the case for these communication instructions. Do you
Re: delay branch bug?
Jeff Law wrote:
On 05/24/10 05:46, Hariharan wrote:
Hello all, I found something a little odd with delay slot scheduling. If I had the following bit of code (note that the "get" builtin functions in picochip stand for port communication):

int mytest ()
{
  int a[5];
  int i;
  for (i = 0; i < 5; i++)
    {
      a[i] = (int) getctrlIn();
    }
  switch (a[3])
    {
    case 0:
      return 4;
    default:
      return 13;
    }
}

The relevant bit of assembly for this, compiled at -Os, is:

_L2:
        GET 0,R[5:4]      // R[5:4] := PORT(0)
_picoMark_LBE5=
_picoMark_LBE4=
        .loc 1 13 0
        STW R4,(R3)0      // Mem((R3)0{byte}) := R4
        ADD.0 R3,2,R3     // R3 := R3 + 2 (HI)
        .loc 1 11 0
        SUB.0 R3,R2,r15   // CC := (R3!=R2)
        BNE _L2
=->     LDW (FP)3,R5      // R5 = Mem((FP)6{byte})
        .loc 1 22 0

=-> is the delay slot marker. Note that the LDW instruction has been moved into the delay slot. This corresponds to the load in the "switch (a[3])" statement above. The first three times around this loop, LDW loads uninitialised memory. The loaded value is ignored until we come out of the loop, so the code is functionally correct, but I am not sure the compiler introducing an uninitialised memory access when there was none in the source is good. I browsed around the delay branch code in reorg.c, but couldn't find anything that checks for this. Is this the intended behaviour? Can anyone familiar with the delay branch code help?

It's not ideal, but there's no way for reorg to know that a particular memory location is uninitialized, and as a result trying to "fix" this problem would ultimately result in reorg not being allowed to fill delay slots with memory references except under very, very restrictive circumstances. From a correctness standpoint, the uninitialized value will never be used, so it should cause no ill effects on your code. The biggest effect would be that tools like valgrind and purify (if supported on your architecture) would report the uninitialized memory read. [Which begs the question: how does purify handle this on sparc-solaris?
] The code compiled for picochip runs under a simulator. The simulator tracks uninitialised memory accesses and emits warnings, hence my question. I would agree with you that turning off delay slot filling of memory references for this sake doesn't make sense. Thanks for your help. Cheers Hari

Jeff

Thanks Hari
GCC vector extensions
Hello all, Is it possible to use RTL vector patterns like vec_extractm and vec_setm from C code? It looks like C subscripting of vector variables was allowed at some point and then removed. So, can these RTL patterns only be used from languages other than C? Of course, I can use them in target builtins, but I am trying to see if they can be used by language constructs themselves. Cheers Hari PS: I raised a related question in http://gcc.gnu.org/ml/gcc-help/2010-11/msg00021.html.
Re: GCC vector extensions
Hi Ian, Thanks for your help. I switched to mainline and vector extract works a treat. When I tried vector set, it still generated suboptimal code. Is this bit still a work in progress? Cheers Hari

On 04/11/10 19:23, Ian Lance Taylor wrote:
Hariharan Sandanagobalane writes: Is it possible to use RTL vector patterns like vec_extractm and vec_setm from C code? It looks like C subscripting of vector variables was allowed at some point and then removed. So, can these RTL patterns only be used from languages other than C? They were just recently added and have not been removed. Also answered on gcc-help. Ian
Steering Committee
Dear SC members, I used to maintain the picochip port of GCC, but I have not been active on it over the last eight months. This is unlikely to change in the future, so I would like my name to be removed from the maintainers list as picochip maintainer. I am still actively working on GCC, so I would like to be added to the "Write after approval" list. Thanks Hari
Stack parameter - pass by value - frame usage
Hello, I looked at an inefficient code sequence for a simple program using GCC's picochip port (not yet submitted to mainline). Basically, a program like

long carray[10];
void fn (long c, int i)
{
  carray[i] = c;
}

produces good assembly code. But if I do

struct complex16 { int re, im; };
struct complex16 carray[10];
void fn (struct complex16 c, int i)
{
  carray[i] = c;
}

GCC generates poor code. It has an extra save and restore of the frame pointer, even though we don't use the frame. I dug a bit further and found that the get_frame_size() call returns 4 in this case, and hence the port's prologue generation code emits the frame pointer update. It seems that each element of the struct is copied from the parameter registers to the stack and then that value is used in the function. I have the following RTL as we get into RTL generation:

(insn 6 2 7 2 (set (reg:HI 26) (reg:HI 0 R0 [ c ])) -1 (nil) (nil))
(insn 7 6 10 2 (set (reg:HI 27) (reg:HI 1 R1 [ c+2 ])) -1 (nil) (nil))
(insn 10 7 8 2 (set (reg/v:HI 28 [ i ]) (reg:HI 2 R2 [ i ])) -1 (nil) (nil))
(insn 8 10 9 2 (set (mem/s/c:HI (reg/f:HI 21 virtual-stack-vars) [3 c+0 S2 A16])
        (reg:HI 26)) -1 (nil) (nil))
(insn 9 8 11 2 (set (mem/s/c:HI (plus:HI (reg/f:HI 21 virtual-stack-vars)
            (const_int 2 [0x2])) [3 c+2 S2 A16])
        (reg:HI 27)) -1 (nil) (nil))

Note that the parameter is written to the frame in the last two instructions above. This, I am guessing, is why get_frame_size() returns 4 later on, even though the actual store of the struct parameter to the stack is eliminated in later optimization phases (CSE and DCE, I believe). Why does the compiler do this? I vaguely remember x86 storing all parameter values on the stack. Is that the reason for this behaviour? Is there anything I can do in the port to get around this problem? Note: in our port, "int" is 16 bits and "long" is 32 bits. Thanks in advance, Regards Hari
Re: Stack parameter - pass by value - frame usage
Ian Lance Taylor wrote:
Hariharan Sandanagobalane <[EMAIL PROTECTED]> writes: I looked at an inefficient code sequence for a simple program using GCC's picochip port (not yet submitted to mainline). Are you working with mainline sources?

I was not. I tried the same with the GCC 4.3 branch and it does fix most of the problems. There are still corner cases where it produces suboptimal code. I will try to figure out what's wrong in those and get back to you. Regards Hari

Note that the parameter is written to the frame in the last two instructions above. This, I am guessing, is why get_frame_size() returns 4 later on, even though the actual store of the struct parameter to the stack is eliminated in later optimization phases (CSE and DCE, I believe). Why does the compiler do this? I vaguely remember x86 storing all parameter values on the stack. Is that the reason for this behaviour? Is there anything I can do in the port to get around this problem?

At a guess, it's because the frontend decided that the struct was addressable and needed to be pushed on the stack. I thought this got cleaned up recently, though. Ian
Profile information - CFG
Hello, I am implementing support for PBO on the picochip port of GCC (not yet submitted to mainline). I see that GCC generates two files, xx.gcno and xx.gcda, containing the profile information: the former holds the flow graph information (compile time) and the latter the edge profile counts (run time). The CFG information seems to be emitted quite early in the compilation process (pass_tree_profile). Is the instrumentation also done at this time? If it is, as later phases change the CFG, how is the sanity of the instrumentation code maintained? If it isn't, how would you correlate the CFG in the gcno file with the actual CFG at execution time (which produces the gcda file)? In our port's case, we are already able to generate profile information using our simulator/hardware, and it is not too difficult for me to format that information into .gcno and .gcda files. But I guess the CFG I would have at run time would be quite different from the CFG in the initial phases of compilation (even at the same optimization level). Any suggestions on this? Would I be better off keeping the gcno file that GCC generates, matching the runtime CFG to the one in the gcno file, and then writing the gcda file accordingly? Has anyone tried inserting profile information from outside of the GCC instrumentation back into the compiler? Could you please let me know how you handled this? In general, does anyone have any numbers on the performance improvements that PBO brings in GCC? Thanks in advance. Regards Hari
Re: Profile information - CFG
Seongbae Park wrote:
On 9/27/07, Hariharan Sandanagobalane <[EMAIL PROTECTED]> wrote:
Hello, I am implementing support for PBO on the picochip port of GCC (not yet submitted to mainline). I see that GCC generates two files, xx.gcno and xx.gcda, containing the profile information: the former holds the flow graph information (compile time) and the latter the edge profile counts (run time). The CFG information seems to be emitted quite early in the compilation process (pass_tree_profile). Is the instrumentation also done at this time?

Yes.

If it is, as later phases change the CFG, how is the sanity of the instrumentation code maintained?

Instrumentation code sanity is naturally maintained, since the counter updates are global loads/stores. Compiler transformations preserve the original semantics of the input, and since the profile counters are global variables, updates to them are preserved to do what unoptimized code would do.

If it isn't, how would you correlate the CFG in the gcno file with the actual CFG at execution time (which produces the gcda file)? In our port's case, we are already able to generate profile information using our simulator/hardware, and it is not too difficult for me to format that information into .gcno and .gcda files. But I guess the CFG I would have at run time would be quite different from the CFG in the initial phases of compilation (even at the same optimization level). Any suggestions on this? Would I be better off keeping the gcno file that GCC generates, matching the runtime CFG to the one in the gcno file, and then writing the gcda file accordingly?

Not only better off, you *need* to provide information that matches what's in the gcno file; otherwise GCC can't read that gcda, nor use it. How you match the gcno is a different problem: there's no guarantee that you'll be able to recover enough information from the output assembly, because without instrumentation GCC can optimize away the control flow.
pass_tree_profile is when both the instrumentation (with -fprofile-generate) and the reading of the profile data (with -fprofile-use) are done. The CFG has to remain the same between generate and use; otherwise the compiler isn't able to use the profile data.

Thanks for your help, Seongbae. I have managed to get the profile information formatted the way a .gcda file would look. But does GCC expect the profile to be accurate? Would it accept profile data that came out of sampling? -Hari

Seongbae
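As background for the generate/use cycle discussed above, this is the standard two-phase GCC PBO workflow. The flags are the documented GCC options; the file and program names are made up for illustration:

```shell
# Phase 1: build an instrumented binary. Compiling emits pbo_demo.gcno;
# running the binary writes the edge counts to pbo_demo.gcda.
cat > pbo_demo.c <<'EOF'
#include <stdio.h>
int main (void)
{
  long s = 0;
  for (int i = 0; i < 1000; i++)
    s += (i % 3 == 0) ? i : -i;
  printf ("%ld\n", s);
  return 0;
}
EOF

gcc -O2 -fprofile-generate pbo_demo.c -o pbo_demo
./pbo_demo                      # training run: produces pbo_demo.gcda

# Phase 2: rebuild using the collected profile. The CFG recorded in the
# .gcno file must match what the compiler sees here, which is why the
# same source and options are used.
gcc -O2 -fprofile-use pbo_demo.c -o pbo_demo_opt
./pbo_demo_opt
```

Feeding externally collected (e.g. simulator-derived) counts into this cycle, as proposed in the thread, amounts to synthesizing the .gcda file that the training run would have written, keyed to the compiler-generated .gcno.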
VLIW scheduling and delayed branch
Hi, I am trying to enable delayed branch scheduling on our port of GCC for picochip (a 16-bit VLIW DSP). I understand that delayed branch scheduling is run as a separate pass after the DFA scheduling is done. We depend on the TImode flag set on cycle-start instructions to decide which instructions form a valid VLIW bundle. With delayed branch enabled, the delay branch pass seems to take any instruction and put it in the delay slot. It sometimes picks a TImode-flagged instruction but does not set TImode on the next instruction. Has anyone faced a similar problem before? Are there targets for which both VLIW and delayed branch scheduling are enabled? Perhaps ia64? Thanks for your help. Regards Hari
Re: VLIW scheduling and delayed branch
Hi Thomas, Thanks for your reply. A couple of questions below.

Thomas Sailer wrote:
Has anyone faced a similar problem before? Are there targets for which both VLIW and delayed branch scheduling are enabled? Perhaps ia64? I did something similar a few months ago.

What was your target? Is the target code available in GCC mainline? If not, could you pass your code to me?

The problem is that the haifa scheduler and the delayed branch scheduling pass don't really fit together; delayed branch scheduling happily undoes all the haifa decisions. The question is how much you gain by delayed branch scheduling. I don't have numbers, but it wasn't much in my case. And since your company name is picochip, you certainly value size more than speed?!

Yeah, we do. But in our architecture a branch has to have a delay slot instruction anyway; in the absence of one, we put a "nop" there. If GCC manages to move a single-instruction VLIW into the delay slot, we benefit in both size and speed; otherwise there is no impact on either.

I pursued two approaches. The first was to insert "stop bit" pseudo insns into the RTL stream in machine-dependent reorg, so I didn't have to rely on TImode insn flags during output. But then delayed branch scheduling just took one insn out of an insn group and put it into the delay slot, meaning there was usually no cycle gain at all, just larger code size (due to insn duplication).

This seems fairly straightforward to implement.

The second approach was having lots of parallel insns (using match_parallel and a custom predicate). Machine-dependent reorg then converts insn bundles into a single parallel insn. Delayed branch scheduling then does the right thing. This approach works fairly well for me, but there are a few complications. My output code is pretty hackish, as I didn't want to duplicate outputting a single insn versus outputting the same insn as a component of a parallel insn group.

When do you un-parallel those instructions? And how? Regards Hari

Tom
vliw scheduling - TImode bug?
Hello, I see quite a few instances where I get the following RTL: a conditional branch, followed by a BASIC_BLOCK note, followed by a non-TImode instruction. Theoretically, I should be allowed to package the non-TImode instruction along with the conditional branch, but doing so seems to produce incorrect results. Am I supposed to treat the NOTE_INSN_BASIC_BLOCK as a cycle breaker? Or is it a genuine bug in the way TImode is set on instructions?

(jump_insn:TI 144 225 17 2 /home/hariharans5/gcc-4.2.2/gcc/testsuite/gcc.c-torture/execute/931004-8.c:15
    (parallel [
        (set (pc) (if_then_else (le:HI (reg:CC 17 pseudoCC) (const_int 0 [0x0]))
                (label_ref 109)
                (pc)))
        (use (const_int 77 [0x4d]))
    ]) 10 {*branch} (nil)
    (expr_list:REG_DEAD (reg:CC 17 pseudoCC)
        (expr_list:REG_BR_PROB (const_int 500 [0x1f4]) (nil))))

(note 17 144 124 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(note 124 17 21 3 ("/home/hariharans5/gcc-4.2.2/gcc/testsuite/gcc.c-torture/execute/931004-8.c") 17)

(insn 21 124 196 3 /home/hariharans5/gcc-4.2.2/gcc/testsuite/gcc.c-torture/execute/931004-8.c:15
    (set (reg:HI 3 R3) (plus:HI (reg/f:HI 13 FP) (const_int 12 [0xc])))
    31 {*lea_move} (nil) (nil))

Thanks and regards Hari