Re: Modifying ARM code generator for elimination of 8bit writes - need help
On Fri, Jun 02, 2006 at 08:23:49AM +0200, Wolfgang Mües wrote:
> Rask,
>
> (_only_ adding the clobber statement),
> I get
>
> >0/newlib/libc/argz/argz_create_sep.c:60: error: unrecognizable insn:
> (insn 192 21 24 0 (set (reg:QI 1 r1) (reg:QI 4 r4)) -1
> (nil) (nil))
>
> What do you mean with
>
> > You will also have to modify any code which
> > expands this pattern accordingly.

The rest of the ARM backend presently assumes that the pattern has the form

    (set (operand:QI 0) (operand:QI 1))

but now we've changed it to

    (parallel [(set (operand:QI 0) (operand:QI 1))
               (clobber (operand:QI 2))
    ])

so that's why you get "unrecognizable insn" errors now. Any place which intended to generate an *arm_movqi_insn has to add a clobber also. For a start, this means the "movqi" pattern.

There may be a faster way of seeing if the modification is going to work for the DS at all. I noticed from the output template "swp%?b\\t%1, %1, [%M0]" that "swp" takes three operands. I don't know ARM assembler, but you may be able to choose to always clobber a specific register. Make it a fixed register (see FIXED_REGISTERS), refer to this register directly in the output template and don't add a clobber to the movqi patterns. IMHO, that's an acceptable hack at an experimental stage. If the resulting code runs correctly on the DS, you can then undo the FIXED_REGISTERS change and add the clobber statements.

--
Rask Ingemann Lambertsen
[patch] Improve loop array prefetch for IA-64
Hi, all

This patch results a performance increase of 4% for SPECfp2000 and 13% for NAS benchmark suite on Itanium-2 system, respectively. More performance increase is hopeful by further tuning the parameters and improving the prefetch algorithm at tree level.

Details of NAS benchmarks are listed below.

GCC options: -O3 -fprefetch-loop-arrays
Target: Itanium-2 1.6GHz; L2 Cache 256K, L3 Cache 6M

Execution times in seconds

            -this patch   +this patch
bt.W           14.43         14.17
cg.A           13.76          6.86
ep.W            7.83          7.79
ft.A           18.73         20.15
is.B           11.85         10.94
lu.W           20.55         20.27
mg.A           15.09         11.86
sp.W           37.11         35.49
geomean        15.84         13.94
speedup                      13.68%

2006-06-02  Canqun Yang  <[EMAIL PROTECTED]>

	* config/ia64/ia64.h (SIMULTANEOUS_PREFETCHES): Define to 18.
	(PREFETCH_BLOCK): Define to 128.
	(PREFETCH_LATENCY): Define to 400.

Index: ia64.h
===
--- ia64.h (revision 114307)
+++ ia64.h (working copy)
@@ -1985,13 +1985,18 @@
    ??? This number is bogus and needs to be replaced before the value is
    actually used in optimizations.  */
 
-#define SIMULTANEOUS_PREFETCHES 6
+#define SIMULTANEOUS_PREFETCHES 18
 
 /* If this architecture supports prefetch, define this to be the size of
    the cache line that is prefetched.  */
 
-#define PREFETCH_BLOCK 32
+#define PREFETCH_BLOCK 128
 
+/* A number that should roughly corresponding to the nunmber of instructions
+   executed before the prefetch is completed.  */
+
+#define PREFETCH_LATENCY 400
+
 #define HANDLE_SYSV_PRAGMA 1
 
 /* A C expression for the maximum number of instructions to execute via

Canqun Yang
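To make the role of the block size concrete, here is a minimal hand-written sketch of what prefetching ahead in an array loop looks like at the source level. It is not what -fprefetch-loop-arrays emits, and it says nothing about how the values in the patch were chosen. PREFETCH_BLOCK below only mirrors the 128-byte figure from the patch; PREFETCH_DISTANCE_BYTES and sum_with_prefetch are hypothetical names, and the distance of four blocks is an arbitrary assumption. The only real GCC interface used is __builtin_prefetch.

```c
#include <stddef.h>

/* Illustration only: PREFETCH_BLOCK mirrors the 128-byte value from the
   patch; the prefetch distance is an arbitrary assumption.  */
#define PREFETCH_BLOCK 128
#define PREFETCH_DISTANCE_BYTES (4 * PREFETCH_BLOCK)

/* Sum an array while issuing one read prefetch per cache block, a fixed
   distance ahead of the element currently being read.  */
double
sum_with_prefetch (const double *a, size_t n)
{
  const size_t per_block = PREFETCH_BLOCK / sizeof (double);
  const size_t ahead = PREFETCH_DISTANCE_BYTES / sizeof (double);
  double sum = 0.0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      /* Prefetch once per block, and never form an address past the
         end of the array.  */
      if (i % per_block == 0 && i + ahead < n)
        __builtin_prefetch (&a[i + ahead], 0 /* read */, 3 /* high locality */);
      sum += a[i];
    }
  return sum;
}
```

The prefetch is only a hint, so the function returns the same sum with or without it; what changes is how many loads find their data already in cache.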
Re: [wwwdocs] releases.html v/s develop.html
On Thu, 1 Jun 2006, Joe Buck wrote:
> Let's just add the info to the table. Here is a proposed patch.
> Note that I resorted by date and added an explanation. I think
> that the attempt to sort by release number became increasingly
> untenable after 3.4, because we now have heavy overlapping. Better
> to just state it and explain it.

Sounds good. I applied your patch right away and will take care of updating the release manager documentation over the weekend.

Thanks!
Gerald
Re: [patch] Improve loop array prefetch for IA-64
Canqun Yang wrote:
> Hi, all
>
> This patch results a performance increase of 4% for SPECfp2000 and 13% for NAS benchmark suite on Itanium-2 system, respectively. More performance increase is hopeful by further tuning the parameters and improving the prefetch algorithm at tree level.

Hi Canqun,

It's great news that you continued to work on prefetching tuning for ia64! Do you plan to port your other changes for the old RTL prefetching to the tree level?

> @@ -1985,13 +1985,18 @@
> ??? This number is bogus and needs to be replaced before the value is
> actually used in optimizations. */

I suggest removing this comment, as it has become outdated with your patch. Instead you might say how you chose this particular value (and PREFETCH_BLOCK too). Just my 2c.

Andrey
Re: [patch] Improve loop array prefetch for IA-64
On 6/2/06, Canqun Yang <[EMAIL PROTECTED]> wrote:
> This patch results a performance increase of 4% for SPECfp2000 and 13% for NAS benchmark suite on Itanium-2 system, respectively. More performance increase is hopeful by further tuning the parameters and improving the prefetch algorithm at tree level.

Bravo.

> --- ia64.h (revision 114307)
> +++ ia64.h (working copy)
> @@ -1985,13 +1985,18 @@
> ??? This number is bogus and needs to be replaced before the value is
> actually used in optimizations. */
>
> -#define SIMULTANEOUS_PREFETCHES 6
> +#define SIMULTANEOUS_PREFETCHES 18

Is the number still bogus as the comment suggests, or is there a rationale for 18? It looks quite high.

> +/* A number that should roughly corresponding to the nunmber of instructions
> + executed before the prefetch is completed. */
> +
> +#define PREFETCH_LATENCY 400

Likewise. Is 400 cycles the memory latency on itanium-2?

Gr.
Steven
Re: [patch] Improve loop array prefetch for IA-64
--- Andrey Belevantsev <[EMAIL PROTECTED]>:

> Canqun Yang wrote:
> > Hi, all
> >
> > This patch results a performance increase of 4% for SPECfp2000 and 13% for
> > NAS benchmark suite on Itanium-2 system, respectively. More performance
> > increase is hopeful by further tuning the parameters and improving the
> > prefetch algorithm at tree level.
>
> Hi Canqun,
>
> It's great news that you continued to work on prefetching tuning for
> ia64! Do you plan to port your other changes for the old RTL
> prefetching to the tree level?

Yes, but I don't have much time to do it now. I am busy with other things.

> > @@ -1985,13 +1985,18 @@
> > ??? This number is bogus and needs to be replaced before the value is
> > actually used in optimizations. */
>
> I suggest to remove this comment as it has become outdated with your
> patch. Instead you might say how did you choose this particular value
> (and PREFETCH_BLOCK too). Just my 2c.
>
> Andrey

Please refer to my previous mail and the attached paper.

Canqun Yang
addressability checks in the gimplifier
Hello,

First a short description of a problem we are seeing, then a couple of related questions on addressability checks in the gimplifier.

From a simple Ada testcase which I can provide if need be, the front-end is producing a MODIFY_EXPR with a lhs of the following shape when we get to gimplify_modify_expr:

    arg 1 bit offset arg 1

in short, a variable size array_range_ref within a bitfield record component.

The lhs remains of similar shape after gimplification, the rhs is of variable size as well, and we end up at this point in gimplify_modify_expr:

    /* If we've got a variable sized assignment between two lvalues (i.e. does
       not involve a call), then we can make things a bit more straightforward
       by converting the assignment to memcpy or memset.  */
    if (TREE_CODE (*from_p) == WITH_SIZE_EXPR)
      {
        tree from = TREE_OPERAND (*from_p, 0);
        tree size = TREE_OPERAND (*from_p, 1);

        if (TREE_CODE (from) == CONSTRUCTOR)
          return gimplify_modify_expr_to_memset (expr_p, size, want_value);
        if (is_gimple_addressable (from))
          {
            *from_p = from;
            return gimplify_modify_expr_to_memcpy (expr_p, size, want_value);
          }
      }

We get down into gimplify_modify_expr_to_memcpy, which builds ADDR_EXPRs for both operands, which ICEs later on from expand_expr_addr_expr_1 because the operand sketched above is not byte-aligned.

The first puzzle to me is that there is no check made that the target is a valid argument for an ADDR_EXPR. AFAICS, it has been gimplified with is_gimple_lvalue/fb_lvalue as the predicate/fallback pair, but this currently doesn't imply the required properties.

I first thought that an is_gimple_addressable (*to_p) addition to the outer condition would help, but it actually does not because the predicate is shallow and only checks a very restricted set of conditions (e.g. any ARRAY_RANGE_REF or COMPONENT_REF is considered "addressable").

This is actually the reason why the gimplified lhs tree is considered is_gimple_lvalue, from:

    bool
    is_gimple_lvalue (tree t)
    {
      return (is_gimple_addressable (t)
              || TREE_CODE (t) == WITH_SIZE_EXPR
              /* These are complex lvalues, but don't have addresses, so they
                 go here.  */
              || TREE_CODE (t) == BIT_FIELD_REF);
    }

Assuming that the initial tree is valid GENERIC, it would seem that a more sophisticated addressability checker (recursing down some inner refs and checking DECL_BIT_FIELD on field decls in COMPONENT_REFs) might be required. I'm unclear whether this could/should be is_gimple_addressable, as comments from http://gcc.gnu.org/ml/gcc/2004-07/msg01255.html indicate that it is not designed for this sort of operation.

I'm pretty sure I'm missing implicit assumptions and/or bits of design intent in various places, so would appreciate input on the case and puzzles described above.

Thanks very much in advance for your help,

With Kind Regards,

Olivier
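The difference between a shallow check and the recursive one suggested above can be sketched with a toy model. Everything here is hypothetical: toy_node, toy_code and both functions are stand-ins invented for illustration, not GCC's tree representation or its real predicates, and the field_is_bit_field flag merely plays the role that DECL_BIT_FIELD plays on a field decl.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for tree codes; these are NOT GCC's real data structures.  */
enum toy_code
{
  TOY_VAR_DECL,
  TOY_COMPONENT_REF,
  TOY_ARRAY_RANGE_REF,
  TOY_BIT_FIELD_REF
};

struct toy_node
{
  enum toy_code code;
  const struct toy_node *inner;  /* object being referenced, or NULL */
  bool field_is_bit_field;       /* models DECL_BIT_FIELD on the field */
};

/* Shallow check: looks only at the outermost node, so any array range
   or component reference is accepted.  */
bool
toy_shallow_addressable (const struct toy_node *t)
{
  return t->code != TOY_BIT_FIELD_REF;
}

/* Deep check: walks down the inner references and rejects the whole
   expression if any level selects a bit-field.  */
bool
toy_deep_addressable (const struct toy_node *t)
{
  for (; t != NULL; t = t->inner)
    {
      if (t->code == TOY_BIT_FIELD_REF)
        return false;
      if (t->code == TOY_COMPONENT_REF && t->field_is_bit_field)
        return false;
    }
  return true;
}
```

For an array range reference wrapped around a bit-field component, the shallow check accepts the expression while the deep check rejects it, which is the gap described in the message.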
Re: Expansion of __builtin_frame_address
On Thu, 2006-06-01 at 16:05, Mark Shinwell wrote:
> Mark Mitchell wrote:
> > Mark Shinwell wrote:
> >> As for the remaining problem, I suggest that we could:
> >>
> >> (i) always return the hard frame pointer, and disable FP elimination in
> >> the current function; or
> >>
> >> (iii) ...the same as option (i), but allow targets to define another macro
> >> that will cause the default code to use the soft frame pointer rather than
> >> the hard frame pointer, and hence allow FP elimination. (If such a macro
> >> were set by a particular target, am I right in thinking that it would be
> >> safe to use the soft frame pointer even in the count >= 1 cases?)
> >
> >> I tend to think that option (iii) might be best, although perhaps it
> >> is overkill and option (i) would do. But I'm not entirely sure;
> >> still being a gcc novice I have to admit to not being quite thoroughly
> >> clear on this myself at this stage. So any advice or comments would be
> >> appreciated!
> >
> > I agree that option (iii) is best, as it provides the ability to
> > optimize on platforms where that is feasible, and still provides a
> > working default elsewhere. I will review and approve a suitable patch
> > to implement (iii), assuming that there are no objections from Jim or
> > others.
>
> This having been discussed some more, and my understanding improved,
> I now believe that option (i) is in fact the correct thing to do. The
> attach patch implements this, which basically amounts to the same logic
> that is currently in the compiler save for the removal of the special
> case when count == 0.
>
> OK for mainline?

I'm not keen on this. On some machines a frame pointer is *very* expensive, both in terms of the code required to set it up, and the resulting loss of a register which affects code quality (in addition, on Thumb, the frame pointer can access much less data on the stack than the stack pointer can, so code quality is affected even more).

I can see no argument for a frame pointer being *required* for getting the return address. We didn't have to do this in the past, so I think it is wrong to require that we do it now.

R.

> Mark
>
> --
>
> 2006-06-01 Mark Shinwell <[EMAIL PROTECTED]>
>
> * gcc/builtins.c (expand_builtin_return_addr): Always use
> hard_frame_pointer_rtx and prevent frame pointer elimination
> if INITIAL_FRAME_ADDRESS_RTX isn't set.
Re: Expansion of __builtin_frame_address
Richard Earnshaw wrote:
> I'm not keen on this. On some machines a frame pointer is *very* expensive, both in terms of the code required to set it up, and the resulting loss of a register which affects code quality (in addition, on Thumb, the frame pointer can access much less data on the stack than the stack pointer can, so code quality is affected even more).

Do you have anything in mind that would be a better default? Is there something that could be used instead of hard_frame_pointer_rtx that will later expand to the correct frame address, but not necessarily force use of a frame pointer, for example? (As far as I can tell, frame_pointer_rtx will not do at least in the ARM case, because it doesn't yield the same value.)

If the hard frame pointer is forced by default, then targets which are particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX. Since such targets would presumably not have to force reload to keep the frame pointer, then definitions of such macros would not need to be side-effecting (in the way described earlier in this thread) and thus be satisfactory.

> I can see no argument for a frame pointer being *required* for getting the return address. We didn't have to do this in the past, so I think it is wrong to require that we do it now.

Currently, the code does require a frame pointer in all except the count == 0 case, and as far as that particular case goes I get the impression that it would have been treated in the same way had this glibc backtrace function been noticed last year. This may be a mistaken impression though.

Mark
Re: Expansion of __builtin_frame_address
> Mark Shinwell writes:

Mark> If the hard frame pointer is forced by default, then targets which are
Mark> particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX.
Mark> Since such targets would presumably not have to force reload to keep
Mark> the frame pointer, then definitions of such macros would not need to
Mark> be side-effecting (in the way described earlier in this thread) and thus
Mark> be satisfactory.

PowerPC also does not need hard_frame_pointer_rtx for all cases. It seems like a bad idea to force every port to define INITIAL_FRAME_ADDRESS_RTX to avoid a penalty.

Why can't whatever port needs this change define INITIAL_FRAME_ADDRESS_RTX to hard_frame_pointer_rtx? One could add "count" to the macro so that the port can optimize further and avoid hard_frame_pointer_rtx, if possible.

David
Re: [wwwdocs] releases.html v/s develop.html
On 6/2/06, Gerald Pfeifer <[EMAIL PROTECTED]> wrote:
> Mind to send/commit a patch to complete releases.html with 4.x releases and add a step to releasing.html? (Basically you just need to revert revision 1.26 of that file.)

Joe Buck beat me to it and you applied it for him. Thanks to both of you.

Thanks,
Ranjit.

--
Ranjit Mathew       Email: rmathew AT gmail DOT com
Bangalore, INDIA.   Web: http://rmathew.com/
Re: Expansion of __builtin_frame_address
David Edelsohn wrote:
> Mark Shinwell writes:
>
> Mark> If the hard frame pointer is forced by default, then targets which are
> Mark> particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX.
> Mark> Since such targets would presumably not have to force reload to keep
> Mark> the frame pointer, then definitions of such macros would not need to
> Mark> be side-effecting (in the way described earlier in this thread) and thus
> Mark> be satisfactory.
>
> PowerPC also does not need hard_frame_pointer_rtx for all cases. It seems like a bad idea to force every port to define INITIAL_FRAME_ADDRESS_RTX to avoid a penalty.
>
> Why can't whatever port needs this change define INITIAL_FRAME_ADDRESS_RTX to hard_frame_pointer_rtx? One could add "count" to the macro so that the port can optimize further and avoid hard_frame_pointer_rtx, if possible.

OK, here is what I think is a better suggestion. First note that expand_builtin_return_addr is used for both __builtin_frame_address and __builtin_return_address.

The behaviour for the return address case seems to be for target-specific code to override the result of this function in the case when count == 0; thus, it does indeed not matter what we return from expand_builtin_return_addr in that case. (I hadn't realised this before.) The new patch, below, thus has the same behaviour for __builtin_return_address.

However when dealing with __builtin_frame_address, we must return the correct value from this function no matter what the value of count. This patch therefore forces use of a hard FP in such situations.

Is that more satisfactory?

Mark

Index: builtins.c
===
--- builtins.c (revision 114325)
+++ builtins.c (working copy)
@@ -496,12 +496,16 @@ expand_builtin_return_addr (enum built_i
 #else
   rtx tem;
 
-  /* For a zero count, we don't care what frame address we return, so frame
-     pointer elimination is OK, and using the soft frame pointer is OK.
-     For a non-zero count, we require a stable offset from the current frame
-     pointer to the previous one, so we must use the hard frame pointer, and
+  /* For a zero count with __builtin_return_address, we don't care what
+     frame address we return, because target-specific definitions will
+     override us.  Therefore frame pointer elimination is OK, and using
+     the soft frame pointer is OK.
+
+     For a non-zero count, or a zero count with __builtin_frame_address,
+     we require a stable offset from the current frame pointer to the
+     previous one, so we must use the hard frame pointer, and
      we must disable frame pointer elimination.  */
-  if (count == 0)
+  if (count == 0 && fndecl_code == BUILT_IN_RETURN_ADDRESS)
     tem = frame_pointer_rtx;
   else
     {
Re: Expansion of __builtin_frame_address
On Fri, 2006-06-02 at 14:57, Mark Shinwell wrote:
> Richard Earnshaw wrote:
> > I'm not keen on this. On some machines a frame pointer is *very*
> > expensive, both in terms of the code required to set it up, and the
> > resulting loss of a register which affects code quality (in addition, on
> > Thumb, the frame pointer can access much less data on the stack than the
> > stack pointer can, so code quality is affected even more).
>
> Do you have anything in mind that would be a better default? Is there
> something that could be used instead of hard_frame_pointer_rtx that
> will later expand to the correct frame address, but not necessarily force
> use of a frame pointer, for example? (As far as I can tell,
> frame_pointer_rtx will not do at least in the ARM case, because it doesn't
> yield the same value.)

Well, in the past the ARM prologue code would copy the return address into a pseudo if the body of the function invoked __builtin_return_address; the body of the code then just used the pseudo. Somebody found a reason to change this, but I can't remember why.

> If the hard frame pointer is forced by default, then targets which are
> particularly badly affected can simply define INITIAL_FRAME_ADDRESS_RTX.
> Since such targets would presumably not have to force reload to keep
> the frame pointer, then definitions of such macros would not need to
> be side-effecting (in the way described earlier in this thread) and thus
> be satisfactory.
>
> > I can see no argument for a frame pointer being *required* for getting
> > the return address. We didn't have to do this in the past, so I think
> > it is wrong to require that we do it now.
>
> Currently, the code does require a frame pointer in all except the
> count == 0 case, and as far as that particular case goes I get the
> impression that it would have been treated in the same way had this
> glibc backtrace function been noticed last year. This may be a mistaken
> impression though.

__builtin_frame_address(n) is not required to work for any value other than n=0. It's not clear what it means anyway on a function that eliminates the frame pointer.

On ARM you *cannot* walk the stack frames without additional information. Frames are private to each function and there is no defined format for the layout. In particular the layout (and the frame pointer register) is different for ARM and Thumb code, but the two can be freely intermixed.

The only chance you have for producing a backtrace() is to have unwinding information similar to that provided for exception unwinding. This would describe to the unwinder how that frame's code is laid out so that it can unpick it.

R.
Re: Expansion of __builtin_frame_address
On Fri, 2006-06-02 at 15:30, Mark Shinwell wrote:
> However when dealing with __builtin_frame_address, we must return the
> correct value from this function no matter what the value of count. This
> patch therefore forces use of a hard FP in such situations.

Eh? The manual explicitly says that __builtin_frame_address is only required to work if count=0. You simply cannot walk up arbitrary numbers of frames on some CPUs since code isn't compiled to support it.

R.
RE: [patch] Improve loop array prefetch for IA-64
Canqun,

Nice job getting this ready for the current version of gcc!

Question: does gcc now know the difference between prefetching to cache L1 via "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"? For floating point data, the latter is the only interesting case because float loads only access the L2. Thus using "lfetch" for floating point arrays will unnecessarily wipe out the contents of L1. (gcc 3.2.3 only seems to generate "lfetch", which is why I ask...)

Thanks,
Mark

-----Original Message-----
From: Canqun Yang [mailto:[EMAIL PROTECTED]
Sent: Friday, June 02, 2006 5:14 AM
To: gcc@gcc.gnu.org; [EMAIL PROTECTED]
Subject: [patch] Improve loop array prefetch for IA-64

Hi, all

This patch results a performance increase of 4% for SPECfp2000 and 13% for NAS benchmark suite on Itanium-2 system, respectively. More performance increase is hopeful by further tuning the parameters and improving the prefetch algorithm at tree level.

Canqun Yang
Re: Expansion of __builtin_frame_address
On Fri, Jun 02, 2006 at 04:20:21PM +0100, Richard Earnshaw wrote:
> On Fri, 2006-06-02 at 15:30, Mark Shinwell wrote:
>
> > However when dealing with __builtin_frame_address, we must return the
> > correct value from this function no matter what the value of count. This
> > patch therefore forces use of a hard FP in such situations.
>
> Eh? The manual explicitly says that __builtin_frame_address is only
> required to work if count=0. You simply cannot up walk arbitrary
> numbers of frames on some CPUs since code isn't compiled to support it.

Right - it's the result of __builtin_frame_address (0) we're looking at here. Mark's latest change seems logical to me: the user has asked for the frame address, so hadn't we better arrange that there's a frame?

--
Daniel Jacobowitz
CodeSourcery
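For readers unfamiliar with the two builtins under discussion, the following is a minimal sketch of the count == 0 case, which is the only one the thread agrees has to work. The struct frame_probe type and the probe_current_frame function are hypothetical names used for illustration; __builtin_frame_address, __builtin_return_address and the noinline attribute are the real GCC features. What the frame address means on a given target is exactly what is being debated above, so the sketch only records the values and makes no claim about how they relate to each other.

```c
#include <stddef.h>

/* Hypothetical container for the values we want to inspect.  */
struct frame_probe
{
  void *frame;           /* result of __builtin_frame_address (0) */
  void *return_address;  /* result of __builtin_return_address (0) */
  void *local;           /* address of a local, for comparison */
};

/* Kept out of line so that "the current function" is unambiguous.  */
__attribute__ ((noinline)) struct frame_probe
probe_current_frame (void)
{
  volatile char marker = 0;
  struct frame_probe p;

  p.frame = __builtin_frame_address (0);
  p.return_address = __builtin_return_address (0);
  p.local = (void *) &marker;
  return p;
}
```

A caller can print the three pointers with %p to see how its own target lays them out. Passing a count greater than zero is deliberately left out, since the thread notes it is not required to work.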
Re: Expansion of __builtin_frame_address
> __builtin_frame_address(n) is not required to work for any value other
> than n=0. It's not clear what it means anyway on a function that
> eliminates the frame pointer.
>
> On ARM you *cannot* walk the stack frames without additional
> information. Frames are private to each function and there is no
> defined format for the layout. In particular the layout (and the frame
> pointer register) is different for ARM and Thumb code, but the two can
> be freely intermixed.
>
> The only chance you have for producing a backtrace() is to have
> unwinding information similar to that provided for exception unwinding.
> This would describe to the unwinder how that frames code is laid out so
> that it can unpick it.

I agree that in general you need ancillary information to get a backtrace on Arm. However, if you assume only Arm code and -fno-omit-frame-pointer then you can walk the frames. Given this assumption I think it makes sense to have _b_f_a(0) force the use of a frame pointer.

If you're implementing "proper" unwinding then I'd say you want to use an assembly function stub to determine the initial frame, rather than relying on a rather ill-defined gcc builtin function.

In other words __builtin_frame_address is effectively useless, so we may as well make it consistent with historical use rather than try to optimize it.

The background to this problem is that we have a client who was upset when backtrace() "broke" with gcc4. For this particular client -marm -fno-omit-frame-pointer -mapcs-frame is an acceptable price to pay for making backtrace() work.

Paul
Re: Expansion of __builtin_frame_address
On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> I agree that in general you need ancillary information way to get a backtrace
> on Arm. However if you assume only Arm code code and -fno-omit-frame-pointer
> then you can walk the frames. Given this assumption I think it make sense to
> have _b_f_a(0) force the use of a frame pointer.

No, in the general case you can't. Because ARM and Thumb frames are laid out differently. In ARM code the frame pointer is in r11 (when not eliminated); in thumb code it is in r7 (because r11 can't be used in memory insns).

R.
Re: Expansion of __builtin_frame_address
On Fri, Jun 02, 2006 at 04:32:07PM +0100, Richard Earnshaw wrote:
> On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> > I agree that in general you need ancillary information way to get a
> > backtrace on Arm. However if you assume only Arm code code and
> > -fno-omit-frame-pointer then you can walk the frames. Given this
> > assumption I think it make sense to have _b_f_a(0) force the use of a
> > frame pointer.
>
> No, in the general case you can't. Because ARM and Thumb frames are
> laid out differently. In ARM code the frame pointer is in r11 (when not
> eliminated); in thumb code it is in r7 (because r11 can't be used in
> memory insns).

I'm reading these two paragraphs and the two of you seem to be in violent agreement. Paul assumed ARM code only.

--
Daniel Jacobowitz
CodeSourcery
Bug in gnu assembler?
How to reproduce this problem

1) Take some C file. I used for instance dwarf.c from the new binutils distribution.
2) Generate an assembler listing of this file.
3) Using objdump -s dwarf.o I dump all the sections of the executable in hexadecimal format. Put the result of this dump in some file; I used "vv" as name.
4) Dump the contents of the eh_frame section in symbolic form. You should use readelf -W. Put the result in some file, say, "dwarf.framedump".

---

OK. Now let's start. I go to the assembly listing (dwarf.s) and search for "eh_frame" in the editor. I arrive at:

    .section .debug_frame,"",@progbits

This section consists of a CIE (Common Information Entry in GNU terminology) that is generated as follows in the assembly listing:

    .Lframe0:
        .long .LECIE0-.LSCIE0
    .LSCIE0:
        .long 0x
        .byte 0x1
        .string ""
        .uleb128 0x1
        .sleb128 -8
        .byte 0x10
        .byte 0xc
        .uleb128 0x7
        .uleb128 0x8
        .byte 0x90
        .uleb128 0x1
        .align 8
    .LECIE0:

---

This corresponds to a symbolic listing like this (file dwarf.framedump):

    The section .debug_frame contains:

    0014 CIE
      Version: 1
      Augmentation: ""
      Code alignment factor: 1
      Data alignment factor: -8
      Return address column: 16

      DW_CFA_def_cfa: r7 ofs 8
      DW_CFA_offset: r16 at cfa-8
      DW_CFA_nop
      DW_CFA_nop
      DW_CFA_nop
      DW_CFA_nop
      DW_CFA_nop
      DW_CFA_nop

This means that this entry starts at offset 0 and goes for 20+4 bytes (the length field is 4 bytes). Our binary dump of the contents of the first 96 bytes (0x60) looks like this:

    Contents of section .eh_frame:
    1400 01000178 100c0708 ...x
    0010 9001 1c00 1c00
    0020 5900 Y...
    0030 410e1083 0200 1c00 3c00 A...<...
    0040 6800 h...
    0050 410e1083 0200 1400 5c00 A...\...
    0060 4e00 N...

We eliminate the first 24 (0x18) bytes and we obtain:

    0018 1c00 1c00
    0020 5900 Y...
    0030 410e1083 0200 1c00 3c00 A...<...

This is an FDE, or Frame Description Entry in GNU terminology. We have first a 32-bit length field represented by the difference .LEFDE0 - .LASFDE0. This is 1c00 above.

Then we have another .long instruction (32 bits) that corresponds to the second 1c00 above.

Then we have two .quad instructions that correspond to the line

    0020 5900 Y...

above.

AND NOW IT BECOMES VERY INTERESTING:

We have the instructions

    .byte 0x4
    .long .LCFI0 - .LFB50
    .byte 0xe
    .uleb128 0x10
    .byte 0x83
    .uleb128 0x2
    .align 8

And we find in the hexadecimal dump the line

    0030 410e1083 0200 1c00 3c00 A...<...

The 4 and the 1 are in the same byte, followed by the correct 0xe byte, the correct 0x10 byte (uleb128 is 0x10), followed by the correct 0x83 and followed by the correct 0x02 byte.

WHERE AM I WRONG?

I am getting CRAZY with this.

Here is the full assembly listing of the FDE:

    .LSFDE0:
        .long .LEFDE0-.LASFDE0   /* first field 1c00 */
    .LASFDE0:
        .long .Lframe0
        .quad .LFB50
        .quad .LFE50-.LFB50
        .byte 0x4
        .long .LCFI0-.LFB50
        .byte 0xe
        .uleb128 0x10
        .byte 0x83
        .uleb128 0x2
        .align 8
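One reading that is consistent with the dump, offered as an assumption rather than a confirmed diagnosis, is that the assembler replaced the five-byte DW_CFA_advance_loc4 (.byte 0x4 followed by a 4-byte delta) with the one-byte DW_CFA_advance_loc form. In the DWARF call frame encoding that form keeps its opcode in the top two bits (0x40) and the delta in the low six, so a delta of 1 would come out as the single byte 0x41. The decoder below is a minimal sketch of that encoding. The names cfa_kind, cfa_insn, read_uleb128 and decode_cfa are hypothetical, only a handful of opcodes are handled, and the 4-byte operand of DW_CFA_advance_loc4 is assumed to be little-endian.

```c
#include <stddef.h>
#include <stdint.h>

enum cfa_kind
{
  CFA_NOP,             /* 0x00 */
  CFA_ADVANCE_LOC,     /* high two bits 01, delta in low six bits */
  CFA_ADVANCE_LOC4,    /* 0x04, 4-byte delta follows */
  CFA_OFFSET,          /* high two bits 10, register in low six bits */
  CFA_DEF_CFA,         /* 0x0c, ULEB128 register, ULEB128 offset */
  CFA_DEF_CFA_OFFSET,  /* 0x0e, ULEB128 offset */
  CFA_UNKNOWN          /* anything this sketch does not handle */
};

struct cfa_insn
{
  enum cfa_kind kind;
  uint64_t op1;  /* delta, register or offset, depending on kind */
  uint64_t op2;  /* second operand where one exists */
};

/* Decode an unsigned LEB128 value at buf[*pos] and advance *pos.  */
static uint64_t
read_uleb128 (const uint8_t *buf, size_t len, size_t *pos)
{
  uint64_t result = 0;
  unsigned shift = 0;

  while (*pos < len)
    {
      uint8_t byte = buf[(*pos)++];
      if (shift < 64)
        result |= (uint64_t) (byte & 0x7f) << shift;
      shift += 7;
      if ((byte & 0x80) == 0)
        break;
    }
  return result;
}

/* Decode one call frame instruction at buf[*pos] and advance *pos.  */
struct cfa_insn
decode_cfa (const uint8_t *buf, size_t len, size_t *pos)
{
  struct cfa_insn insn = { CFA_UNKNOWN, 0, 0 };
  uint8_t byte;
  int i;

  if (*pos >= len)
    return insn;
  byte = buf[(*pos)++];

  /* Opcodes that carry an operand in the low six bits.  */
  if ((byte & 0xc0) == 0x40)
    {
      insn.kind = CFA_ADVANCE_LOC;
      insn.op1 = byte & 0x3f;
      return insn;
    }
  if ((byte & 0xc0) == 0x80)
    {
      insn.kind = CFA_OFFSET;
      insn.op1 = byte & 0x3f;
      insn.op2 = read_uleb128 (buf, len, pos);
      return insn;
    }

  switch (byte)
    {
    case 0x00:
      insn.kind = CFA_NOP;
      break;
    case 0x04:
      insn.kind = CFA_ADVANCE_LOC4;
      for (i = 0; i < 4 && *pos < len; i++)
        insn.op1 |= (uint64_t) buf[(*pos)++] << (8 * i);
      break;
    case 0x0c:
      insn.kind = CFA_DEF_CFA;
      insn.op1 = read_uleb128 (buf, len, pos);
      insn.op2 = read_uleb128 (buf, len, pos);
      break;
    case 0x0e:
      insn.kind = CFA_DEF_CFA_OFFSET;
      insn.op1 = read_uleb128 (buf, len, pos);
      break;
    default:
      break;
    }
  return insn;
}
```

Fed the bytes 41 0e 10 83 02 from the dump, this yields an advance of 1, a CFA offset of 16, and register 3 saved at factored offset 2, which matches the .byte 0xe / .uleb128 0x10 / .byte 0x83 / .uleb128 0x2 sequence in the listing once the first instruction is read in its compact form.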
Re: Expansion of __builtin_frame_address
Richard Earnshaw wrote:
> The only chance you have for producing a backtrace() is to have
> unwinding information similar to that provided for exception unwinding.
> This would describe to the unwinder how that frames code is laid out so
> that it can unpick it.

I'd suggest we leave backtrace() aside, and just talk about __builtin_frame_address(0), which does have well-defined semantics. _b_f_a(0) is currently broken on ARM, and we all agree we should fix it.

I mildly disagree with David's comment that:

> It seems like a bad idea to force every port to define
> INITIAL_FRAME_ADDRESS_RTX to avoid a penalty.

in that I think the default should be working code, and Mark's change accomplishes that. Of course, if _b_f_a(0) can be implemented more efficiently on some target, there should be a hook to do that. And, I think it's reasonable to ask Mark to go through and add that optimization to ports that already work that way, so that his patch doesn't regress any target. (I'm not actually sure how _b_f_a(0) works on other targets, but not on ARM.)

But, scrapping about the default probably isn't very productive. The important thing is to work out how _b_f_a(0) can be made to work on ARM.

Richard, I can't tell from your comments how you think _b_f_a(0) should be implemented on ARM. We could use Mark's logic (forcing a hard frame pointer), but stuff it into INITIAL_FRAME_ADDRESS_RTX. We could also try to reimplement the thing you mentioned about using a pseudo, though I guess we'd need to work out why that was thought a bad idea before. What option do you suggest?

--
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713
Re: Expansion of __builtin_frame_address
On Fri, 2006-06-02 at 16:35, Daniel Jacobowitz wrote:
> On Fri, Jun 02, 2006 at 04:32:07PM +0100, Richard Earnshaw wrote:
> > On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> > > I agree that in general you need ancillary information way to get a
> > > backtrace on Arm. However if you assume only Arm code code and
> > > -fno-omit-frame-pointer then you can walk the frames. Given this
> > > assumption I think it make sense to have _b_f_a(0) force the use of a
> > > frame pointer.
> >
> > No, in the general case you can't. Because ARM and Thumb frames are
> > laid out differently. In ARM code the frame pointer is in r11 (when not
> > eliminated); in thumb code it is in r7 (because r11 can't be used in
> > memory insns).
>
> I'm reading these two paragraphs and the two of you seem to be in
> violent agreement. Paul assumed ARM code only.

Well, that's a pretty limiting assumption given that ARM and thumb code can be freely intermixed. Indeed, I've often wondered if -Os should default to Thumb code on CPUs that can support it (and thumb code can corrupt the ARM frame register since it doesn't consider it to be special in any way -- it's just a call-saved register). I've also pondered making the compiler ignore -f[no-]omit-frame-pointer and to only use one in cases where the stack is dynamically adjustable.

R.
Re: Expansion of __builtin_frame_address
On Friday 02 June 2006 16:44, Richard Earnshaw wrote:
> On Fri, 2006-06-02 at 16:35, Daniel Jacobowitz wrote:
> > On Fri, Jun 02, 2006 at 04:32:07PM +0100, Richard Earnshaw wrote:
> > > On Fri, 2006-06-02 at 16:28, Paul Brook wrote:
> > > > I agree that in general you need ancillary information way to get a
> > > > backtrace on Arm. However if you assume only Arm code code and
> > > > -fno-omit-frame-pointer then you can walk the frames. Given this
> > > > assumption I think it make sense to have _b_f_a(0) force the use of a
> > > > frame pointer.
> > >
> > > No, in the general case you can't. Because ARM and Thumb frames are
> > > laid out differently. In ARM code the frame pointer is in r11 (when
> > > not eliminated); in thumb code it is in r7 (because r11 can't be used
> > > in memory insns).
> >
> > I'm reading these two paragraphs and the two of you seem to be in
> > violent agreement. Paul assumed ARM code only.
>
> Well, that's a pretty limiting assumption given that ARM and thumb code
> can be freely intermixed. Indeed, I've often wondered if -Os should
> default to Thumb code on CPUs that can support it (and thumb code can
> corrupt the ARM frame register since it doesn't consider it to be
> special in any way -- it's just a call-saved register). I've also
> pondered making the compiler ignore -f[no-]omit-frame-pointer and to
> only use one in cases where the stack is dynamically adjustable.

Ok, let me put it a different way. How is __builtin_frame_address(0) useful if you don't make these assumptions, and what would it be used for?

For the record I agree that __builtin_return_address(0) has use and should not force a frame pointer.

Paul
Re: Expansion of __builtin_frame_address
On Fri, 2006-06-02 at 16:46, Mark Mitchell wrote: > Richard, I can't tell from your comments how you think _b_f_a(0) should > be implemented on ARM. We could use Mark's logic (forcing a hard frame > pointer), but stuff it into INITIAL_FRAME_ADDRESS_RTX. We could also > try to reimplement the thing you mentioned about using a pseudo, though > I guess we'd need to work out why that was thought a bad idea before. > What option do you suggest? I think I need to understand first what _b_f_a(0) would be used for. Until I understand that I can't really say how best it should be implemented. One _possible_ implementation that would be reasonable would be the DWARF CFA value for the function: but that's very different from both the current ARM r11 value and the Thumb r7 value in functions that use a frame register. However, it is well defined in both ARM and Thumb code. Note that in ARM code r11 points near to the top of the frame, but in Thumb code r7 points to the bottom of the frame (in gcc-4.2 or later, since you can't use negative offsets in memory addresses). R.
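For readers following the thread, here is a minimal sketch of what the two builtins under discussion return at level 0. The helper names are made up for illustration. What the frame address actually points at is target-specific (r11 in ARM code, r7 in Thumb code, as described above), so the sketch only reports the values and makes no attempt to walk frames.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: the frame address of this function's own frame.
   noinline keeps the compiler from folding the helper into its caller.  */
__attribute__ ((noinline)) void *
own_frame_address (void)
{
  return __builtin_frame_address (0);
}

/* Hypothetical helper: the address this function will return to.  */
__attribute__ ((noinline)) void *
own_return_address (void)
{
  return __builtin_return_address (0);
}

/* Print both values; their meaning depends on the target and ABI.  */
void
report_frame (void)
{
  printf ("frame address:  %p\n", own_frame_address ());
  printf ("return address: %p\n", own_return_address ());
}
```

Calling report_frame () from main prints both pointers. Nothing here relies on a particular frame layout, which is exactly the part the thread disagrees about.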
Re: Bug in gnu assembler?
On Jun 2, 2006, at 8:46 AM, jacob navia wrote: > How to reproduce this problem - WHERE AM I WRONG ? You should write to [EMAIL PROTECTED] if you want a high probability of your question about the assembler being answered. -- Pinski
Re: GCC SC request about ecj
Richard Stallman wrote last night: I agree to the use of the Eclipse front end to generate Java byte codes. Note this does not mean importing Eclipse code into the gcc source or release tree. We need to decide on a practical way to have people grab a compatible version of ecj. -- --Per Bothner [EMAIL PROTECTED] http://per.bothner.com/
Re: comparing DejaGNU results
I took a quick pass at implementing the comparisons in a more suitable language. Run time is now a few seconds on both platforms. About the same as compare_tests on my old ibook/OSX and much faster on FC3. Trials show the same results as before. For anyone interested, the new version is attached. Jim. -- Jim Lemke [EMAIL PROTECTED] Orillia, Ontario dg-cmp-results.sh Description: application/shellscript
Re: GCC SC request about ecj
On Fri, Jun 02, 2006 at 10:59:58AM -0700, Per Bothner wrote: > Richard Stallman wrote last night: > > I agree to the use of the Eclipse front end to generate > Java byte codes. > > Note this does not mean importing Eclipse code into the gcc source or > release tree. We need to decide on a practical way to have people > grab a compatible version of ecj. Treat it like GMP, I guess; it's an external dependency. Tell people where to get it; have configure test for its presence and refuse to build anything that depends on it if it isn't found.
gcc 4.1.1 build reports: 3 gnu/linux flavors, HP/UX, Solaris 2.8
Hi,

Here are some gcc 4.1.1 build reports.

#1: i686-pc-linux-gnu, Red Hat EL3: C, C++, ObjC, and Java. "Native" tools were used.
    Test results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00019.html

#2: ia64-unknown-linux-gnu, Red Hat Advanced Workstation 2.1AW. C, C++, ObjC. binutils 2.16.1 were used.
    Test results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00065.html

#3: x86_64-unknown-linux-gnu, Red Hat EL3: C, C++, ObjC, and Java. "Native" tools were used.
    Test results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00061.html

#4: hppa2.0w-hp-hpux11.00, using GNU as version 2.15, native C compiler for bootstrap, native linker. C, C++, ObjC.
    Results: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg00122.html
    I had to install a new makeinfo to keep the build from bombing, even though the failure seems to be in fastjar, and Java won't build on this platform.

#5: sparc-sun-solaris2.8; as and ld from binutils 2.16.1.
    Build failure while building the Java library. I'll send a separate message on this.
Solaris 2.8 build failure for 4.1.1 (libtool/libjava)
I haven't tried to build Java on Solaris in quite a while because it takes so long. My attempt to build on Solaris 2.8 with binutils 2.16.1 died with /bin/ksh ./libtool --tag=GCJ --mode=link /remote/atg2/jbuck/solaris.tmp/411/gcc/gcj -B/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/ -B/remote/atg2/jbuck/solaris.tmp/411/gcc/ -L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava -g -O2 -o jv-convert --main=gnu.gcj.convert.Convert -rpath /u/jbuck/cvs.sol2/4.1.1/lib -shared-libgcc -L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/.libs libgcj.la /remote/atg2/jbuck/solaris.tmp/411/gcc/gcj -B/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/ -B/remote/atg2/jbuck/solaris.tmp/411/gcc/ -g -O2 -o .libs/jv-convert --main=gnu.gcj.convert.Convert -shared-libgcc -L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava -L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libjava/.libs ./.libs/libgcj.so -L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libstdc++-v3/src -L/remote/atg2/jbuck/solaris.tmp/411/sparc-sun-solaris2.8/libstdc++-v3/src/.libs -lpthread -lrt -ldl -L/remote/atg2/jbuck/solaris.tmp/411/./gcc -L/usr/ccs/lib -lgcc_s -lgcc_s -Wl,--rpath -Wl,/u/jbuck/cvs.sol2/4.1.1/lib /u/jbuck/gnu.sol2/bin/ld: unrecognized option '-Wl,-rpath' /u/jbuck/gnu.sol2/bin/ld: use the --help option for usage information collect2: ld returned 1 exit status It's GNU ld version 2.16.1. This is strange; I would have expected the linker to get just -rpath: -Wl should tell gcj to pass the following option to the linker. Any clues?
Re: [patch] Improve loop array prefetch for IA-64
On 6/2/06, Davis, Mark <[EMAIL PROTECTED]> wrote:
> Question: does gcc now know the difference between prefetching to cache L1 via "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"?

The ia64 backend knows the difference, see the prefetch pattern in ia64.md. But ia64 is the only backend that supports this kind of explicit locality parameter. And since no-one from the ia64 community cared much about gcc until recently, gcc's prefetching pass (which is limited anyway) does not generate lfetch.nt1 or other prefetches with explicit locality parameters.

> For floating point data, the latter is the only interesting case because float loads only access the L2. Thus using "lfetch" for floating point arrays will unnecessarily wipe out the contents of L1. (gcc 3.2.3 only seems to generate "lfetch", which is why I ask...)

You could experiment with this for ia64 by hacking issue_prefetch_ref in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating point types.

Gr.
Steven
Re: [patch] Improve loop array prefetch for IA-64
On 6/3/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
> > For floating point data, the latter is the only interesting case because float loads only
> > access the L2. Thus using "lfetch" for floating point arrays will unnecessarily wipe out
> > the contents of L1. (gcc 3.2.3 only seems to generate "lfetch", which is why I ask...)
>
> You could experiment with this for ia64 by hacking issue_prefetch_ref
> in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating
> point types.

E.g. something like this, which is (needless to say) untested but something you could play with.

Gr.
Steven

Index: tree-ssa-loop-prefetch.c
===
--- tree-ssa-loop-prefetch.c (revision 114315)
+++ tree-ssa-loop-prefetch.c (working copy)
@@ -816,7 +816,7 @@ static void
 issue_prefetch_ref (struct mem_ref *ref, unsigned unroll_factor, unsigned ahead)
 {
   HOST_WIDE_INT delta;
-  tree addr, addr_base, prefetch, params, write_p;
+  tree addr, addr_base, prefetch, params, write_p, locality;
   block_stmt_iterator bsi;
   unsigned n_prefetches, ap;
 
@@ -838,11 +838,21 @@ issue_prefetch_ref (struct mem_ref *ref,
                         addr_base, build_int_cst (ptr_type_node, delta));
       addr = force_gimple_operand_bsi (&bsi, unshare_expr (addr), true, NULL);
 
-      /* Create the prefetch instruction.  */
+      /* Create the prefetch instruction.  Do this by building a call to
+         `void __builtin_prefetch (const void *ADDR, int RW, int LOCALITY)'.
+
+         ??? The `locality' parameter is a shameless, untested hack to
+         force lfetch.nt1 -- hopefully.  */
       write_p = ref->write_p ? integer_one_node : integer_zero_node;
-      params = tree_cons (NULL_TREE, addr,
-                          tree_cons (NULL_TREE, write_p, NULL_TREE));
-
+      locality = FLOAT_TYPE_P (mem_ref->base)
+                 ? integer_one_node : integer_zero_node;
+      params = tree_cons (NULL_TREE,
+                          addr,
+                          tree_cons (NULL_TREE,
+                                     write_p,
+                                     tree_cons (NULL_TREE,
+                                                locality,
+                                                NULL_TREE)));
       prefetch = build_function_call_expr (built_in_decls[BUILT_IN_PREFETCH],
                                            params);
       bsi_insert_before (&bsi, prefetch, BSI_SAME_STMT);
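At the source level, the same experiment can be sketched with the documented three-argument form of __builtin_prefetch. This is only an illustration, not part of the patch: the function name and the prefetch distance are assumptions, and whether locality 1 actually ends up as lfetch.nt1 on ia64 is up to the backend's prefetch pattern and was not checked here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in elements; a real value would need tuning.  */
#define AHEAD 16

/* Sum N floats, prefetching AHEAD elements in front of the current one
   with rw = 0 (read) and locality = 1 (low temporal locality).  */
float
sum_floats_prefetch (const float *a, size_t n)
{
  float sum = 0.0f;
  size_t i;

  assert (a != NULL || n == 0);
  for (i = 0; i < n; i++)
    {
      /* Only form addresses that lie inside the array.  */
      if (i + AHEAD < n)
        __builtin_prefetch (&a[i + AHEAD], 0, 1);
      sum += a[i];
    }
  return sum;
}
```

The third argument has to be a compile-time constant between 0 and 3, with 3 (keep in all cache levels) being the default when it is omitted.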
gcc-4.1-20060602 is now available
Snapshot gcc-4.1-20060602 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.1-20060602/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.1 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_1-branch revision 114329

You'll find:

gcc-4.1-20060602.tar.bz2            Complete GCC (includes all of below)
gcc-core-4.1-20060602.tar.bz2       C front end and core compiler
gcc-ada-4.1-20060602.tar.bz2        Ada front end and runtime
gcc-fortran-4.1-20060602.tar.bz2    Fortran front end and runtime
gcc-g++-4.1-20060602.tar.bz2        C++ front end and runtime
gcc-java-4.1-20060602.tar.bz2       Java front end and runtime
gcc-objc-4.1-20060602.tar.bz2       Objective-C front end and runtime
gcc-testsuite-4.1-20060602.tar.bz2  The GCC testsuite

Diffs from 4.1-20060526 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.1 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Re: comparing DejaGNU results
On Jun 2, 2006, at 11:08 AM, James Lemke wrote: > I took a quick pass at implementing the comparisons in a more suitable language. Run time is now a few seconds on both platforms. About the same as compare_tests on my old ibook/OSX and much faster on FC3. Since Ben and I seem interested in this, I think we should check in this version. It seems portable enough and useful enough. Any objections from the crowd?
RE: [patch] Improve loop array prefetch for IA-64
--- "Davis, Mark" <[EMAIL PROTECTED]>: > Canqun, > > Nice job getting this ready for the current version of gcc! > > Question: does gcc now know the difference between prefetching to cache L1 > via "lfetch", as > opposed to prefetching only to level L2 via "lfetch.nt1"? For floating point > data, the latter > is the only interesting case because float loads only access the L2. Thus > using "lfetch" for > floating point arrays will unnecessarily wipe out the contents of L1. (gcc > 3.2.3 only seems to > generate "lfetch", which is why I ask...) > Yes, GCC does. I have tried this on the old prefetch implementation at RTL level and the new one at TREE level, but saw no significant performance difference for the SPECfp2000 and NAS benchmarks. Nevertheless, it is worth taking more time to inspect it. Canqun Yang > Thanks, > Mark > > -Original Message- > From: Canqun Yang [mailto:[EMAIL PROTECTED] > Sent: Friday, June 02, 2006 5:14 AM > To: gcc@gcc.gnu.org; [EMAIL PROTECTED] > Subject: [patch] Improve loop array prefetch for IA-64 > > Hi, all > > This patch results a performance increase of 4% for SPECfp2000 and 13% for > NAS benchmark suite > on > Itanium-2 system, respectively. More performance increase is hopeful by > further tuning the > parameters and improving the prefetch algorithm at tree level. > > > Canqun Yang > >
gen_lowpart vs big endian insv
h8300 has an HImode insv pattern. If you try to use it with an SImode argument, expmed.c uses gen_lowpart to force it into the desired mode. However, gen_lowpart eventually fails for pseudos on big endian:

rtx
gen_rtx_SUBREG (enum machine_mode mode, rtx reg, int offset)
{
  gcc_assert (validate_subreg (mode, GET_MODE (reg), reg, offset));
  return gen_rtx_raw_SUBREG (mode, reg, offset);
}

validate_subreg refuses to use a subreg to change the address of a pseudo that could be in memory (i.e. SI->HI on big endian).

So... where is the bug or false assumption here? The test case is h8300-elf vs gcc.dg/20040310-1.c with "-O1 -msx"

Thanks,
DJ
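To make the SI->HI point concrete, here is a small sketch of why taking the low part of a wider value changes its address on a big-endian target. Both helpers are hypothetical and are not GCC's own code; the sketch assumes byte and word endianness agree and that the value lives in memory as four consecutive bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from GCC: byte offset of the low-order
   OUTER_SIZE bytes inside an INNER_SIZE-byte value stored in memory.  */
size_t
lowpart_byte_offset (size_t outer_size, size_t inner_size, int big_endian)
{
  assert (outer_size <= inner_size);
  /* Big endian keeps the least significant bytes at the highest
     addresses, so the low part starts past the beginning.  */
  return big_endian ? inner_size - outer_size : 0;
}

/* Read the low 16 bits of a 32-bit value laid out in BYTES.  */
uint16_t
read_low_half (const unsigned char bytes[4], int big_endian)
{
  size_t off = lowpart_byte_offset (2, 4, big_endian);

  if (big_endian)
    return (uint16_t) ((bytes[off] << 8) | bytes[off + 1]);
  return (uint16_t) (bytes[off] | (bytes[off + 1] << 8));
}
```

On little endian the low half sits at offset 0, so the address is unchanged; on big endian it sits at offset 2, which is the address change the message says validate_subreg rejects for a pseudo that might end up in memory.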
Re: [patch] Improve loop array prefetch for IA-64
"Steven Bosscher" <[EMAIL PROTECTED]> writes: > On 6/2/06, Davis, Mark <[EMAIL PROTECTED]> wrote: > > Question: does gcc now know the difference between prefetching to cache L1 > > via > > "lfetch", as opposed to prefetching only to level L2 via "lfetch.nt1"? > > The ia64 backend knows the difference, see the prefetch pattern in ia64.md. > > But ia64 is the only backend that supports this kind of explicit > locality parameter. And since no-one from the ia64 community cared > much about gcc until recently, gcc's prefetching pass (which is > limited anyway) does not generate lfetch.nt1 or other prefetches with > explicit locality parameters. Actually SSE X86 has prefetches with different locality hints (T0, T1, T2, NTA) However x86 always needs to have the items in L1 cache to do anything with them even for FP data so it might not be very useful to do this particular optimization for it. T0 vs NTA is useful though and at least AMD K8 can make use of them - when data is streamed and not reused and there is a lot of it then NTA is a good idea. > > For floating point data, the latter is the only interesting case because > > float loads only > > access the L2. Thus using "lfetch" for floating point arrays will > > unnecessarily wipe out > the contents of L1. (gcc 3.2.3 only seems to > > generate "lfetch", which is why I ask...) > > You could experiment with this for ia64 by hacking issue_prefetch_ref > in tree-ssa-loop-prefetch.c to issue a prefetch to L2 for floating > point types. Perhaps it could generate different prefetches based on the array size being worked on? I guess e.g. for an 1MB array walk NTA is probably a good idea (with the 1MB being a tunable) -Andi