Re: GCC 4.3.4 is casting my QImode vars to SImode for libcall
On 06/15/2010 11:02 AM, Paulo J. Matos wrote:
> Just noticed the following also in optabs.c:
>
>   /* We can't do it with an insn, so use a library call.  But first
>      ensure that the mode of TO is at least as wide as SImode, since
>      those are the only library calls we know about.  */
>   if (GET_MODE_SIZE (GET_MODE (to)) < GET_MODE_SIZE (SImode))
>     {
>       target = gen_reg_rtx (SImode);
>       expand_fix (target, from, unsignedp);
>     }
>
> This comment provides some insight into why gcc keeps converting to
> SImode.

I think the comment dates back to before the introduction of conversion optabs. Maybe the right thing to compare with is the size of an int in bytes?

Paolo
RFC: ARM Cortex-A8 and floating point performance
Hello,

Currently gcc (at least version 4.5.0) does a very poor job generating single precision floating point code for ARM Cortex-A8.

The source of this problem is the use of VFP instructions, which run on a slow, non-pipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode (flush denormals to zero, disable exceptions) provides only a relatively minor performance gain.

The right solution seems to be to use NEON instructions for most of the single precision calculations.

I wonder if it would be difficult to introduce the following changes to the gcc-generated code when optimizing for cortex-a8:
1. Allocate single precision variables only to evenly or oddly numbered s-registers.
2. Instead of using 'fadds s0, s0, s2' or similar instructions, use 'vadd.f32 d0, d0, d1'.

The number of single precision floating point registers is effectively halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky (packing/unpacking of register pairs may be needed to ensure proper parameter passing to functions). There may be other problems too, such as dealing with strict IEEE-754 compliance (maybe a special variable attribute for relaxing compliance requirements could be useful). But this looks like the only way to fix the poor performance on the ARM Cortex-A8 processor.

Actually, clang 2.7 seems to work exactly this way. And it outperforms gcc 4.5.0 by up to a factor of 2 or 3 on some single precision floating point tests that I tried on ARM Cortex-A8.

--
Best regards,
Siarhei Siamashka
Re: RFC: ARM Cortex-A8 and floating point performance
On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka wrote:
> Hello,
>
> Currently gcc (at least version 4.5.0) does a very poor job generating single
> precision floating point code for ARM Cortex-A8.
>
> The source of this problem is the use of VFP instructions which are run on a
> slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
> (flush denormals to zero, disable exceptions) just provides a relatively minor
> performance gain.
>
> The right solution seems to be the use of NEON instructions for doing most of
> the single precision calculations.
>
> I wonder if it would be difficult to introduce the following changes to the
> gcc generated code when optimizing for cortex-a8:
> 1. Allocate single precision variables only to evenly or oddly numbered
> s-registers.
> 2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
> 'vadd.f32 d0, d0, d1' instead.
>
> The number of single precision floating point registers gets effectively
> halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
> (packing/unpacking of register pairs may be needed to ensure proper parameters
> passing to functions). Also there may be other problems, like dealing with
> strict IEEE-754 compliance (maybe a special variable attribute for relaxing
> compliance requirements could be useful). But this looks like the only
> solution to fix poor performance on ARM Cortex-A8 processor.
>
> Actually clang 2.7 seems to be working exactly this way. And it is
> outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single precision
> floating point tests that I tried on ARM Cortex-A8.

On i?86 we have -mfpmath={sse,x87}, I suppose you could add -mfpmath=neon
for arm (properly conflicting with -mfloat-abi=hard and requiring neon
support).

Richard.

> --
> Best regards,
> Siarhei Siamashka
Re: RFC: ARM Cortex-A8 and floating point performance
Sent from my iPhone

On Jun 16, 2010, at 6:04 AM, Richard Guenther wrote:

> On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka wrote:
>> Currently gcc (at least version 4.5.0) does a very poor job generating
>> single precision floating point code for ARM Cortex-A8.
>>
>> [... full proposal quoted above: allocate single precision variables to
>> even/odd numbered s-registers and use 'vadd.f32 d0, d0, d1' instead of
>> 'fadds s0, s0, s2' ...]
>
> On i?86 we have -mfpmath={sse,x87}, I suppose you could add -mfpmath=neon
> for arm (properly conflicting with -mfloat-abi=hard and requiring neon
> support).

Except that unlike SSE, NEON does not fully implement IEEE arithmetic. So this should only be done with -ffast-math :). The point that the current code is slow is not a good enough reason to change it to something that is wrong and fast.

> Richard.
>
>> --
>> Best regards,
>> Siarhei Siamashka
RE: [RFC] Cleaning up the pass manager
Hi Diego,

Thanks a lot for doing this! I was a bit sad not to be able to continue this work on pass selection and reordering, but I would really like to see the GCC pass manager improved in the future. I also forwarded your email to the cTuning mailing list in case some of the ICI/MILEPOST GCC/cTuning CC users would want to provide more feedback.

By the way, one of the main reasons why I started developing ICI many years ago was to be able to query GCC for all available passes and then use an arbitrary selection and ordering of them for the whole program (IPO/LTO) or per function, similar to what I could easily do with SUIF in my past research on empirical optimizations and what can easily be done in LLVM now. However, implementing it was really not easy because:

* We have a non-trivial (and not always fully documented) association between flags and passes; i.e. if I turn on the unroll flag, which turns on several passes, I can't later reproduce exactly the same behavior if I use no GCC flags but just try to turn on the associated passes through the pass manager.

* I believe the original idea of the pass manager introduced in GCC 4.x was to keep a simple linked list of passes that are executed in a given order ONLY through documented functions (API) and that can be turned on or off through an attribute in the list. This was a great idea and was one of the reasons why I finally moved to GCC from Open64 in 2004. However, I was a bit surprised to see in some GCC 4.x version explicit if statements inside the pass manager that enable some passes (for LTO); in my opinion, this undermines the main strength of the pass manager and also gave us trouble porting ICI to the new GCC 4.5.

* There is no table with full dependency info for each pass that can tell you, at each stage of compilation, which passes can be selected next.
I started working on that at the end of last year to get such info semi-empirically and also through the associated attributes (we presented preliminary results at GROW'10: http://ctuning.org/dissemination/grow10-08.pdf, section 3.1). However, again, it was just before I moved to the new job, so I couldn't finish it ...

* The well-known problem that we have some global variables shared between passes, preventing arbitrary orders.

By the way, just to be clear, this is just feedback based on the experience of my colleagues and myself. I do not want to say that these are the most important things for GCC right now (though I think they are in the long term), or that someone should fix them, particularly since I am personally not working in this area right now. So if someone thinks it's not important/useless/obvious, just skip it ;) ... I now see lots of effort going on to clean up GCC and to address some of the above issues, which I think is really great, and I am sad that I can't help much at this stage. However, before moving to my new job, I released all the tools from my past research at cTuning.org, so maybe someone will find them useful for continuing to address the above issues ...

Cheers,
Grigori

-----Original Message-----
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Diego Novillo
Sent: Tuesday, June 15, 2010 4:03 AM
To: gcc@gcc.gnu.org
Subject: [RFC] Cleaning up the pass manager

I have been thinking about doing some cleanups to the pass manager. The goal would be to have the pass manager be the central driver of every action done by the compiler. In particular, the front ends and the callgraph manager should make use of it, instead of the twisted interactions we have now.
Additionally, I would like to (at some point) incorporate some/most of the functionality provided by ICI (http://ctuning.org/wiki/index.php/CTools:ICI). I'm not advocating integrating all of ICI, but leaving enough hooks so that such experimentation is easier to do.

Initially, I'm going for some low-hanging fruit:

- Fields properties_required, properties_provided and properties_destroyed should Mean Something other than asserting whether they exist.
- Whatever doesn't exist before a pass needs to be computed.
- Pass scheduling can be done by simply declaring a pass and presenting it to the pass manager. The property sets should be enough for the PM to know where to schedule a pass.
- dump_file and dump_flags are no longer globals.

Are there any particular pain points that people are currently experiencing that fit this?

Thanks. Diego.
Re: RFC: ARM Cortex-A8 and floating point performance
On Wed, 2010-06-16 at 15:52 +, Siarhei Siamashka wrote:
> Hello,
>
> Currently gcc (at least version 4.5.0) does a very poor job generating single
> precision floating point code for ARM Cortex-A8.
>
> The source of this problem is the use of VFP instructions which are run on a
> slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
> (flush denormals to zero, disable exceptions) just provides a relatively minor
> performance gain.
>
> The right solution seems to be the use of NEON instructions for doing most of
> the single precision calculations.

Only in situations where the user has opted in with -ffast-math. I will point out that single precision floating point operations on NEON are not completely IEEE compliant.

cheers
Ramana
DWARF Version 4 Released
The final version of DWARF Version 4 is available for download from http://dwarfstd.org.

--
Michael Eager  ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
DWARF v4 .debug_line and .debug_frame formats
Are there any plans to make GCC and/or GAS emit the version 4 variants of the .debug_line and/or .debug_frame formats?

The .debug_line version 4 format only adds the "maximum operations per instruction" header field and the associated logic, which is only meaningful for VLIW machines (i.e. ia64; are there others?). The old format is specified such that it's always safe to use the new line-number program operations without changing the header version field, so there is no real reason to emit the new header format unless the VLIW support is being used. But it seems consistent with the rest of the behavior of -gdwarf-4 to emit the v4 format with that option. I'd like to know when, or if, to expect ever to see this format.

Similarly, the .debug_frame version 4 format only adds the address_size and segment_size header fields. I don't know if there are any GCC/GAS target configurations that support segmented addresses for code (so as to need segment_size), or any that support using an address size other than that implied by the ELF file class (or another container format's explicit or implicit address size, or the architecture's implicit address size) so as to need address_size. But the same logic and questions apply as for .debug_line even so. OTOH, e.g. x86-64 -mcmodel=small could use address_size 4 and save some space in the .debug_frame output.

Thanks,
Roland