Re: GCC 4.3.4 is casting my QImode vars to SImode for libcall

2010-06-16 Thread Paolo Bonzini

On 06/15/2010 11:02 AM, Paulo J. Matos wrote:

Just noticed the following also in optabs.c:

   /* We can't do it with an insn, so use a library call.  But first ensure
  that the mode of TO is at least as wide as SImode, since those are the
  only library calls we know about.  */

   if (GET_MODE_SIZE (GET_MODE (to)) < GET_MODE_SIZE (SImode))
     {
       target = gen_reg_rtx (SImode);
       expand_fix (target, from, unsignedp);
     }

This comment provides some insight into why gcc keeps converting to
SImode.


I think the comment dates back to before the introduction of conversion 
optabs.


Maybe the right thing to compare with is the size of an int in bytes?

Paolo


RFC: ARM Cortex-A8 and floating point performance

2010-06-16 Thread Siarhei Siamashka
Hello,

Currently gcc (at least version 4.5.0) does a very poor job generating single 
precision floating point code for ARM Cortex-A8.

The source of this problem is the use of VFP instructions which are run on a 
slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode 
(flush denormals to zero, disable exceptions) just provides a relatively minor 
performance gain.

The right solution seems to be the use of NEON instructions for doing most of
the single precision calculations.

I wonder if it would be difficult to introduce the following changes to the 
gcc generated code when optimizing for cortex-a8:
1. Allocate single precision variables only to evenly or oddly numbered
s-registers.
2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
'vadd.f32 d0, d0, d1' instead.

The number of single precision floating point registers gets effectively 
halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
(packing/unpacking of register pairs may be needed to ensure proper parameters 
passing to functions). Also there may be other problems, like dealing with 
strict IEEE-754 compliance (maybe a special variable attribute for relaxing 
compliance requirements could be useful). But this looks like the only 
solution to fix poor performance on ARM Cortex-A8 processor.

Actually clang 2.7 seems to be working exactly this way. And it is
outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single precision
floating point tests that I tried on ARM Cortex-A8.

-- 
Best regards,
Siarhei Siamashka


Re: RFC: ARM Cortex-A8 and floating point performance

2010-06-16 Thread Richard Guenther
On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka
 wrote:
> Hello,
>
> Currently gcc (at least version 4.5.0) does a very poor job generating single
> precision floating point code for ARM Cortex-A8.
>
> The source of this problem is the use of VFP instructions which are run on a
> slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
> (flush denormals to zero, disable exceptions) just provides a relatively minor
> performance gain.
>
> The right solution seems to be the use of NEON instructions for doing most of
> the single precision calculations.
>
> I wonder if it would be difficult to introduce the following changes to the
> gcc generated code when optimizing for cortex-a8:
> 1. Allocate single precision variables only to evenly or oddly numbered
> s-registers.
> 2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
> 'vadd.f32 d0, d0, d1' instead.
>
> The number of single precision floating point registers gets effectively
> halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
> (packing/unpacking of register pairs may be needed to ensure proper parameters
> passing to functions). Also there may be other problems, like dealing with
> strict IEEE-754 compliance (maybe a special variable attribute for relaxing
> compliance requirements could be useful). But this looks like the only
> solution to fix poor performance on ARM Cortex-A8 processor.
>
> Actually clang 2.7 seems to be working exactly this way. And it is
> outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single precision
> floating point tests that I tried on ARM Cortex-A8.

On i?86 we have -mfpmath={sse,x87}, I suppose you could add
-mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
and requiring neon support).

Richard.

> --
> Best regards,
> Siarhei Siamashka
>


Re: RFC: ARM Cortex-A8 and floating point performance

2010-06-16 Thread Andrew Pinski



Sent from my iPhone

On Jun 16, 2010, at 6:04 AM, Richard Guenther wrote:

> On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka
>  wrote:
>> Hello,
>>
>> Currently gcc (at least version 4.5.0) does a very poor job generating
>> single precision floating point code for ARM Cortex-A8.
>> [...]
>
> On i?86 we have -mfpmath={sse,x87}, I suppose you could add
> -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
> and requiring neon support).


Except unlike SSE, NEON does not fully support IEEE arithmetic. So this
should only be done with -ffast-math :). The fact that it is slow is not a
good enough reason to change it to something that is wrong and fast.







RE: [RFC] Cleaning up the pass manager

2010-06-16 Thread Grigori Fursin
Hi Diego,

Thanks a lot for doing this! I was a bit sad not to be able to continue this
work on pass selection and reordering, but I would really like to see the GCC
pass manager improved in the future. I also forwarded your email to the
cTuning mailing list in case some of the ICI/MILEPOST GCC/cTuning CC users
would like to provide more feedback.

By the way, one of the main reasons why I started developing ICI many years
ago was to be able to query GCC for all available passes and then use an
arbitrary selection and ordering of them for the whole program (IPO/LTO) or
per function, similar to what I could easily do with SUIF in my past research
on empirical optimizations and to what can easily be done in LLVM now.
However, implementing it was really not easy because:

* We have a non-trivial (and not always fully documented) association between
flags and passes, i.e. if I turn on an unroll flag that enables several
passes, I can't later reproduce exactly the same behavior without any GCC
flags by just turning on the associated passes through the pass manager.

* I believe the original idea of the pass manager introduced in GCC 4.x was
to keep a simple linked list of passes that are executed in a given order
ONLY through documented functions (an API) and that can be turned on or off
through an attribute in the list. This was a great idea and one of the
reasons why I finally moved to GCC from Open64 in 2004. However, I was a bit
surprised to see, in a GCC 4.x release, explicit if statements inside the
pass manager that enable some passes (for LTO). In my opinion this defeats
the main strength of the pass manager, and it also caused us trouble when
porting ICI to the new GCC 4.5.

* There is no table with full dependency info for each pass that could tell
you, at each stage of compilation, which passes can be selected next. I
started working on that at the end of last year, obtaining such info
semi-empirically and also through the associated attributes (we presented
preliminary results at GROW'10: http://ctuning.org/dissemination/grow10-08.pdf,
section 3.1), but again this was just before I moved to my new job, so I
couldn't finish it.

* There is the well-known problem that some global variables are shared
between passes, preventing arbitrary pass orders.

By the way, just to be clear, this is only feedback based on the experience
of my colleagues and myself. I do not want to say that these are the most
important things for GCC right now (though I think they are in the long
term), or that someone should fix them, particularly since I am not currently
working in this area; so if someone thinks this is unimportant or obvious,
just skip it ;) ... I now see a lot of effort going into cleaning up GCC and
addressing some of the above issues, which I think is really great, and I am
sad that I can't help much at this stage. However, before moving to my new
job I released all the tools from my past research at cTuning.org, so maybe
someone will find them useful for continuing to address the above issues.

Cheers,
Grigori




-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Diego 
Novillo
Sent: Tuesday, June 15, 2010 4:03 AM
To: gcc@gcc.gnu.org
Subject: [RFC] Cleaning up the pass manager

I have been thinking about doing some cleanups to the pass manager.
The goal would be to have the pass manager be the central driver of
every action done by the compiler.  In particular, the front ends
should make use of it and the callgraph manager, instead of the
twisted interactions we have now.

Additionally, I would like to (at some point) incorporate some/most of
the functionality provided by ICI
(http://ctuning.org/wiki/index.php/CTools:ICI).  I'm not advocating
for integrating all of ICI, but leave enough hooks so such
experimentations are easier to do.

Initially, I'm going for some low hanging fruit:

- Fields properties_required, properties_provided and
properties_destroyed should Mean Something other than asserting
whether they exist.
- Whatever doesn't exist before a pass, needs to be computed.
- Pass scheduling can be done by simply declaring a pass and
presenting it to the pass manager.  The property sets should be enough
for the PM to know where to schedule a pass.
- dump_file and dump_flags are no longer globals.

Are there any particular pain points that people are currently
experiencing that fit this?


Thanks.  Diego.



Re: RFC: ARM Cortex-A8 and floating point performance

2010-06-16 Thread Ramana Radhakrishnan

On Wed, 2010-06-16 at 15:52 +, Siarhei Siamashka wrote:
> Hello,
> 
> Currently gcc (at least version 4.5.0) does a very poor job generating single 
> precision floating point code for ARM Cortex-A8.
> 
> The source of this problem is the use of VFP instructions which are run on a 
> slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode 
> (flush denormals to zero, disable exceptions) just provides a relatively minor
> performance gain.

> The right solution seems to be the use of NEON instructions for doing most of
> the single precision calculations.

Only in situations where the user has explicitly opted in with -ffast-math.
I will point out that single precision floating point operations on NEON are
not completely IEEE compliant.

cheers
Ramana



DWARF Version 4 Released

2010-06-16 Thread Michael Eager

The final version of DWARF Version 4 is available
for download from http://dwarfstd.org.

--
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077


DWARF v4 .debug_line and .debug_frame formats

2010-06-16 Thread Roland McGrath
Are there any plans to make GCC and/or GAS emit the version 4 variants of
the .debug_line and/or .debug_frame formats?

The .debug_line version 4 format only adds the "maximum operations per
instruction" header field and associated logic, which is only meaningful
for VLIW machines (i.e. ia64--are there others?).  The old format is
specified such that it's always safe to use the new line-number program
operations without changing the header version field, so there is no real
reason to emit the new header format unless using the VLIW support.  But it
seems consistent with the rest of the behavior of -gdwarf-4 to emit the v4
format with that option.  I'd like to know when or if to expect ever to see
this format.

Similarly, the .debug_frame version 4 format only adds the address_size and
segment_size header fields.  I don't know if there are any GCC/GAS target
configurations that support segmented addresses for code so as to need
segment_size, or any that support using an address size other than that
implied by the ELF file class (or another container format's explicit or
implicit address size, or the architecture's implicit address size) so as
to need address_size.  But the same logic and questions apply as for
.debug_line even so.  OTOH, e.g. x86-64 -mcmodel=small could use
address_size 4 and save some space in the .debug_frame output.


Thanks,
Roland