Re: missed optimization: transforming while(n>=1) into if(n>=1)

2011-05-21 Thread Siarhei Siamashka
On Sat, May 21, 2011 at 9:07 AM, Matt Turner  wrote:
> Hi,
>
> While trying to optimize pixman, I noticed that gcc is unable to
> recognize that 'while (n >= 1)' can often be simplified to 'if (n >=
> 1)'. Consider the following example, where there are loops that
> operate on larger amounts of data and smaller loops that deal with
> small or unaligned data.
>
> int sum(const int *l, int n)
> {
>    int s = 0;
>
>    while (n >= 2) {
>        s += l[0] + l[1];
>
>        l += 2;
>        n -= 2;
>    }
>
>    while (n >= 1) {
>        s += l[0];
>
>        l += 1;
>        n -= 1;
>    }
>
>    return s;
> }
>
> Clearly the while (n >= 1) loop can never execute more than once, as n
> must be < 2, and in the body of the loop, n is decremented.
>
> The resulting machine code includes the backward branch to the top of
> the while (n >= 1) loop, which can never be taken.
>
> I suppose this is a missed optimization. Is this known, or should I
> make a new bug report?

I have an old bugreport for a somewhat related problem:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734

It's not difficult to modify the C code and replace 'while' with 'if'
where appropriate. But even then the compilers can still pessimize the
code, despite being given enough hints about how to generate it
efficiently (the result of the expression is used for the comparison
with zero, which makes it obvious that the flags already set by the
arithmetic instruction can be reused without emitting an extra
comparison instruction). Though it's somewhat more complicated when
targeting MIPS processors.

-- 
Best regards,
Siarhei Siamashka


RFC: ARM Cortex-A8 and floating point performance

2010-06-16 Thread Siarhei Siamashka
Hello,

Currently gcc (at least version 4.5.0) does a very poor job generating single 
precision floating point code for ARM Cortex-A8.

The source of this problem is the use of VFP instructions which are run on a 
slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode 
(flush denormals to zero, disable exceptions) just provides a relatively minor 
performance gain.

The right solution seems to be the use of NEON instructions for doing most of
the single precision calculations.

I wonder if it would be difficult to introduce the following changes to the 
gcc generated code when optimizing for cortex-a8:
1. Allocate single precision variables only to even-numbered (or only to
odd-numbered) s-registers.
2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
'vadd.f32 d0, d0, d1' instead.

The number of single precision floating point registers gets effectively 
halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
(packing/unpacking of register pairs may be needed to ensure proper 
parameter passing to functions). There may also be other problems, like 
dealing with strict IEEE-754 compliance (maybe a special variable attribute 
for relaxing compliance requirements could be useful). But this looks like 
the only solution to fix the poor performance on the ARM Cortex-A8 processor.

Actually, clang 2.7 seems to work exactly this way. And it outperforms
gcc 4.5.0 by up to a factor of 2 or 3 on some single precision floating
point tests that I tried on ARM Cortex-A8.

-- 
Best regards,
Siarhei Siamashka


Re: RFC: ARM Cortex-A8 and floating point performance

2010-06-17 Thread Siarhei Siamashka
On Wednesday 16 June 2010 15:22:32 Ramana Radhakrishnan wrote:
> On Wed, 2010-06-16 at 15:52 +0000, Siarhei Siamashka wrote:
> > Currently gcc (at least version 4.5.0) does a very poor job generating
> > single precision floating point code for ARM Cortex-A8.
> >
> > The source of this problem is the use of VFP instructions which are run
> > on a slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on
> > RunFast mode (flush denormals to zero, disable exceptions) just provides
> > a relatively minor performance gain.
> >
> > The right solution seems to be the use of NEON instructions for doing
> > most of the single precision calculations.
> 
> Only in situations that the user is aware about -ffast-math. I will
> point out that single precision floating point operations on NEON are
> not completely IEEE compliant.

Sure. The way gcc deals with IEEE compliance in the generated code should 
preferably be consistent and clearly defined. That's why I reported the 
following problem earlier: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43703

Generating fast floating point code for Cortex-A8 with the -ffast-math option can be a 
good starting point. And ideally it would be nice to be able to mix IEEE 
compliant and non-IEEE compliant parts of code. Supporting something like this 
would be handy:

typedef __attribute__((ieee_noncompliant)) float fast_float;

For example, Qt defines its own 'qreal' type, which currently defaults to 
'float' for ARM and to 'double' for all the other architectures. Many 
applications are not that sensitive to strict IEEE compliance or even 
precision. But some applications and libraries are, so they need to be 
respected too.

But in any case, ARM Cortex-A8 has some hardware to do reasonably fast single 
precision floating point calculations (with some compliance issues). It makes 
a lot of sense to be able to utilize this hardware efficiently from a high 
level language such as C/C++ without rewriting tons of existing code. 

AFAIK x86 had its own share of issues with 80-bit extended precision when 
just 32-bit or 64-bit precision was needed.

By the way, I tried to experiment with solving (or working around) this 
floating point performance issue by making a C++ wrapper class, overloading 
operators and using NEON intrinsics. It provided a nice speedup in some 
cases. But gcc still has trouble generating efficient code for NEON 
intrinsics, and there were other issues, like the size of this newly 
defined type, which make it not very practical overall.

-- 
Best regards,
Siarhei Siamashka