> Vasanth <[EMAIL PROTECTED]>
> I am using powerpc-eabi-gcc (3.4.1) and trying to retarget it for a
> fully pipelined FPU. I have a DFA model for the FPU. I am looking at
> the code produced for a simple FIR algorithm (a loop iterating over an
> array, with a multiply-add operation per iteration). (I am not using
> the fused-madd)
>
> for (i = 0; i < 64; i++)
>     accum += z[i] * h[i];
>
> I have the FIR loop partially unrolled, yet am not seeing the multiply
> from say, iteration i+1, overlapping with the multiply from iteration
> i. From the scheduling dumps, I do see that the compiler knows that
> each use of the multiply is incurring the full latency of the multiply
> instead of having reduced latency by pipelining in software. The adds
> are also completely linked by data flow and the compiler does not seem
> to be using temporary registers to be able to exploit executing some
> of the adds in parallel. Hence, each add is stalled on the previous
> add.
>
> fadds f5,f0,f8
> fadds f4,f5,f6
> fadds f2,f4,f11
> fadds f1,f2,f3
> fadds f11,f1,f13
>
> The register pressure is not very high. Registers f15-f31 are not used
> at all.
To break the dependence chain between the adds, keep the original loop
(instead of partially unrolling it yourself) and compile with

  -funroll-loops -fvariable-expansion-in-unroller
  --param max-variable-expansions-in-unroller=8

(or some other number greater than 1 but small enough to avoid spills);
see http://gcc.gnu.org/ml/gcc/2004-09/msg01554.html.
This option, too, was introduced in GCC 4.0.
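
For illustration only, here is a hand-written sketch of the kind of
transformation variable expansion aims for: the single accumulator is
split into several independent partial sums that are combined after the
loop, so consecutive adds no longer wait on each other. The function
names (fir_single, fir_expanded) and the choice of four accumulators are
assumptions made for this example, not what the compiler necessarily
emits. Note that summing in a different order can change floating-point
rounding, which is why a compiler may only apply this to floating-point
sums under relaxed-math options.

```c
#include <stddef.h>

/* Original form: each add depends on the previous one through accum,
   so the adds form one serial chain. */
float fir_single(const float *z, const float *h, size_t n)
{
    float accum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        accum += z[i] * h[i];
    return accum;
}

/* Hand-expanded form (hypothetical, for illustration): four independent
   partial sums, so adds from neighbouring iterations can overlap in a
   pipelined FPU. */
float fir_expanded(const float *z, const float *h, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four products per trip, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        acc0 += z[i]     * h[i];
        acc1 += z[i + 1] * h[i + 1];
        acc2 += z[i + 2] * h[i + 2];
        acc3 += z[i + 3] * h[i + 3];
    }

    /* Remainder loop for trip counts that are not a multiple of four. */
    for (; i < n; i++)
        acc0 += z[i] * h[i];

    /* Combine the partial sums once, after the loop. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

With the 64-tap loop from the question, only the main loop runs (64 is a
multiple of four), and each accumulator carries a chain of 16 adds
instead of one chain of 64.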
Ayal.
>
> My question is, am I expecting the wrong version of GCC to be doing
> this? I saw the following thread about SMS
>
> http://gcc.gnu.org/ml/gcc/2003-09/msg00954.html
>
> that seems relevant. Would GCC 4.x be a better version for my
> requirement? If not, any ideas would be greatly appreciated.
>
> thanks in advance,
> Vasanth