gcc 4.0.0 generates slower code than gcc 3.4.3 for the BLAS "axpy" operation.
(This is no doubt specific to IA32, and perhaps also to the processor version.)
The program is below; here are the timing results:
                     gcc 3.4.3   gcc 4.0.0
Method               cpu secs    cpu secs
z[]=x[]+alpha*y[]      1.45        1.72
z[]=z[]+alpha*y[]      1.47        2.03
z[]=z[]+y[]            1.44        1.57
The second method is a common special case of the first,
so it is unfortunate that gcc 4 does poorly on it.
========
The program is split into two files, rzvaxpy.c and zvaxpy.c, to defeat inlining.
Here is the script I used to compile and run them:
for m in METH1 METH2 METH3
do
    for cc in gcc343 gcc400
    do
        $cc -march=i686 -O3 -D$m rzvaxpy.c zvaxpy.c
        echo $cc $m `(time ./a.out)2>&1`
    done
done
==== zvaxpy.c
void
zvaxpy(double *z, double *x, double *y, int n, double alpha)
{
    int i;
#if defined(METH1)
    for (i = 0; i < n; i++) z[i] = x[i] + alpha * y[i];
#elif defined(METH2)
    for (i = 0; i < n; i++) z[i] = z[i] + alpha * y[i];
#else
    for (i = 0; i < n; i++) z[i] = z[i] + y[i];
#endif
}
==== rzvaxpy.c
#include <stdio.h>
#define N 100
#define NITER ((300*1000*1000)/N)
double a[N], b[N];
extern void zvaxpy(double *, double *, double *, int, double);
int
main(void)
{
    int i;
    double sum;
    for (i = 0; i < N; i++) { a[i] = 0; b[i] = 1; }
    /* z and x are the same array, so METH1 and METH2 do the same work */
    for (i = 0; i < NITER; i++) zvaxpy(a, a, b, N, 1.1);
    sum = 0;
    for (i = 0; i < N; i++) sum += a[i];
    printf("sum %g\n", sum);
    return 0;
}
--
Summary: i686 floating point performance 33% slower than gcc 3.4.3
Product: gcc
Version: 4.0.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: trt at acm dot org
CC: gcc-bugs at gcc dot gnu dot org
GCC target triplet: i686-pc-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21550