https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686
Bug ID: 70686
Summary: -fprofile-generate (not fprofile-use) somehow produces
much faster binary
Product: gcc
Version: 5.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: alekshs at hotmail dot com
Target Milestone: ---
I have this small benchmark that does 100mn loops of 20 divisions by 2.
Periodically it bumps up the values so that it continues to have something to
divide /2. I time this and see the results.
-------------------------------
<start of code>
#include <math.h>
#include <stdio.h>
#include <time.h>
int main()
{
printf("\n");
const double a = 3333333.3456743289; //initial randomly assigned values to
start halving
const double aa = 4444555.444334244;
const double aaa = 6666777.66666666;
const double aaaa = 32769999.123458;
unsigned int i;
double score;
double g; //the number to be used for making the divisions, so essentially
halving everything each round
double b;
double bb;
double bbb;
double bbbb;
g = 2;
b = a;
bb = aa;
bbb = aaa;
bbbb = aaaa;
double total_time;
clock_t start, end;
start = 0;
end = 0;
score = 0;
start = clock();
for (i = 1; i <100000001; i++)
{
b=b/g;
b=b/g;
b=b/g;
b=b/g;
b=b/g;
bb=bb/g;
bb=bb/g;
bb=bb/g;
bb=bb/g;
bb=bb/g;
bbb=bbb/g;
bbb=bbb/g;
bbb=bbb/g;
bbb=bbb/g;
bbb=bbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;
bbbb=bbbb/g;
if (b < 1.0000001) {b=b+i+12.432432432;} //just adding more stuff in
order for the number
if (bb < 1.0000001) {bb=bb+i+15.4324442;} //to return back to larger
values after several
if (bbb < 1.0000001) {bbb=bbb+i+19.42884;} //rounds of halving
if (bbbb < 1.0000001) {bbbb=bbbb+i+34.481;}
}
end = clock();
total_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
score = (10000000 / total_time);
printf("\nFinal number: %0.20f", (b+bb+bbb+bbbb));
printf("\nTime elapsed: %0.0f msecs", total_time);
printf("\nScore: %0.0f\n", score);
return 0;
}
-----------------------
<end of code>
This is run on a quad q8200 @ 1.75ghz
Now notice the times:
gcc Maths4asm.c -lm -O0 => 6224ms
gcc Maths4asm.c -lm -O2 and -O3 => 1527ms
gcc Maths4asm.c -lm -Ofast => 1227ms
gcc Maths4asm.c -lm -Ofast -march=nocona => 1236ms
gcc Maths4asm.c -lm -Ofast -march=core2 => 1197ms (I have a core quad,
technically it's core2 arch)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-generate => 624ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-generate => 530ms.
gcc Maths4asm.c -lm -Ofast -march=nocona -fprofile-use => 1258ms (slower than
without PGO, slower than -fprofile-generate)
gcc Maths4asm.c -lm -Ofast -march=core2 -fprofile-use => 1222ms (slower than
without PGO, slower than -fprofile-generate).
So PGO optimization made it worse, but the most mindblowing thing is the
running of the profiler, getting execution times down to 530ms. The profiler
run (-generate) should normally take this to 4000-5000ms or above, as it
monitors the process to create a log file.
I have never run into a -fprofile-generate build that wasn't at least 2-3 times
slower than a normal build - let alone 2-3 times faster. This is totally
absurd.
And then, to top it all, -fprofile-use (using the logfile to create the best
binary) created worse binaries.
Oh, and "nocona" (pentium4+) suddenly became ...the better architecture instead
of core2.
This stuff is almost unbelievable. I thought initially that the profiler must
be activating multithreading, but no. I scripted simultaneous use of 4 runs,
they all give the same time - that means, there was no extra cpu use in other
threads.
The implication of all these is that if -fprofile-generate can somehow give
code that is executing at 500ms, and the non -fprofile-generate binaries are
running at 1200ms, then serious performance is left on the table on normal
builds.