https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686
--- Comment #2 from alekshs at hotmail dot com ---
(In reply to Richard Biener from comment #1)
> It's not so mind-blowing - it's simply that -fprofile-generate makes our
> GIMPLE level if-conversion no longer apply. Without -fprofile-generate
> we if-convert the loop into
>
> for (i = 1; i <100000001; i++)
> {
> ...
>
> b = b + (b < 1.00001) ? i + 12.43 : 0.0;
> ...
> }
>
> thus we always evaluate the i + 12.43 and one additional addition of zero.
>
> We do this to eventually enable vectorization but without any check
> on whether it would be profitable when not vectorizing (your testcase
> shows it's not profitable).
>
> Confirmed. -fno-tree-loop-if-convert should fix it in this particular case.
Aha, thanks for the swift reply.
Regarding profitability, I should note that the PGO misses entirely the fact
that 20 mulsd could become 10 mulpd:
400560: f2 0f 59 e9 mulsd %xmm1,%xmm5
400564: f2 0f 59 e1 mulsd %xmm1,%xmm4
400568: f2 0f 59 d9 mulsd %xmm1,%xmm3
40056c: f2 0f 59 d1 mulsd %xmm1,%xmm2
400570: f2 0f 59 e9 mulsd %xmm1,%xmm5
400574: f2 0f 59 e1 mulsd %xmm1,%xmm4
400578: f2 0f 59 d9 mulsd %xmm1,%xmm3
40057c: f2 0f 59 d1 mulsd %xmm1,%xmm2
400580: f2 0f 59 e9 mulsd %xmm1,%xmm5
400584: f2 0f 59 e1 mulsd %xmm1,%xmm4
400588: f2 0f 59 d9 mulsd %xmm1,%xmm3
40058c: f2 0f 59 d1 mulsd %xmm1,%xmm2
400590: f2 0f 59 e9 mulsd %xmm1,%xmm5
400594: f2 0f 59 e1 mulsd %xmm1,%xmm4
400598: f2 0f 59 d9 mulsd %xmm1,%xmm3
40059c: f2 0f 59 d1 mulsd %xmm1,%xmm2
4005a0: f2 0f 59 e9 mulsd %xmm1,%xmm5
4005a4: f2 0f 59 e1 mulsd %xmm1,%xmm4
4005a8: f2 0f 59 d9 mulsd %xmm1,%xmm3
4005ac: f2 0f 59 d1 mulsd %xmm1,%xmm2
...So there was a job to be done there. That's at -O3 -march=native btw (to
preserve accuracy, unlike -Ofast). -Ofast doesn't pack them either. It kind of
splits the work into scalar muls and packed adds.
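To make the "20 mulsd could become 10 mulpd" point concrete, here is a minimal sketch using SSE2 intrinsics. The original test case is not shown in this comment, so the shape (four independent doubles all multiplied by the same factor) is an assumption read off the register pattern in the dump above. The names scale4_scalar and scale4_packed are hypothetical.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_mul_pd and friends */
#endif

/* Scalar form: four independent products, typically one mulsd each. */
static void scale4_scalar(double v[4], double k)
{
    v[0] *= k;
    v[1] *= k;
    v[2] *= k;
    v[3] *= k;
}

/* Packed form: the same four products as two mulpd when SSE2 is available.
 * Each lane is an ordinary IEEE double multiply, so no accuracy is traded
 * away, unlike the value-changing transformations -Ofast permits. */
static void scale4_packed(double v[4], double k)
{
#if defined(__SSE2__)
    __m128d kk = _mm_set1_pd(k);           /* broadcast k to both lanes */
    __m128d lo = _mm_loadu_pd(&v[0]);      /* v[0], v[1] */
    __m128d hi = _mm_loadu_pd(&v[2]);      /* v[2], v[3] */
    _mm_storeu_pd(&v[0], _mm_mul_pd(lo, kk));
    _mm_storeu_pd(&v[2], _mm_mul_pd(hi, kk));
#else
    scale4_scalar(v, k);                   /* portable fallback, no packing */
#endif
}
```

Five rounds of scale4_packed would be 10 mulpd where the dump shows 20 mulsd. Whether that is actually faster on a given CPU is not measured here.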
It's a similar situation with another such small benchmark I made where it was
doing 4 x sqrts all the time (with some stuff added when values got too low, so
as to keep going), but the 2x packed sqrts I did in asm were much faster than
the 4 scalar that gcc was generating (at every level of optimization and
profiling - it didn't do 2x packed... kept doing it 4x scalar). I'm attaching
the bench at the end.
It seems like gcc avoids packing instructions like the plague in non-array code
even when there are obvious and serious measurable benefits. Perhaps the
heuristics need some tune up for both profiled and non-profiled compilation.
-----
code of sqrtbench.c
-----
#include <math.h>
#include <stdio.h>
#include <time.h>
int main()
{
const double a = 911798473; // assigning some randomly chosen constants to begin math functions
const double aa = 143314345;
const double aaa = 531432117;
const double aaaa = 343211418;
unsigned int i; //loop counter
double b; //variables that will be used for storing square roots
double bb;
double bbb;
double bbbb;
b = a; //assign some large values to the variables in order to start finding square roots
bb = aa;
bbb = aaa;
bbbb = aaaa;
double score; // score
double time1; //how much time the program took
clock_t start, end; //stopwatch timers
start = clock();
for (i = 1; i <100000001; i++)
{
b=sqrt (b);
bb=sqrt(bb);
bbb=sqrt(bbb);
bbbb=sqrt(bbbb);
if (b <= 1.0000001) {b=b+i+12.432432432;}
if (bb <= 1.0000001) {bb=bb+i+15.4324442;}
if (bbb <= 1.0000001) {bbb=bbb+i+19.42884;}
if (bbbb <= 1.0000001) {bbbb=bbbb+i+34.481;}
}
end = clock();
time1 = ((double) (end - start)) / CLOCKS_PER_SEC * 1000;
score = (10000000 / time1); // Just a way to give a "score" instead of just time elapsed.
// Baseline calibration is at 1000 points rewarded for 10000ms delay...
// In other words if you finish 5 times faster, say 2000ms, you get 5000 points
printf("\nFinal number: %0.16f", (b+bb+bbb+bbbb)); // The number that resulted from all the math functions - useful for checking math accuracy from unsafe optimizations
if (b+bb+bbb+bbbb > 4.0000032938759028) {printf(" Result [INCORRECT - 4.0000032938759027 expected]");} //checking result
if (b+bb+bbb+bbbb < 4.0000032938759026) {printf(" Result [INCORRECT - 4.0000032938759027 expected]");} //checking result
printf("\nTime elapsed: %0.0f msecs", time1); // Time elapsed announced to the user
printf("\nScore: %0.0f\n", score); // Score announced to the user
return 0;
}
-----end code ----
(the above generates, consistently, 4 sqrtsd instead of 2 sqrtpd, at -O3 and
PGO).
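
For comparison, a minimal sketch of the packed form, written with SSE2 intrinsics rather than the hand-written asm mentioned above (that asm is not shown in this comment). The helper name sqrt4_packed and the idea of keeping the four values in one array are assumptions made for illustration.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_sqrt_pd and friends */
#else
#include <math.h>        /* portable fallback; may need -lm when linking */
#endif

/* Take the square root of four doubles in place.
 * With SSE2 this is two sqrtpd instead of four sqrtsd. */
static void sqrt4_packed(double v[4])
{
#if defined(__SSE2__)
    __m128d lo = _mm_loadu_pd(&v[0]);      /* v[0], v[1] */
    __m128d hi = _mm_loadu_pd(&v[2]);      /* v[2], v[3] */
    _mm_storeu_pd(&v[0], _mm_sqrt_pd(lo));
    _mm_storeu_pd(&v[2], _mm_sqrt_pd(hi));
#else
    for (int j = 0; j < 4; j++)
        v[j] = sqrt(v[j]);                 /* scalar fallback, no packing */
#endif
}
```

In the benchmark loop, b, bb, bbb and bbbb would live in v[0] to v[3], with the four `if` checks following unchanged. The sketch only shows the instruction selection gcc is not making by itself; it makes no claim about the size of the speedup.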